Writer Profile

Kiyonori Nagasaki
Professor, Department of Library and Information Science, Faculty of Letters
2024/11/05
Introduction
A quarter of a century has passed since the practice of digitizing and publishing materials, known as digital archives (hereafter DA), began to spread. I feel that the number of people taking on this work is steadily increasing, as evidenced by the establishment of the Japan Society for Digital Archive (JSDA), which provides a forum for practitioners and researchers to interact and explore the future of DA. While there are various definitions of DA, it is ultimately an endeavor centered on digitizing, publishing, and sharing various types of materials; the people involved are therefore as diverse as the materials themselves. Even so, digital technology and the legal systems surrounding it offer common ground for discussion, and various communities, including academic societies, are forming and carrying out their activities around these two axes.
Looking at recent issues of the JSDA journal, Akihiko Takano's "Three Values of DA*1" reiterates the "Three Values: Important Roles of Digital Archives" presented in the "Japan Search Strategic Policy 2021-2025: Making Digital Archives Part of Everyday Life." These are (1) the succession and reconstruction of records and memories, (2) a common knowledge infrastructure that supports communities, and (3) the formation of new social networks. Due to space constraints I will not go into detail, but as I understand it, these values share one important element: when the contents of the knowledge infrastructure collaborate, people come to collaborate as well, and a better social network is formed as a result.
DA becomes a topic of conversation when high-profile content is released, but as a whole, the vast majority of it is inconspicuous, waiting for someone to discover it. It continues to be preserved and published for the possibility that someone will find value in it someday. In fact, even items that do not have great value individually may gain value by being aggregated, by being positioned within a community, or by collaborating with various other materials.
What is DA Collaboration?
The first things that come to mind regarding collaboration are portal sites that enable cross-searching of metadata, such as Japan Search and Europeana. In recent years, the possibility of even finer-grained collaboration has opened up. Here, I would like to briefly introduce the background and current status of this.
Fine-grained collaboration refers to a mechanism that makes it easy to add annotations, from both inside and outside the DA site, at a unit smaller than a single document or item, and that keeps the results of such intellectual work as sustainable as possible.
Adding annotations from inside and outside requires software that makes this possible. In the past, this was often a special mechanism prepared by a company, researcher, or developer. However, if annotations are made with custom-made software, a site that wants to collaborate with an external site can do so only if both parties install the same software. Even a site that implements annotation on its own faces a problem: whenever a system update becomes necessary, the results of the intellectual work of annotation become unusable unless software from the same developer is installed again. Moreover, if that software is upgraded and loses compatibility with previous versions, or if its development ends, the results may be lost. The confusion caused by Adobe's discontinuation of Flash, for example, is still fresh in our memory. Paper books, the results of intellectual work published on paper, can be viewed at the National Diet Library at almost any time. By comparison, intellectual work in DA does not yet seem to be in a state where we can feel secure about its sustainability at the field level, whatever advanced initiatives and theoretical discussions may exist.
As a measure to avoid such situations, it is widely practiced in various fields to separate data and software and create data in a standard and open format. For example, the data formats for Microsoft Word, Excel, and PDF are published as international standards, and various software programs that can use the same formats are widespread. Standardizing data formats while making them public is an important factor for increasing sustainability by making data usable in various software and reducing dependence on specific individuals or companies.
DA Collaboration via IIIF
As a result of the widespread use of the Web, various contents can be viewed with a single piece of software called a Web browser across DA as a whole. Furthermore, in recent years, there has been a growing movement both domestically and internationally to standardize data formats in a way that better suits the appearance and content of DA. Of particular note is IIIF (International Image Interoperability Framework). What this standardizes is the ability to "specify partial positions or regions in various contents published on the Web using an internationally common data format."
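As a rough illustration of what "specifying a region in a common data format" can look like, the following minimal sketch builds an annotation in the W3C Web Annotation format, which IIIF uses for annotations, and points it at a rectangle on a page by appending a "#xywh=" fragment to the target. The canvas URI, the coordinates, and the function name region_annotation are hypothetical values chosen only for this example.

```python
import json


def region_annotation(canvas_uri, x, y, w, h, comment, language="en"):
    """Build a Web Annotation (as a dict) that comments on a rectangle of a canvas."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "commenting",
        "body": {"type": "TextualBody", "value": comment, "language": language},
        # The fragment "#xywh=x,y,w,h" selects a rectangle, in pixels,
        # measured from the top-left corner of the canvas.
        "target": f"{canvas_uri}#xywh={x},{y},{w},{h}",
    }


# Hypothetical canvas URI and coordinates, for illustration only.
annotation = region_annotation(
    "https://example.org/iiif/manuscript/canvas/p12",
    850, 1200, 400, 520,
    "Location of a miniature that was cut out of this leaf.",
)
print(json.dumps(annotation, indent=2))
```

Because the annotation is plain JSON in a published format, it does not depend on the software that produced it, and any compliant viewer should be able to interpret the target.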
This makes it possible, for example, to specify a location where a miniature has been cut out from a Western medieval manuscript published on one site, and display the image of the corresponding miniature published on another site perfectly aligned with the cut-out location on a Web browser. With this standard, it is also possible to extract only the relevant parts of information from different sites and combine them to create new content. A typical example is the famous facial expression collection, which extracts faces from art works of all times and places, centered on Japanese picture scrolls, and applies annotations to each. At the time of writing, 9,675 face images have been extracted from 108 works and are published as a dataset that anyone can use for research.
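Extracting a part of an image, such as a single face, can be sketched with the URL pattern of the IIIF Image API, in which the region, size, rotation, quality, and format are written directly into the URL. The server address, the image identifier, and the coordinates below are hypothetical, and the helper names are my own; this is a sketch of the URL pattern, not the method actually used to build the collection.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Compose a URL of the form
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # Slashes inside an identifier must be percent-encoded.
    safe_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{safe_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def pixel_region(x, y, w, h):
    """Region given as x,y,w,h in pixels from the top-left corner of the image."""
    return f"{x},{y},{w},{h}"


# Hypothetical server, identifier, and coordinates, for illustration only.
face_url = iiif_image_url(
    "https://example.org/iiif/3",
    "scroll-01/page-007",
    region=pixel_region(1520, 640, 300, 300),
    size="256,",  # scale the cropped region to a width of 256 pixels
)
print(face_url)
```

Since the cropping is expressed entirely in the URL, a site other than the publisher can reference just the relevant part of an image without copying or editing the original file.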
The spread of IIIF has made it possible to freely utilize Web content from around the world, thereby further expanding the potential to increase the value of Web content. In the early days, the expression "releasing from silos" was often used. Such a standard seems to have been devised because individual Web contents were confined within their own sites, and linking them incurred high costs with no guarantee of success, at a time when ways of further increasing the value of individual contents were being sought.
Internationally, IIIF has been adopted by the libraries of many leading universities in Europe and the United States for the Web publication of rare materials, and it has also been adopted by national libraries in several countries, including France, the UK, the US, and Germany. In Japan, organizations that publish large-scale content, such as the National Diet Library and the National Institute of Japanese Literature, have adopted it, so the number of IIIF-compliant contents in Japan is quite large. Incidentally, the Keio University Media Center also adopted IIIF, which seems to have been the first example among university libraries in Japan.
By publishing in compliance with IIIF, DA can increase the possibility of being given new value by freely linking content in various contexts, from individual items to the level of each part of the content. For details, please refer to "Digital Archives Opened by IIIF" (Bungaku Report), which we published this year.
TEI for Textual Materials
While IIIF is a standard for content collaboration regardless of the field, there are also various DA-related standards that increase utility by specializing in a field. Here, we focus on the TEI (Text Encoding Initiative) guidelines, a data format focused on the humanities, particularly text research. This is because textual materials such as classical books and ancient documents currently account for a large portion of DA, and if we are to consider their availability and collaboration potential, a standard that primarily targets such materials is useful.
Work on the TEI guidelines began in 1987, led by a group of researchers, mainly in the humanities and information science, from Europe and the United States. For over 30 years since then, the guidelines have been sustained by a community centered on humanities researchers. Currently, the TEI Technical Council leads the revision of the guidelines, which takes place approximately once every six months.
Even within the field of textual research in the humanities, there are various research methods, and the points of focus vary accordingly. Even when looking at the same textual material, depending on the field or interest, one may be interested in external aspects such as the format of the material, the quality of the paper, or the typeface of the characters, or in internal aspects such as the content of the text, the proper nouns that appear, or the part-of-speech information for each word. Creating a common data format across such diverse humanities disciplines is no easy task, and overcoming this difficulty to formulate a common format is what TEI aims for. This initiative is more than a matter of applying digital technology or developing DA: it can also lead to discussions on methodology in the humanities, and is interesting as a cross-disciplinary initiative within the humanities.
The Problem of Multilingualism
Another issue that has become important in the TEI community in recent years is multilingualism. Although there are many participants from outside the English-speaking world, the TEI guidelines themselves are written in English, and related discussions are also conducted primarily in English. Some point out that the guidelines implicitly assume materials written in English. The community is working on internationalization and multilingualization, and translations of the descriptions of tags and other elements into seven languages, including Japanese, have already been published. However, because of the volume of the guidelines as a whole and the expertise they require, no comprehensive translation has been published in recent years. The TEI community itself had never held an annual conference outside of Europe or North America until it held one in Tokyo in 2018.
In the multilingualization of TEI, it is necessary to respond in terms of both content and practicality. On the practical side, Japanese translations of frequently used guidelines and tutorials are required. On the content side, it is difficult to apply the TEI guidelines, which were formulated assuming materials in Western languages, directly to Japanese classical books and ancient documents. Solving this challenge is not easy, but if it can be overcome, it will become possible to conduct cross-sectional analysis and share tools in a form compatible with many digitized textual materials in Europe and the United States. This can also contribute significantly to the utilization of research data, which is a major trend in current scholarly information distribution.
I began working on this around 2006, and ten years later, in 2016, we were able to establish the East Asian/Japanese Special Interest Group, the first subcommittee within the TEI Consortium devoted to a specific linguistic region. Building on the discussions in this subcommittee, and through further discussions at annual conferences, with the Technical Council, and on GitHub, rules for ruby (furigana), which is frequently used in Japanese, were added to the TEI guidelines five years later, in 2021*2. The trend toward multilingualization is gradually strengthening, as seen in the establishment of the Indian Texts Special Interest Group in 2017. The movement from Japan has also encouraged researchers working on Indian materials, and this seems to be an area where Japan, as a country in which the humanities developed relatively early in the non-Western world, can continue to contribute internationally.
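For readers unfamiliar with what the ruby rules look like in practice, here is a minimal sketch that builds a ruby structure using the TEI elements ruby, rb (the base text), and rt (the reading). The sample word and the function name tei_ruby are my own choices for illustration; real encodings may add attributes, such as the placement of the reading, that are omitted here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (as the default namespace).
ET.register_namespace("", TEI_NS)


def tei_ruby(base_text, reading):
    """Return a TEI <ruby> element pairing a base text with its reading (furigana)."""
    ruby = ET.Element(f"{{{TEI_NS}}}ruby")
    rb = ET.SubElement(ruby, f"{{{TEI_NS}}}rb")  # the annotated characters
    rb.text = base_text
    rt = ET.SubElement(ruby, f"{{{TEI_NS}}}rt")  # the ruby text itself
    rt.text = reading
    return ruby


# Sample word chosen only for illustration.
element = tei_ruby("和歌", "わか")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Encoding the base text and its reading as separate elements lets software search, display, or hide the readings independently of the main text.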
The use of the TEI guidelines in DA has only just begun in Japan, and I look forward to its future expansion. Many of the ancient documents and classical books whose images are published in DA are written in classical Chinese or in cursive script (kuzushiji). General viewers may well be unable to understand the meaning even if they can read the characters, or may be unable to read the characters in the first place. It is therefore desirable to be able to add text data or provide modern Japanese translations. Adding such new content to DA will also increase its value. For details on TEI, please refer to "Introduction to Text Data Construction for the Humanities" (Bungaku Report), which we published last year.
DA Collaboration through the Combination of TEI and IIIF
Regarding collaboration with images in particular, text encoded in accordance with the TEI guidelines can be linked to, and displayed alongside, any part of a IIIF-compliant image. For example, in the TEI-compliant edition of the Iwashimizu-sha Uta-awase, it is possible to read the text data of manuscripts published by the National Archives of Japan and Gunma University, and for passages where the two differ, the corresponding parts of the IIIF-compliant images can be displayed. In other words, without any further effort on the part of the publishers, the DA images published by each institution are being used independently by waka literature researchers and provide value as an important element of academic content. Checking how something is written in the original manuscript with a single click, without traveling to see it or searching for the relevant passage from the beginning, cannot match the weighty, dense experience of examining the materials in person. Even so, it opens up various new possibilities, such as viewing materials from a slightly distant field properly with very little effort, or serving as an entry point for education in such research methods.
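The kind of link described here can be sketched as follows. In TEI, a zone element records a rectangle on a page image (ulx, uly, lrx, lry), and a text element points to it with a facs attribute; the rectangle can then be converted into a IIIF image request. The sample document, its URL, and its coordinates are hypothetical, and I assume for simplicity that the url of the graphic element holds the base URI of a IIIF image service. The actual encoding used in the Iwashimizu-sha Uta-awase may differ.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface>
      <graphic url="https://example.org/iiif/3/uta-awase-p05"/>
      <zone xml:id="z1" ulx="1200" uly="300" lrx="1500" lry="2100"/>
    </surface>
  </facsimile>
  <text><body><l facs="#z1">(line of a poem)</l></body></text>
</TEI>"""


def zone_image_urls(tei_xml):
    """Return (text, IIIF image URL) pairs for elements whose @facs points to a zone."""
    root = ET.fromstring(tei_xml)
    zones = {}
    for surface in root.iter(f"{TEI}surface"):
        graphic = surface.find(f"{TEI}graphic")
        if graphic is None:
            continue
        base = graphic.get("url").rstrip("/")  # assumed to be a IIIF image service URI
        for zone in surface.iter(f"{TEI}zone"):
            x, y = int(zone.get("ulx")), int(zone.get("uly"))
            # TEI stores two corners; the IIIF region needs x, y, width, height.
            w, h = int(zone.get("lrx")) - x, int(zone.get("lry")) - y
            zones[zone.get(XML_ID)] = f"{base}/{x},{y},{w},{h}/max/0/default.jpg"
    pairs = []
    for element in root.iter():
        facs = element.get("facs", "")
        if facs.startswith("#") and facs[1:] in zones:
            pairs.append((element.text, zones[facs[1:]]))
    return pairs


for text, url in zone_image_urls(SAMPLE):
    print(text, url)
```

The text data only stores a reference and four numbers, so the images themselves stay on the publishing institution's server and need no changes on the publisher's side.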
An example of publishing DA images with not only the original classical text but also modern Japanese and English translations is the "Juban Mushi-awase Emaki" (Ten-round Insect Poetry Contest Scroll) published in March of this year. This also links TEI-compliant text to IIIF-compliant images, where the pictures in the scroll corresponding to the Waka are displayed, and furthermore, clicking on any of the three texts displays and highlights the corresponding part. Not only from a technical standpoint but also from a content standpoint, modern Japanese and English translations connect DA content to people who cannot read classical Japanese but can read modern Japanese, and to people who can read English, respectively. Technical collaboration also contributes to connecting people. People connected through this content may contribute to this field in some way in the future. If that happens, a virtuous cycle will be formed in which the technical and content aspects mutually enhance each other.
In this way, DA created in a standard data format can serve as a core that supports collaboration technically, in terms of content, and in terms of people. In the future construction and operation of DA, further promoting this direction will form a strong and rich foundation that encourages better sharing of knowledge and supports the formation of social networks.
*Affiliations and titles are as of the time of publication.