Why Does Digital History Need Diachronic Semantic Search?

Barbara McGillivray, Federico Nanni, and Kaspar Beelen

Searching for Meaning in Research

Situated at the intersection between historical research, library and information studies, information retrieval (IR), and natural language processing (NLP), searching digital heritage collections connects the humanities with computing. As large digital collections become available to scholars, the way they are searched is as important as ever and should be attuned to each discipline’s research practices. In this chapter, we focus on the practices of digital history research, and therefore our examples will be drawn from this discipline. In addition to the specific research practices adopted by the discipline under investigation, search of historical digital collections should take into account the longitudinal nature of the data (the fact that the data were gathered over a period of time) and the diachronic focus of the research questions (the study of the evolution of a phenomenon over time, which is a major area of interest in historical research). The affordances of search shape the scholarly interaction with (and the interpretation of) historical collections: what cannot be searched for cannot be found. In this sense, it is critical to understand how the activity of search is bounded by the digital infrastructure on which it operates. In this chapter, we raise the debate on how search might be conceived and approached differently to meet the needs of digital historians. We do so by focusing on the broad and intangible but omnipresent category of diachronic semantic search—that is, search that is centered around meaning as it changes over time. In particular, we investigate how this highlights specific aspects of a corpus (compared to traditional keyword-based systems) or casts shadows on other parts of a digital collection.
Semantic search lets us reach the level of meaning of a text (or at least of its words), thus allowing historians to conduct their research without reducing the concepts and ideas they intend to explore to mere combinations of keywords. Such a system would find relevant documents even if they do not explicitly mention the terms typed. A mismatch between keywords and documents is a well-known challenge in digital history and information retrieval as a whole. It would, for instance, occur when searching for diachronically sensitive words—for example, the term partigiani (partisans) in the historical archive of the Italian newspaper La Stampa. This search would lead to very few results in the entire period of the Second World War because the newspaper was aligned with the government of the National Fascist Party and instead used words such as “bandits” or “communists” when describing specific events.1 False positives are another problem often encountered with traditional string-based search systems. As many words are polysemous, a mismatch can appear between the meaning projected on a query and the results it returns.

While historical documents from digital collections can typically be searched at a superficial level (through character- or string-matching) or through keywords, semantically enhanced exploratory search systems are not yet available or are still highly constrained. This is critical, as many words are polysemous (i.e., have more than one meaning). For example, “depression” can refer to a mental health condition or to a lowering of the atmospheric pressure. If we search for the string “depression” in a collection with the aim to study the mental health condition, we are presented with instances of both meanings, and we need to read each context to tease them apart. For small collections, this may be time-consuming but feasible, but for larger collections, it is likely to be impossible. Moreover, current search systems do not provide chronological depth because they do not allow users to retrieve meaning information that is time-sensitive. This is an important gap when dealing with historical texts, as the meaning of words is constantly subject to change. Take, for example, the word “tablet,” which has traditionally meant “a smooth stiff sheet for writing on” (Oxford English Dictionary, entry “tablet, n.”), but since the beginning of the twenty-first century, the word has also referred to “a small portable computer in the form of a flat tablet” (Oxford English Dictionary, entry “tablet, n.”). Diachronic semantic search would allow us to identify the different meanings of a search term like “tablet” and connect them with their textual instances.

The semantic web community has provided definitions of semantic search (Guha, McCool, and Miller). In information retrieval contexts, this term broadly refers to approaches for retrieving and organizing documents in ways that go beyond the superficial level of strings and tokens. In this chapter, the term “semantic search” refers to a framework for navigating text collections that draws on information derived from (textual) data enrichment. Text enrichment adds meaning by bringing in external information from knowledge bases or surfacing and embedding information in a shared semantic space. This information can be expressed in multiple forms: from automatically generated metadata (e.g., genre classification) to entity linking (i.e., links between named entities such as names of people or places and an external knowledge base, for instance, Wikidata) or distributional semantics (i.e., the set of methods that extract meaning information from word usage statistics in texts).

While understanding the central meaning of a user query has been a long-term ambition of information retrieval systems and digital libraries (Croft et al., “Query Representation and Understanding Workshop”), addressed over the decades through various models of the latent semantics of an information need (Deerwester et al., “Indexing”; Mitra and Craswell, “Neural Models,” among others), search becomes more complex in digital humanities contexts, where profound semantic changes between the context of the researchers posing a query and the period when the corpus was created play a major role. For example, the concept of nation changed from an economic category to a political and cultural category during the nineteenth century. This concept, which encompasses a sense of community but also geographical proximity and shared cultural values, is captured by many more words and expressions than just the word “nation,” and this has changed over time (Hengchen, Ros, and Marjanen, “A Data-Driven Approach to the Changing Vocabulary”). It is currently impossible to find instances of a concept like nation in texts from a certain historical period without specifying all relevant words and at the same time considering that certain words have changed their meaning in time. Recent NLP research has shown the potential of computational models of semantic change that complement manual approaches (cf. Kutuzov et al., “Diachronic Word Embeddings and Semantic Shifts”; Tahmasebi et al., “Survey of Computational Approaches to Lexical Semantic Change”). However, such systems are still far from addressing the needs of humanities researchers, in spite of some attempts at addressing this (e.g., McGillivray et al., “A Computational Approach to Lexical Polysemy in Ancient Greek”). This situation is, to some extent, similar to the disconnect between NLP research and literary studies, highlighted in David Bamman’s chapter, “Born Literary Natural Language Processing,” in this volume. 
We believe that research in computational semantics will benefit from a closer interaction with digital historians with the aim of improving current search capabilities.

This chapter opens a debate on the importance of diachronic semantic search for digital history. We argue that this is a very challenging task to address for the digital history field and a problem that should not be underestimated by historians. We do not aim to offer a comprehensive overview of the field. For this reason, we offer examples for each issue we discuss, and we present them as suggestions to the reader with the aim to offer a way to dig deeper into the various aspects of our argumentation. We believe it is necessary for digital history to focus on improving computational search systems to efficiently explore large diachronic and longitudinal text collections; therefore, we present a series of recommendations for future projects. We start by offering an overview of the challenges that diachronic semantic search poses for digital humanities research, from identifying shifts in word meaning to the difficulties of deriving diachronic representations of a domain of knowledge.

We argue that search in the humanities is more than document retrieval: it is intrinsically intertwined with a process of sense-making and interpretation and therefore interfaces with hermeneutics (Putnam, “The Transnational and the Text-Searchable”; Guldi, “Critical Search”). Among other things, search assists historians with demarcating the boundaries of a concept, and there is a long tradition of defining, tracing, and analyzing concepts in history, led by many prominent scholars, including the History of Concepts Group.2 The digital affordance (i.e., the means by which humanists can explore and navigate an archive) influences their interpretation and results. When confronted with large collections, the scholar’s information need is often vaguely defined. Similarly, the target of the search emerges in conversation with historical documents. As the initial query forms a rough approximation of a more general concept, semantically enhanced search can foreground relations between documents, capture how users understand specific concepts, or extend and refine the scope of a search operation. Alex Olieman and colleagues (“Good Applications for Crummy Entity Linkers?”), for example, show how these issues play a role when searching for documents that relate to (i.e., refer to) historical periods such as the Dutch Golden Age, which is a container collating various persons and events into an overarching colligatory concept. Semantic search can assist the scholar with articulating what the concept comprises by showing potentially relevant entities (based on relations as encoded by a knowledge base), which can then guide the search and allow the user to demarcate their information needs by (de)selecting entities and documents.

In this chapter, we also focus on the importance of semantic change for retrieving historical documents, as semantics resides in the corpus as well as in the language of the query itself. All aspects of language are subject to change over time, of course, and not just semantics. Spelling, morphology, syntax, for example, are elements that tend to change at different rates, and this change needs to be accounted for in search. In this chapter, we focus on semantic change (or lexical semantic change) in words, because we believe this has particularly profound implications for the role of search in historical research. Our proposal advocates for search engines that can meet the needs of historical research. Unlike commercial search engines, which are optimized for navigation and precision of retrieval, historical practice requires more recall-oriented results, suitable for poorly defined and variable information needs, often without an initial knowledge of how the desired phenomenon is described in a historical period. The practice of search for historians, therefore, needs specific design and intentional tool building. This perspective leads us to explore issues about the overall epistemology of search and its distinction from a collection of specific technologies. While many of the other essays in this volume reflect the systems, collaborations, and institutional structures that support digital humanities (DH) work, we also reflect on what would be needed to support diachronic semantic search. We believe our suggestions are relevant especially when read in the context of other chapters in this book that touch on the usage or development of a search interface, such as “Freedom on the Move and Ethical Challenges in the Digital History of Slavery” by Vanessa Holden and Joshua D. Rothman in this volume. Their interface showcases the benefits and affordances of semantic search. 
By including semantic enrichments (as conscientiously recorded by volunteers), users of the database can explore the collection of runaway advertisements in nuanced and complex ways that go beyond simple keyword search. For example, they can include metadata about the ad, as well as information about the enslaved individual (or enslaver), to find content and study the language and history of slavery. In this sense, semantic search provides a technique to traverse historical data based on the rich and multifaceted biographical information embedded in runaway ads. Moreover, diachronic semantic search could also provide suggestions for query terms that are specific to the historical period and unfamiliar to us today.

Why Search?

Search is a common practice in historical scholarship, which is usually based on identifying sources in archives. The process of searching and the archives themselves have changed substantially in the last forty years, with the advent of the personal computer, then the internet and the web, Google search, digital libraries, and the Internet Archive (Graham, Milligan, and Weingart, “Exploring Big Historical Data”; Story et al., “History’s Future”). Nevertheless, while the way that today’s historians access sources is almost always mediated by a search tool, a digital archive, an advanced search interface, or complex queries, this topic tends to be discussed only marginally in digital history.

While historians tend to prefer “browsing” (digital) sources (Allen and Sieczkiewicz, “How Historians Use Historical Newspapers”), methodologically, the use of keyword search has become an accepted, frequent, and almost unremarkable activity (as opposed to, for example, the use of text mining methods that are currently practiced only by a very small number of historians). The debate on macro/quantitative versus micro/qualitative approaches usually receives more attention, even though few historians engage with data mining and many use search boxes to craft their argument. See, for instance, how often topics such as culturomics (Michel et al., “Quantitative Analysis of Culture”) and distant reading (Moretti) are mentioned as part of the Debates in the Digital Humanities series, compared to discussions related to information retrieval.

The act of search, which is different from the act of modeling (e.g., plotting word-frequency over time), generally remains in the background.3 A few papers have, however, explicitly questioned the impact of search on historical scholarship. Digital search opens, according to Lara Putnam (“The Transnational and the Text-Searchable”), “shortcuts that enable ignorance as well as knowledge” because the ease with which scholars can now interrogate online collections disconnects transnational history “from place-based research practices that have been central to our discipline’s epistemology and ethics alike” (Putnam, 379). The speed and granularity enabled by full-text indexing facilitate discovery but risk the loss or underappreciation of contextual knowledge needed to properly interpret the found materials. Putnam is concerned less with the technicalities of search than with how the general affordances of online access to historical sources can shape research practices for better or worse. Nonetheless, a major distinction with archival search—as is often repeated—is that historians can now “find without knowing where to look” (Putnam, 377). An interesting case, in this respect, is Osmond (2015), who, following Featherstone (2013), characterizes search as “chasing snippets and shadows.” He shows how keyword queries combined with a “queer eye” allow him to trace often implicit mentions and representations of homosexuality in the sporting press before the 1950s. Also reflecting on the digitized press, Brake (“Half Full and Half Empty”) emphasizes how the affordances of the interfaces mediate and structure access to databases, noticing how they mostly invite users to navigate a collection using keyword search but are less amenable to browsing series or titles. As these studies show, retrieving information is a critical part of historical analysis, and digitization is proving increasingly influential.
Understanding the process by which historians collect materials from an archive is fascinating because it has at least three layers of complexity, which are the focus of the remainder of this section.

First, historians are interested in complex, abstract, and (often) latent concepts. This contrasts with fact-oriented queries that we are used to from popular web search engines like Google. For instance, historians might be interested in concepts that do not appear in the textual content because it was illegal to mention them or that are referenced differently based on the political affiliation or the cultural background of the writer. Clearly, this type of research is not adequately addressed by the string-based search functionality currently available in archives. It is therefore particularly important to give historians a more advanced form of search that can reach the level of meaning of the texts. This is what we refer to as semantic search. Second, historians work with historical collections, which typically span a long period of time. As language changes constantly, the change in the meanings of words is a crucially important factor to be accounted for whenever texts are searched, analyzed, and interpreted. This phenomenon is called semantic change in linguistics and concept drift in other disciplines.

Finally, aware of this semantic shift, when retrieving information from an archive, historians attempt (as much as they can) to adapt their query to the period under investigation and the perceived contemporaneous meanings. Nevertheless, the meaning of words today still plays a relevant role in the way queries are formulated and documents are interpreted, especially to assess the relevance of the retrieved results. Scholars might say that an expert should simply know their subject and its contextual meanings; this, we argue here, is only partially true, and diachronic semantic search approaches could assist the expert in challenging their assumptions while exploring a collection. This approach is in line with what Hannah Ringler describes in this book in the chapter on computation and hermeneutics: Semantic search is a way of “asking questions with” computational methods and, by interacting with the results of the tool, a way of “asking questions about” computational methods. Nevertheless, it is important to keep in mind that searching is a completely different activity from reading. What search lacks in deep immersive understanding is gained in processing speed. Therefore, when discussing search, the focus should not only be on how a computational method is applied to a collection but especially how the same method helps the user define the query. Moreover, query rewrite logs are one of the best signals used by search engines for disambiguation and fine-tuning. As users, we often end up writing a query on Google repeatedly to point the tool in the direction we want to go. This is helpful because it guides us to define more explicitly what we are looking for.

Approaches toward Semantic Search

Capturing the meaning expressed in textual materials has been one of the overarching ambitions of researchers in NLP, IR, semantic web, and DH. This would allow, among many other things, retrieving information relevant to a user query beyond simple string matching: for example, as mentioned previously, Olieman and colleagues (“Good Applications for Crummy Entity Linkers?”) attempt to find entities related to the concept of the Dutch Golden Age, such as painters and buildings related to this period. But there are other strategies to move from the textual to the semantic level. Here, we describe what we see as three core approaches for modeling (word) semantics that have emerged across the research community over the last decades.

The first set of approaches includes initiatives for enriching a collection of documents with layers of (semantic) meta-information. Since its origins, the DH community has deeply engaged in such approaches. This is exemplified by the body of work related to the Text Encoding Initiative (TEI), whose major accomplishment has been the creation of a framework for enriching documents with extensible markup language (XML) schemas (Cummings). These layers of semantic annotation are often modeled to encode the latent or implicit structure of a document and allow researchers to formulate more complex queries, searching only through specific parts of the document. However, manually encoding documents is labor intensive, and automatic annotation of documents requires a significant computational effort. Another form of enrichment results from the application of automatic tagging, which is often implemented by NLP tools that generate morpho-syntactic annotations (i.e., annotations of the part of speech of each word, such as noun or verb, and of the structure of sentences). Linguistic annotation allows scholars to disambiguate and refine their queries—for example, by defining a combination of tokens and part-of-speech tags such as “book [VERB].” Newer interfaces such as Nederlab (Brugman et al.) allow researchers to create detailed query patterns to search diachronic heritage collections.
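The token-plus-tag pattern “book [VERB]” mentioned above can be illustrated with a minimal sketch. It assumes a corpus that has already been tagged; the three sentences, the tag labels, and the function name find_token_with_tag are invented for illustration and do not reflect how Nederlab or any other interface is implemented.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, part-of-speech tag)

# Hypothetical, hand-tagged sentences standing in for an automatically
# annotated corpus; the tag labels are illustrative.
TAGGED_SENTENCES: List[List[Token]] = [
    [("they", "PRON"), ("book", "VERB"), ("a", "DET"), ("passage", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("was", "AUX"), ("printed", "VERB")],
    [("we", "PRON"), ("booked", "VERB"), ("rooms", "NOUN")],
]


def find_token_with_tag(sentences: List[List[Token]], word: str, tag: str) -> List[int]:
    """Return the indices of sentences in which `word` carries the tag `tag`."""
    hits = []
    for index, sentence in enumerate(sentences):
        if any(form.lower() == word and pos == tag for form, pos in sentence):
            hits.append(index)
    return hits


# The same string is retrieved separately for its verbal and nominal uses.
verb_hits = find_token_with_tag(TAGGED_SENTENCES, "book", "VERB")
noun_hits = find_token_with_tag(TAGGED_SENTENCES, "book", "NOUN")
print(verb_hits, noun_hits)
```

Because this sketch matches surface forms only, the inflected form “booked” in the third sentence is not retrieved; a lemma layer would be needed for that.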

A core aspect of these approaches to modeling semantics is that they rely on the existence of a structured representation of a specific domain of knowledge. For this reason, such research is highly intertwined with the advent and affirmation of the semantic web and the overall goal of a computationally processable way of representing knowledge (Berners-Lee, Hendler, and Lassila). For instance, the Digging into Linked Parliamentary Data project offers a combination of a structured knowledge base and documents semantically enriched with information derived from the knowledge base itself.4 This project tackled the double goal of semantically encoding implicitly structured documents (such as debates in parliament, which follow a distinct pattern of speech turns) and linking them to relevant knowledge bases (such as DBpedia, but also domain-specific parliamentary databases).

A related research area aims at automatically enriching documents with links to the information contained in the knowledge bases. Such efforts, embodied, for instance, in so-called entity-linker tools (Shen, Wang, and Han, “Entity Linking with a Knowledge Base”) have reached the digital humanities community.5 Such approaches allow the retrieval of occurrences of a specific entry from a knowledge base (containing, for instance, people or places) in the corpus. Moreover, these approaches make it possible to execute more complex structured queries—for instance, to find colligatory concepts such as the French Revolution that encompass a specific set of people, places, and events. WideNet (Olieman et al.) exploits the structure between entities to generate queries that aim to capture a diverse set of entities related to a concept and retrieve relevant documents. However, in addition to the focus on names of entities such as people and places rather than common nouns to denote concepts, applying knowledge bases risks introducing contemporary biases to historical data. For instance, some people were largely unknown to contemporaries and only acquired fame posthumously, which risks inducing a certain recency bias in our knowledge bases by overemphasizing information that is important from a present-day (but not necessarily from a historical) perspective. Entity linking, when not properly trained and evaluated, can amplify such biases and project them on historical data by, as an example, erroneously associating the string “Van Gogh” with the (not then famous) painter Vincent. Other projects, such as Impresso (Ehrmann et al., “Introducing the CLEF 2020 HIPE Shared Task”), have made substantial improvements regarding historical entity linking for newspaper data.6

The second group of approaches to modeling semantics that we consider here is from the information retrieval community (Schütze, Manning, and Raghavan). Instead of relying on a structured representation of knowledge and a collection where occurrences of entities have been linked to a knowledge base, such approaches mostly focus on better modeling the information needed by the user by expanding the query with additional semantic information. Here we highlight a few ways in which the IR community tackles this problem. One way, which bridges information retrieval and semantic web approaches, involves enriching the user query with entity links in the same way that the corpus has been previously enriched (Dalton, Dietz, and Allan, “Entity Query Feature Expansion Using Knowledge Base Links”) and in line with the things-not-strings Google approach.7 A different way of capturing/expanding the meaning of the query would be to have a so-called implicit relevance feedback from the corpus: retrieving an initial set of documents that are relevant to the query, identifying a series of frequently occurring words from the set, and running a second query on the corpus (Xu and Croft, “Query Expansion Using Local and Global Document Analysis”). A final way would be to keep the user in the loop and ask for further specification on the initial query, both through a “Do you mean this?” prompt and through a filtering process of possibly related words, entities, or concepts.
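The feedback loop just described (retrieve an initial set, collect its frequent words, run a second query) can be sketched in a few lines. This is a simplified illustration, not the method of Xu and Croft: the four documents are invented, matching is plain term overlap, and the names retrieve and expand_query are our own.

```python
from collections import Counter

# Invented toy documents; they only illustrate the mechanics.
DOCUMENTS = {
    "d1": "partisans attacked the railway near the village",
    "d2": "bandits attacked the railway and the bridge",
    "d3": "the harvest festival opened in the village square",
    "d4": "partisans and rebels damaged the railway bridge",
}
STOPWORDS = {"the", "and", "near", "in", "a"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [token for token in text.lower().split() if token not in STOPWORDS]


def retrieve(query_terms, documents):
    """Return the sorted ids of documents sharing at least one term with the query."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, text in documents.items() if wanted & set(tokenize(text)))


def expand_query(query_terms, documents, n_terms=2):
    """Add the n_terms most frequent non-query words from the first-pass results."""
    counts = Counter()
    for doc_id in retrieve(query_terms, documents):
        counts.update(t for t in tokenize(documents[doc_id]) if t not in query_terms)
    # Sort by descending count, then alphabetically, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_terms]]


first_pass = retrieve(["partisans"], DOCUMENTS)
expanded = expand_query(["partisans"], DOCUMENTS)
second_pass = retrieve(expanded, DOCUMENTS)
print(first_pass, expanded, second_pass)
```

In this toy setting the second pass also reaches the document that speaks of “bandits,” which the original query could not match. The same mechanism can just as easily pull in unrelated material, which is one reason to keep the user in the loop.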

A final approach of modeling semantics and allowing the user to retrieve information from a collection of documents relies on current advances in NLP, specifically concerning distributional semantics approaches. Capturing the underlying semantic similarity across documents has been a core area of research between NLP and IR. Such efforts have focused on latent semantic indexing (Hofmann, “Probabilistic Latent Semantic Indexing”) and have led to the development of topic modeling algorithms, among which one of the most widely used is latent Dirichlet allocation (Blei, Ng, and Jordan). The development of compact vector representations (so-called word embeddings) that capture the meaning of words, sentences, or documents (Mikolov et al., “Distributed Representations”; Devlin et al.) has allowed for the retrieval of documents based on measuring the semantic similarity between a query and the content of documents (Mitra and Craswell, “Neural Models”; Beelen et al., “When Time Makes Sense”). The advent of such representation learning approaches has been a major breakthrough for context-sensitive and semantic search.8 In particular, these approaches allow us to go beyond matching a user query with content, without the need for an external knowledge base or a relevance feedback involving the user. However, compared to the approaches described above, which are explicit in highlighting the similarities, the most important drawback of distributional semantics approaches is their lack of interpretability, because it is very difficult for the user to determine why their query is judged by the system to be semantically similar to the retrieved documents. As all three of these approaches demonstrate, semantic search can be achieved in different ways, each with a set of strengths and drawbacks. In the next section, we touch on the role that time plays in semantic search.
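To make the idea of similarity-based retrieval concrete, the following sketch ranks documents by the cosine similarity between averaged word vectors. The three-dimensional vectors are hand-made placeholders rather than trained embeddings, and averaging word vectors is only one simple way of representing a query or a document; the function names are ours.

```python
import math

# Hand-made three-dimensional vectors standing in for trained word embeddings.
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "cheerful": [0.8, 0.2, 0.0],
    "storm": [0.0, 0.1, 0.9],
    "rain": [0.1, 0.0, 0.8],
}


def mean_vector(words):
    """Average the vectors of the known words; return None if no word is known."""
    vectors = [TOY_VECTORS[word] for word in words if word in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return document ids ordered from most to least similar to the query."""
    query_vector = mean_vector(query.split())
    scored = []
    for doc_id, text in documents.items():
        doc_vector = mean_vector(text.split())
        if query_vector is not None and doc_vector is not None:
            scored.append((cosine(query_vector, doc_vector), doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


DOCS = {"a": "cheerful", "b": "storm rain"}
ranking = rank("happy", DOCS)
print(ranking)
```

Document “a” is ranked first although it never contains the string “happy.” The score itself, however, says nothing about why the two are judged similar, which is the interpretability problem noted above.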

When Does Time Play a Role?

It is well known that language is a dynamic system (Traugott, “On the Persistence of Ambiguous Linguistic Contexts over Time”), and, like many other aspects of language, word meaning changes with time. Words associated with certain concepts may be different in different time periods. To take a classic example, let us consider the English adjective gay. It originally meant “happy” and “joyful,” but in the twentieth century, it started being used to refer to people who are attracted to others of the same gender, and this is now its predominant meaning.9 This change in word meaning is referred to by linguists as lexical semantic change, and it is a complex phenomenon. The original meaning of a word may coexist with a newer one, and sometimes certain meanings are limited to (or created in) particular sociopolitical and cultural contexts.10 Nowadays, we can expect to find the original bird-related meaning of tweet as “chirp of small birds” in texts about birds, but in general the more recent meaning, “a post on the social media platform X (formerly known as Twitter),” is more common in other contexts. Lexical polysemy can take various shapes.11 A word may have a concrete and an abstract meaning, related metaphorically to each other (cf. “grasp,” meaning “seize” in the physical domain and “understand” in the mental domain), or a word may have opposite polarity values (e.g., “sick” as in “ill” has acquired the positive meaning of “excellent” since the early 1980s).

Word meaning change results from a series of interconnected linguistic, cognitive, social, cultural, and contextual factors, and it is often led by innovative language users and communities (Traugott, “On the Persistence of Ambiguous Linguistic Contexts”; Croft, Explaining Language Change; Andersen, “Markedness and the Theory of Linguistic Change”). All these complex phenomena affecting word meaning are not just relevant to (historical) linguistics research, but they have a direct effect on the effectiveness of search in historical research and on the expectations that researchers pose on it. For example, we often do not know which words were used in the past to describe feelings or phenomena, which makes it particularly hard to trace the linguistic expressions of a concept over time. Because search has always had an implicit temporal perspective, the crucial questions become how to make this perspective explicit, and how to connect the temporal anchoring of a query to the temporal aspects of the document.12

Two related but different semantic change scenarios can have an impact on the process of searching through a historical archive:

Modeling the change in meaning of words (e.g., “gay”)
Modeling the change in linguistic expressions of the underlying concepts (e.g., “happiness-wellbeing-eudaemonia”)

In the first scenario, if we take the perspective of a user performing a search for the word “happy,” we can assume that their understanding of the meaning of this word corresponds to its most common contemporary definition.13 In this case, we would like to find all contemporary documents containing the string “happy” or its morphological inflections “happier” and “happiest” and also contemporary documents containing synonyms of the term, such as “content” or “cheerful.” This requires that the users specify all these synonyms in their query to retrieve the relevant texts. If the search involves historical documents (from the eighteenth century, for example), it would be desirable for the search engine to return documents containing words that meant “happy” at the time—for example, “gay.” For this to happen, it is critical that the change in the meaning of search terms is accounted for in the search engine. Currently, however, the user needs to be fully informed about which synonyms were appropriate in each time period under investigation, which is only partially possible thanks to historical dictionaries, as these resources rely on often large but necessarily limited textual evidence.

In the second scenario, if we want to find documents containing instances of a semantic concept, we need to be aware of the different linguistic expressions that can be used to express that concept. To take the same example used above, the adjective “gay” can be associated with the concept of happiness in texts dated up to the early twentieth century. Starting from this point in time, “gay” acquired the meaning of “homosexual” alongside the original meaning of “cheerful,” and therefore if our search involves texts from the twentieth or twenty-first century, this new meaning should be taken into account. In this case, what we would need is a way to associate words with concepts and to do that over the different time periods we are interested in.
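Both scenarios can be thought of as lookups in a resource that records, for each word, which concept it expressed and when. The sketch below assumes such a resource exists in a very reduced form. The year ranges are rough placeholders chosen to mirror the example above, not datings taken from a historical dictionary, and the names SENSES, words_for_concept, and concepts_for_word are hypothetical.

```python
# Hypothetical mini-lexicon: (word, concept, first year, last year).
# The years are illustrative placeholders, not lexicographic datings.
SENSES = [
    ("gay", "happiness", 1300, 1950),
    ("gay", "homosexuality", 1920, 2100),
    ("happy", "happiness", 1500, 2100),
    ("cheerful", "happiness", 1500, 2100),
]


def overlaps(sense_start, sense_end, start, end):
    """True if the period in which a sense was in use overlaps the search period."""
    return sense_start <= end and sense_end >= start


def words_for_concept(concept, start, end):
    """Second scenario: which words expressed `concept` between `start` and `end`?"""
    return sorted({word for word, c, s, e in SENSES if c == concept and overlaps(s, e, start, end)})


def concepts_for_word(word, start, end):
    """First scenario: which concepts did `word` express between `start` and `end`?"""
    return sorted({c for w, c, s, e in SENSES if w == word and overlaps(s, e, start, end)})


eighteenth_century = words_for_concept("happiness", 1700, 1799)
recent = words_for_concept("happiness", 1990, 2020)
recent_senses = concepts_for_word("gay", 1990, 2020)
print(eighteenth_century, recent, recent_senses)
```

A search engine with access to such period-specific mappings could expand a query for “happy” differently depending on the dates of the documents being searched. Building the mappings at scale is the open problem.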

Current approaches to semantic search are still limited in the two aforementioned scenarios. The semantic representation approaches that rely on manually encoding semantic information in the texts are not easily scalable. Manual semantic annotation is a very time-consuming task and cannot be done for all the texts that researchers may want to search. Some semantic resources like dictionaries and thesauri (the Oxford English Dictionary or the Historical Thesaurus of English, for example) map words’ meanings to the historical periods in which they were in use, but most knowledge bases do not have this diachronic aspect. NLP research has led to various new approaches to representing words’ changing semantics using word embeddings (Hamilton, Leskovec, and Jurafsky, “Cultural Shift or Linguistic Drift?”; Kutuzov et al., “Diachronic Word Embeddings and Semantic Shifts”). Another line of research has focused on identifying lexical semantic change from large historical corpora (Tahmasebi, Borin, and Jatowt, “Survey of Computational Approaches to Lexical Semantic Change”). In particular, there are rare examples of information retrieval approaches that rely on word embedding representations and have developed basic diachronic search engines such as the Diachronic Explorer for Spanish (Gamallo, Rodríguez-Torres, and Garcia, “Distributional Semantics for Diachronic Search”). However, these are still far from being accurate and detailed enough to satisfy the needs of historical research. In the next section, we present a series of recommendations that will hopefully help further this research.
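To illustrate what a search over period-specific word embeddings might look like, the following sketch looks up the nearest neighbors of a word within separate time slices. It is not the method of any of the systems cited above: the three-dimensional vectors are invented by hand, and the names `TOY_SPACES` and `nearest_neighbors` are hypothetical. Real models have hundreds of dimensions, are trained on period corpora, and need to be aligned before vectors from different periods can be compared directly; here comparisons stay within one period.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical, hand-made vectors for two time slices (illustration only).
TOY_SPACES = {
    "1800s": {
        "gay": (0.9, 0.1, 0.0),
        "cheerful": (0.8, 0.2, 0.1),
        "homosexual": (0.0, 0.1, 0.9),
    },
    "1990s": {
        "gay": (0.2, 0.1, 0.9),
        "cheerful": (0.8, 0.2, 0.1),
        "homosexual": (0.1, 0.1, 0.95),
    },
}


def nearest_neighbors(word, period, k=1, spaces=TOY_SPACES):
    """Return the k words closest to `word` inside one period's vector space."""
    space = spaces[period]
    target = space[word]
    scored = [(other, cosine(target, vec)) for other, vec in space.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


for period in TOY_SPACES:
    print(period, nearest_neighbors("gay", period))
```

With these invented vectors, the closest neighbor of “gay” is “cheerful” in the earlier slice and “homosexual” in the later one, which is the kind of signal a diachronic search engine could use to expand a query differently per period.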

Our Vision: Diachronic Semantic Search

With the overview presented above, we intended to shed some light on the challenges lying ahead for digital historians who intend to develop systems for diachronic semantic search. These challenges are especially acute when searching for information on broad, complex, and abstract topics. We conclude this chapter by highlighting a few ways forward that can help researchers appreciate the extent of these challenges, while at the same time developing approaches for addressing the task.

What is needed in order to develop a diachronic semantic search system? First, we need a knowledge representation that captures information in the best possible way from the perspective of the period under study. Second, we need historical distributional word vectors that represent historical language use in the form of a geometrical space (a vector space). These two elements are essential because it is only through their combination that it would become possible to model both well-known sociopolitical changes (for instance, a city such as Leningrad changing its name) and more latent long-term cultural variations (see, for instance, Light, “When Computers Were Women”). However, obtaining such resources presents multiple challenges across the spectrum of current and, as we argued, future computational humanities research.

For instance, distributional representations of the meaning of words can be derived directly from large collections. However, assessing their quality requires setting up a complex procedure to evaluate the type of semantic information captured by the model (e.g., similarity versus relatedness of meaning) and to quantify the distinction between changes in meaning and variations in language usage (Erk, McCarthy, and Gaylord, “Investigations on Word Senses and Word Usages”; Gonen et al., “Simple, Interpretable and Stable Method for Detecting Words”). Such evaluation relies strongly on expertise in the language and the culture of the period under study. A structured knowledge representation of the period (or periods) under study will be even more difficult to obtain. In fact, while diachronic dictionaries and gazetteers exist, their coverage is necessarily partial. For instance, gazetteers often lack relational information between entries (e.g., Florence as the capital of the Kingdom of Italy between 1865 and 1871). At the same time, the large majority of current knowledge graphs and entity linking tools are developed on the type of information available on Wikipedia, ranging from info-boxes to textual context around links, from redirect pages to categories as structures. This limits the possibility of plugging in a pipeline that works in a contemporary setting unless the same type of knowledge about the period under study is offered as input (for example, by having a Wikipedia of the nineteenth century). While efforts in this direction are complex and will require a long time to be implemented, we are already aware of the limitations of contemporary knowledge graph tools when applied to non-twenty-first-century mainstream sources.14

Further, while we need a way to assess the overall reliability of this semantic retrieval system, doing this in the context of historical research is challenging because of the complexity and abstract nature of the queries and the subjective interpretation that each historian will bring when assessing the relevance of a retrieved document based on a specific query. For this reason, we recommend that researchers move away from a pure ranking-based information retrieval paradigm (as a matter of fact, search results are much more than a ranking) and embrace different ways to establish the relevance of the obtained results. For instance, we can consider exploratory faceted search.15 We can also consider active learning, which is a machine learning setting where the user interacts with the algorithm by iteratively labeling new data points with the desired outputs to improve the performance. These approaches are closer to the way a historian would normally interact with a collection: first through a broad overview and then further and further into a specific topic. While the performance of an end-to-end diachronic semantic search system may be too complex to evaluate in its entirety, especially when the users are domain experts, the usefulness of each component could still be measured. For instance, we could assess how diverse the selection of an exploratory interface is or how well an active learning system picks up the signal of the user. Digital interfaces for historical research need to balance user requirements with technical constraints. Establishing the equilibrium between these aspects remains the topic of ongoing research.
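The active learning setting described above can be sketched as a selection step that decides which document to show the historian next. The sketch below uses uncertainty sampling, one common strategy chosen here purely as an illustration; the chapter does not prescribe a particular strategy. The function name `most_uncertain`, the document identifiers, and the relevance scores are all hypothetical, and the scores would in practice come from a model retrained after every new label.

```python
def most_uncertain(scores, labelled):
    """Pick the unlabelled document whose relevance score is closest to 0.5.

    `scores` maps document ids to a model's estimated probability of
    relevance; `labelled` is the set of ids the user has already judged.
    Returns None when every document has been labelled.
    """
    candidates = {doc: p for doc, p in scores.items() if doc not in labelled}
    if not candidates:
        return None
    return min(candidates, key=lambda doc: abs(candidates[doc] - 0.5))


# Hypothetical scores from some relevance model (illustration only).
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10}
labelled = set()

first = most_uncertain(scores, labelled)   # the model is least sure about doc_b
labelled.add(first)
second = most_uncertain(scores, labelled)  # next most uncertain remaining document
print(first, second)
```

In a full loop, each label the historian provides would be used to retrain the model and recompute `scores` before the next selection, so the system gradually narrows in on the topic of interest.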

The complexity of offering diachronic semantic search in a digital history context should not be underestimated and, as a community, we should not follow a data-driven paradigm that reduces context to the creation of a gold standard or a benchmark averaging the opinion of experts (McGillivray et al., “Digital Humanities and Natural Language Processing”). And, even more importantly, this does not apply only to digital history or computational humanities but all across the academic spectrum, to any discipline that relies on the use of computational methods for conducting research. As stressed by Catherine D’Ignazio and Lauren Klein (Data Feminism, chap. 6), “data are not neutral or objective. They are the products of unequal social relations, and this context is essential for conducting accurate ethical analysis.” So, instead of ignoring, reducing, or simplifying the context, we should benefit from its complexity in order to build tools and systems that will bring us closer to the full experience of search that we envision. The analysis of meaning in digital resources is neither a purely algorithmic nor a solely manual task. Instead, we argue that it is situated at the interface of human and machine reading; it arises through a complex interaction between interpretive strategies and semantic technologies. And, as we believe, it will flourish as a result.

Notes

All authors reviewed and approved the final version of this essay. They have joint responsibility for the first and last sections. Kaspar Beelen has primary responsibility for the “Why Search?” section, Federico Nanni for “Approaches toward Semantic Search,” and Barbara McGillivray for the section “When Does Time Play a Role?”

1. See the last example at “Frequently Asked Questions,” La Stampa. Accessed January 15, 2024. http://www.archiviolastampa.it/component/option,com_lastampa/task,faq/Itemid,4/.
2. https://www.historyofconcepts.net.
3. There are some notable exceptions (e.g., Underwood, “Theorizing Research Practices,” and Putnam, “Transnational and the Text-Searchable”).
4. The archives for “Digging into Linked Parliamentary Data” are available at https://blog.history.ac.uk/tag/digging-into-linked-parliamentary-data/ (last accessed January 15, 2024).
5. See also Rovera et al., “Domain-Specific Named Entity Disambiguation”; Ardanuy and Sporleder, “Toponym Disambiguation”; Munnelly and Lawless, “Investigating Entity Linking”; McDonough, Moncla, and van de Camp, “Named Entity Recognition.”
6. Information on the Impresso project and Impresso app are available at https://impresso-project.ch/ (last accessed January 15, 2024).
7. See Amit Singhal, “Introducing the Knowledge Graph: Things, Not Strings,” The Keyword (blog), May 16, 2012, https://www.blog.google/products/search/introducing-knowledge-graph-things-not/ (last accessed January 15, 2024). Regarding the broader role of Google in the information retrieval community, a good starting point is Brooks’s 2003 article on “Web Search: How the Web Has Changed Information Retrieval.”
8. See also the advent of a new “deep learning” evaluation track in 2019 at TREC, the Text REtrieval Conference (Craswell et al.).
9. Cf., for example, “gay, adj., adv., and n.” at Oxford English Dictionary (OED) Online, accessed July 14, 2020, https://oed.com/view/Entry/77207?rskey=JyQmyU&result=1&isAdvanced=false.
10. Consider, for instance, the communication on social media in order to avoid censorship; see https://www.amnesty.org/en/latest/news/2020/03/china-social-media-language-government-censorship-covid/.
11. See Bréal, “Les lois intellectuelles du langage”; Ullmann, Semantics; Blank, “Why Do New Meanings Occur?”; Koch, “Meaning Change and Semantic Shifts.”
12. The structure and classification of archives also change over time. This point is outside the scope of the present chapter, but it is an important element to be considered.
13. See https://www.dictionary.com/browse/happy (last accessed January 15, 2024).
14. As discussed in Ardanuy and Sporleder, “Toponym Disambiguation”; Olieman et al., “Good Applications for Crummy Entity Linkers?”; Rovera et al., “Domain-Specific Named Entity Disambiguation”; Munnelly and Lawless, “Investigating Entity Linking”; McDonough, Moncla, and van de Camp, “Named Entity Recognition Goes to Old Regime France”; Ardanuy et al., “A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching,” among others.
15. A good example of faceted search tailored toward the research needs of historians and the content of the database is “Freedom on the Move” (https://freedomonthemove.org). This interface allows users to formulate a detailed search based on recurrent descriptions of fugitives and enslavers in American “runaway ads.” See also chapter 11 in this volume.

Bibliography

Allen, R. B., and R. Sieczkiewicz. “How Historians Use Historical Newspapers.” Proceedings of the American Society for Information Science and Technology 47, no. 1 (2010): 1–4.
Andersen, H. “Markedness and the Theory of Linguistic Change.” In Actualization: Linguistic Change in Progress, edited by H. Andersen, 21–57. Amsterdam: John Benjamins, 2001.
Ardanuy, M. C., and C. Sporleder. “Toponym Disambiguation in Historical Documents Using Semantic and Geographic Features.” In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 175–80. New York: Association for Computing Machinery, 2017.
Ardanuy, M. C., K. Hosseini, K. McDonough, A. Krause, D. van Strien, and F. Nanni. “A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching.” In Proceedings of the 28th International Conference on Advances in Geographic Information Systems, 385–88. New York: Association for Computing Machinery, 2020.
Beelen, K., F. Nanni, M. C. Ardanuy, K. Hosseini, G. Tolfo, and B. McGillivray. “When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation.” In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2751–61. Association for Computational Linguistics, 2021.
Berners-Lee, T., J. Hendler, and O. Lassila. “The Semantic Web.” Scientific American 284, no. 5 (2001): 34–43.
Blank, A. “Why Do New Meanings Occur? A Cognitive Typology of the Motivations for Lexical Semantic Change.” Historical Semantics and Cognition 13, no. 6 (1999).
Blei, D. M., A. Y. Ng, and M. I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
Brake, L. “Half Full and Half Empty.” Journal of Victorian Culture 17, no. 2 (2012): 222–29.
Bréal, M. “Les lois intellectuelles du langage: fragment de sémantique.” Annuaire de l’Association pour l’Encouragement des Études Grecques en France 17 (1883): 132–42.
Brooks, T. A. “Web Search: How the Web Has Changed Information Retrieval.” Information Research 8, no. 3 (2003).
Brugman, H., M. Reynaert, N. van der Sijs, R. van Stipriaan, E. T. K. Sang, and A. van den Bosch. “Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora.” In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 1277–81. Göttingen: 2016.
Craswell, N., B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees. “Overview of the TREC 2019 Deep Learning Track.” ArXiv:2003.07820 (2020).
Croft, W. Explaining Language Change: An Evolutionary Approach. Harlow, Essex: Longman, 2000.
Croft, W. B., M. Bendersky, H. Li, and G. Xu. “Query Representation and Understanding Workshop.” SIGIR Forum 44, no. 2 (December 2010): 48–53.
Cummings, J. “The Text Encoding Initiative and the Study of Literature.” In A Companion to Digital Literary Studies, edited by Ray Siemens and Susan Schreibman. Blackwell, 2013.
Dalton, J., L. Dietz, and J. Allan. “Entity Query Feature Expansion Using Knowledge Base Links.” In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, 365–74. 2014.
Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. “Indexing by Latent Semantic Analysis.” Journal of the American Society for Information Science 41, no. 6 (1990): 391–407.
Devlin, J., M. W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In NAACL-HLT 1 (January 2019).
D’Ignazio, C., and L. F. Klein. Data Feminism. Cambridge, Mass.: MIT Press, 2020.
Ehrmann, M., M. Romanello, S. Bircher, and S. Clematide. “Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers.” In European Conference on Information Retrieval, 524–32. Cham: Springer, 2020.
Erk, K., D. McCarthy, and N. Gaylord. “Investigations on Word Senses and Word Usages.” In Proceedings of ACL-IJCNLP 2009—Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and 4th International Joint Conference on Natural Language Processing of the AFNLP, 10–18. Stroudsburg, Pa.: Association for Computational Linguistics, 2009.
Featherstone, L. “Snippets and Shadows of Stories: Thoughts on Sources and Methods When Writing an Australian History of Sexuality.” In Intimacy, Violence and Activism: Gay and Lesbian Perspectives on Australasian History and Society, edited by G. Willett and Y. Smaal, 74–89. Melbourne: Monash University Publishing, 2013.
Gamallo, P., I. Rodríguez-Torres, and M. Garcia. “Distributional Semantics for Diachronic Search.” Computers and Electrical Engineering 65 (October 2018): 438–48. https://doi.org/10.1016/j.compeleceng.2017.07.017.
Gonen, H., G. Jawahar, D. Seddah, and Y. Goldberg. “Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 538–55. Stroudsburg, Pa.: Association for Computational Linguistics, 2020. https://doi.org/10.18653/v1/2020.acl-main.51.
Graham, S., I. Milligan, and S. Weingart. Exploring Big Historical Data: The Historian’s Macroscope. Singapore: World Scientific Publishing, 2015.
Guha, R., R. McCool, and E. Miller. “Semantic Search.” In Proceedings of the 12th International Conference on World Wide Web, 700–709. New York: Association for Computing Machinery, 2003.
Guldi, J. “Critical Search: A Procedure for Guided Reading in Large-Scale Textual Corpora.” Journal of Cultural Analytics 3, no. 1 (2018). https://doi.org/10.22148/16.030.
Hamilton, W. L., J. Leskovec, and D. Jurafsky. “Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change.” In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2116–21. Association for Computational Linguistics, 2016.
Hengchen, S., R. Ros, and J. Marjanen. “A Data-Driven Approach to the Changing Vocabulary of the Nation in English, Dutch, Swedish and Finnish Newspapers, 1750–1950.” In Proceedings of the Digital Humanities (DH) Conference. 2019.
Hofmann, T. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. 1999.
Koch, P. “Meaning Change and Semantic Shifts.” In The Lexical Typology of Semantic Shifts, edited by P. Juvonen and M. Koptjevskaja-Tamm, 21–66. Berlin: De Gruyter, 2016.
Kutuzov, A., L. Øvrelid, T. Szymanski, and E. Velldal. “Diachronic Word Embeddings and Semantic Shifts: A Survey.” In Proceedings of the 27th International Conference on Computational Linguistics, 1384–97. Association for Computational Linguistics, 2018.
Light, J. S. “When Computers Were Women.” Technology and Culture 40, no. 3 (1999): 455–83.
McDonough, K., L. Moncla, and M. van de Camp. “Named Entity Recognition Goes to Old Regime France: Geographic Text Analysis for Early Modern French Corpora.” International Journal of Geographical Information Science 33, no. 12 (2019): 2498–522.
McGillivray, B. “Computational Methods for Semantic Analysis of Historical Texts.” In Research Methods in the Digital Humanities, edited by K. Schuster and S. Dunn, 261–74. London: Routledge, 2020.
McGillivray, B., S. Hengchen, V. Lähteenoja, M. Palma, and A. Vatri. “A Computational Approach to Lexical Polysemy in Ancient Greek.” Digital Scholarship in the Humanities 34, no. 4 (2019): 893–907. https://doi.org/10.1093/llc/fqz036.
McGillivray, B., T. Poibeau, and P. Ruiz Fabo. “Digital Humanities and Natural Language Processing: ‘Je t’aime . . . Moi non plus.’” DHQ: Digital Humanities Quarterly 14, no. 2 (2020).
Michel, J. B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331, no. 6014 (2011): 176–82.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems, 3111–19. Curran Associates, 2013.
Mitra, B., and N. Craswell. “Neural Models for Information Retrieval.” ArXiv:1705.01509 (2017).
Moretti, F. Distant Reading. London: Verso Books, 2013.
Munnelly, G., and S. Lawless. “Investigating Entity Linking in Early English Legal Documents.” In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, 59–68. New York: Association for Computing Machinery, 2018.
Olieman, A., K. Beelen, M. van Lange, J. Kamps, and M. Marx. “Good Applications for Crummy Entity Linkers? The Case of Corpus Selection in Digital Humanities.” In Proceedings of the 13th International Conference on Semantic Systems, 81–88. New York: Association for Computing Machinery, 2017.
Osmond, G. “‘Pink Tea and Sissy Boys’: Digitized Fragments of Male Homosexuality, Non-Heteronormativity and Homophobia in the Australian Sporting Press, 1845–1954.” International Journal of the History of Sport 32, no. 13 (2015): 1578–92.
Putnam, L. “The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast.” American Historical Review 121, no. 2 (2016): 377–402. https://doi.org/10.1093/ahr/121.2.377.
Rovera, M., F. Nanni, S. P. Ponzetto, and A. Goy. “Domain-Specific Named Entity Disambiguation in Historical Memoirs.” In Proceedings of the Fourth Italian Conference on Computational Linguistics CLiC-IT 2017, 287–91. Aachen, Germany: CEUR, 2017.
Schütze, H., C. D. Manning, and P. Raghavan. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.
Shen, W., J. Wang, and J. Han. “Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions.” IEEE Transactions on Knowledge and Data Engineering 27, no. 2 (2014): 443–60.
Story, D. J., J. Guldi, T. Hitchcock, and M. Moravec. “History’s Future in the Age of the Internet.” American Historical Review 125, no. 4 (2020): 1337–46.
Tahmasebi, N., L. Borin, and A. Jatowt. “Survey of Computational Approaches to Lexical Semantic Change.” ArXiv:1811.06278 (2018). https://arxiv.org/abs/1811.06278.
Traugott, E. C. “On the Persistence of Ambiguous Linguistic Contexts over Time: Implications for Corpus Research on Micro-Changes.” Corpus Linguistics and Variation in English, 231. 2012.
Ullmann, S. Semantics: An Introduction to the Study of Meaning. Oxford: Basil Blackwell, 1962.
Underwood, T. “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago.” Representations 127, no. 1 (2014): 64–72. https://doi.org/10.1525/rep.2014.127.1.64.
Xu, J., and W. B. Croft. “Query Expansion Using Local and Global Document Analysis.” Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. January 1996.
