Debates in the Digital Humanities 2023

Chapter 19

Language Is Not a Default Setting: Countering DH’s English Problem

Quinn Dombrowski and Patrick J. Burns

Language should be everyone’s concern in the humanities, although it is often rendered invisible for scholars who work on Anglophone materials and live in Anglophone countries. In such a context, it is easy never to think about language beyond the fleeting hints and shadows evoked by an accent mark in a text or a grammatical imperfection. The English language becomes simply “language”; English literature becomes, sweepingly, “literature.” In “Distant Reading after Moretti,” Lauren Klein calls for “more corpora—more accessible corpora—that perform the work of recovery or resistance” as a step toward addressing structural problems of sexism and racism in computational literary studies.1 We would like to echo the urgency of this call and amplify it with a twist: language is another important axis of diversity that has received too little consideration, particularly in North American digital humanities (DH). In developing corpora and models—and, perhaps just as importantly, in developing the tools and tutorials needed to create, transform, or analyze these corpora and models—it is essential that we also challenge the role of English as a default setting in DH research.

Structural power is held disproportionately by the Anglophone world, within DH and beyond it.2 English itself, however, has an impact beyond its role as a lingua franca of scholarly communication. Many marginalized voices are “hidden” in plain sight: they are inaccessible for enriching our collective understanding of concepts such as gender, race, and ability simply because they are written in languages without, for example, high-quality optical character recognition (OCR) options or natural language processing (NLP) tools, while funding and attention from both the academy and industry continue to be directed toward further refining tools for English. Without high-quality OCR, scholars are limited to creating images of text. Yet OCR alone is only the first step toward a text that is usable for text analysis (Smith and Cordell). Digitized text without language-specific preprocessing tools may not even be usable with “language-neutral” methods such as topic modeling, because of the variety of word forms present in the text. Digital humanities scholars are well placed to work toward developing more robust tools for non-English languages, particularly when that development takes the form of international collaboration. We argue that such tool development should be viewed as part of a broader agenda for diversifying DH.
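
To make the preprocessing point concrete, here is a minimal illustration of our own (not drawn from any particular toolchain) showing how unlemmatized word forms fragment the frequency counts on which methods like topic modeling depend:

```python
from collections import Counter

# Six inflected forms of the single Russian lemma "книга" ("book").
# Without a language-specific lemmatizer, a bag-of-words method treats
# them as six unrelated word types.
forms = "книга книги книге книгу книгой книг".split()
print(Counter(forms))
# Counter({'книга': 1, 'книги': 1, 'книге': 1, 'книгу': 1, 'книгой': 1, 'книг': 1})
# An equivalent English text would mostly collapse to "book" and "books,"
# which is why frequency-based methods degrade far less for English
# in the absence of lemmatization.
```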

The development of user-friendly tools that are accessible to scholars without a coding background is an expensive proposition, and it is reasonable to question whether developing such tools in a language-specific manner is desirable or feasible. Tool-based methods are particularly common in pedagogical contexts, where it may be impractical to teach humanities students without previous technical backgrounds the fundamentals of code and the basics of implementing text analysis methods before they can see the potential value of those methods for answering disciplinary questions.3 Tools reduce the number of decisions that have to be made before a scholar can run an analysis, in part by providing default settings (e.g., tokenization, stopword removal) that should work in most cases. However, these default settings are often premised on characteristics of the English language, such as the convention of separating words with spaces, a left-to-right reading order, and a relatively fixed word order. These characteristics are partly shared by other Indo-European languages, but they are not necessarily generalizable to other languages. User interface–based (UI) text analysis tools with English-oriented preprocessing rules give Anglophone scholars who are new to DH an easier on-ramp; if they continue with computational text analysis, they may move toward code-based workflows that provide more control, though at the cost of more complexity. But for students working in many other languages, there is no meaningful on-ramp: DH methods—including anything connected to counting words—simply “don’t work” when applied to a text that has not first been preprocessed with language-specific algorithms, algorithms that the easy-to-use DH tool does not include.
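
What such an English-premised default looks like in practice can be shown in a few lines of Python; the example sentences below are our own:

```python
# The most common tool default: treat whitespace as a word boundary.
text_en = "The cat sat on the mat."
text_zh = "猫坐在垫子上。"  # roughly the same sentence in Chinese, written without spaces

print(text_en.split())  # ['The', 'cat', 'sat', 'on', 'the', 'mat.'] -- workable for English
print(text_zh.split())  # ['猫坐在垫子上。'] -- the entire sentence returned as one "word"
```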

Counting words is the simplest kind of text analysis, and the concept of word frequencies underpins more complex methods commonly used in DH, such as topic modeling and word vectors. But surely the definition of “word” is open to debate if one space-delimited “word” in an agglutinative language like Turkish translates roughly to fifteen English “words.”4 For character-based languages like Chinese and Japanese that do not separate words with whitespace, different segmentation algorithms (which artificially insert whitespace) disagree about how many words are in a text or where to place those boundaries. If students working with languages other than English simply substitute their own texts for the English-language examples used in class, they are likely to find themselves frustrated or facing results that are misleading or erroneous.5
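
The disagreement among segmenters is easy to observe in practice. A sketch using jieba, a widely used open-source Chinese segmentation library (this assumes jieba is installed; exact output can vary by version and dictionary):

```python
import jieba  # third-party Chinese word segmentation library

sentence = "南京市长江大桥"  # "Nanjing Yangtze River Bridge," a classically ambiguous string

print(jieba.lcut(sentence))                # precise mode, e.g., ['南京市', '长江大桥']
print(jieba.lcut(sentence, cut_all=True))  # full mode, e.g., ['南京', '南京市', '市长', '长江', '长江大桥', '大桥']
# The two modes do not even agree on how many "words" the sentence
# contains, let alone on where their boundaries fall.
```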

Code-based workflows that require scholars to specify the entire pipeline avoid the constraints of UI-based text analysis tools with defaults that are hard to change. However, similar challenges arise for scholars who work with languages other than English. Pedagogical materials designed to teach students how to write code for implementing DH text analysis methods are often based on English-language examples and choose dependency libraries accordingly. Major libraries—the Natural Language Toolkit, spaCy, Stanford CoreNLP and Stanza, Stylo—support different sets of languages to varying degrees.6 If the library described in a tutorial supports the language that a scholar needs, adapting the tutorial to work with that language may simply be a matter of downloading different model data and substituting the name of the new model. Even this can take a fair bit of guesswork, particularly for less technically proficient scholars, because tutorials rarely make explicit which steps are generic and which are language-specific. But if the library does not support the language a scholar needs, the tutorial is essentially useless. Some of the most effective libraries for less widely spoken languages are developed by researchers who are themselves native speakers of those languages.7 While the existence of these libraries is a tremendous help for scholars, they often lack the kind of detailed, beginner-friendly documentation—let alone tutorials—that tends to accompany larger libraries, making them more difficult for beginners to use.
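
In the best case, the adaptation really is that small. A sketch using Stanza (the Turkish example sentence is ours): the only language-specific steps are the model download and the pipeline’s language code; the loop that follows is generic.

```python
import stanza

# Language-specific steps: download a model and name its language code.
# Swapping "en" for "tr" is, ideally, the entire adaptation.
stanza.download("tr")
nlp = stanza.Pipeline("tr")

# Generic steps: identical for every language Stanza supports.
doc = nlp("Evlerimizden geliyoruz.")  # "We are coming from our houses."
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.lemma, word.upos)
```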

And yet, these conundrums—how to find an effective code library or adapt tools to meet the language-specific demands of a particular project—still assume a best-case scenario where work has already been done on developing NLP resources for the language. This is not the case for the vast majority of the world’s languages, particularly Indigenous languages and languages of the Global South (Risam, 44–46). While tremendous progress has been made in improving support for working with non-English languages in the last fifteen years, this progress has largely been concentrated in areas where there is perceived commercial value (like machine translation or audio recognition for languages with a large, affluent speaker base) or where grants and dedicated researchers have worked to close a gap after initial commercial investment (as, for example, with the expansion of Unicode to cover more African scripts; see Osborn, Anderson, and Kodama).8 New algorithms and approaches, including neural networks, have led to significant technical advancements in fields that digital humanists often borrow from, including computer science, computational linguistics, and information science. At the same time, Anglophone-centric tendencies in those fields take root in Anglophone digital humanities in indirect and subtle ways—for example, working toward hyper-optimized NLP algorithms as applied to a widely used English Wall Street Journal news corpus rather than improving the poor or mediocre accuracy of those algorithms for other languages and genres.9 Furthermore, when work is done on low-resource languages within computational linguistics or computer science, there are no structural incentives for scholars to advance their prototype to a point of usability, so there is a great deal of research that never results in tool development.10 This is another manifestation of the structural power held by Anglophone academia to the detriment of broader global communities: Prestige comes from publishing academic papers, not from developing something that people can use. A new prototype can lead to a new publication, but the labor required for a usable tool is not publishable, visible, or rewarded.

There have been notable steps forward within the DH community to improve the support landscape through more language-aware tutorials and the development of tools that are either linguistically flexible or designed specifically to support under-resourced languages. Stéfan Sinclair and Geoffrey Rockwell’s Voyant Tools is one example of a project engaging with requests for supporting linguistic diversity, from building in features to handle segmented Tibetan to working through the details of stopword lists for ancient Greek and Latin.11 Other digital humanities scholars whose work aligns closely with tool building, such as David Bamman, have directed attention toward supporting communities of scholars who work with the “long tail” of languages in developing NLP tools that meet their research needs. Within and beyond DH, groups of committed individuals with the relevant linguistic expertise are collaborating through events like the African Language Dataset Challenge and the Masakhane Project for developing machine-translation models for African languages.12

We also see a clear bright spot in Programming Historian, which has been mitigating the lack of pedagogical materials for languages where NLP resources do exist: it introduced a Spanish version of the site in 2017, a French version in 2019, and a Portuguese version in 2021 (Afanador-Llach; Papastamkou; Alves and Isasi; Walsh).13 These projects—at the time of writing there were fifty-seven Spanish, twenty-two French, and twenty-six Portuguese tutorials—started largely with translations of existing tutorials, but it is clear from a recent call for submissions that there is interest in developing language-specific content; that is, there is an interest in avoiding an English-as-default approach to the site’s workflow.14 Fourteen original Spanish lessons have been published since the call. While we wait for more Spanish-, French-, and Portuguese-originating tutorials, it is worth calling attention to the subtle ways that Programming Historian decenters English through its “internationalization strategy.”15 One tutorial adds a section called “Notas sobre las palabras en español” (“Notes about the Spanish words”), drawing attention to the ways in which Spanish words need to be handled differently from English words in text analysis (Turkel and Crymble). Another tutorial adds a supplementary French bibliography so that readers can pursue further reading in the same language (Froehlich). In the tutorials, comments referring to external documentation as “disponibles solamente en inglés” (“only available in English”) and “en anglais seulement” (“only in English”) expose the Anglocentricity latent in the tools and gesture toward an alternative future of linguistically complete support that Programming Historian aims to provide in time (Dewar; Laramée). More work, however, remains to be done, particularly in the development of tutorials building on the needs, interests, and opportunities emerging from other languages (e.g., the analysis of salient grammatical structures like politeness markers in Japanese) that have no easy analogue in English.

In their 2017 article on defining a critical approach to interdisciplinary work in modern languages and digital humanities, Thea Pitman and Claire Taylor write: “As DH continues to mature and see itself less as providing tools, and more as enabling critical ways of thinking, [the field of Modern Languages] can contribute . . . a contestation of assumptions regarding (unstated) Anglophone models of the digital.” We agree that research at the intersection of modern languages and DH is essential to contesting linguistic assumptions. At the same time, we note that “providing tools” is itself a way of “enabling critical ways of thinking.” This is in keeping with Roopika Risam’s reminder that “digital humanities tools, methods, and projects must be used and built with the understanding that biases and values are built into these essential elements of practice” (40). And as James Dobson writes, “Any humanities method using computation must apply reflexive thought to all stages of computation, iteratively applying an interpretative frame to those filters, functions, tools, and transformations that would otherwise obscure the interpretative work embedded within these building blocks of computation” (131). Assumptions latent within a tool can have massive implications for interpretation and as such must be subject to critique.

At the same time, before these assumptions can be subject to critique, they need to be acknowledged. Luckily for digital humanists, we have an example to follow in the adjacent field of natural language processing. Emily Bender has advocated for NLP researchers to be more explicit about naming the languages they are working on, a formulation now referred to as the “Bender Rule.”16 Bender argues that for NLP to move toward being “language independent”—that is, to become more generalizable across more languages—the field must incorporate specific linguistic knowledge.17 For this to happen, the first step is for researchers to be more transparent and more explicit about which languages they are working on.18 As we have argued above, a failure to make use of specific linguistic knowledge and a tendency to assume English as a kind of default for text analysis work raise similar issues in DH. It is in this spirit that we suggest the following: digital humanities needs a Bender Rule.
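
What might observing such a rule look like at the level of everyday DH code? One possibility, sketched below as a hypothetical convention of our own rather than any established practice, is to make the language a required, explicitly stated parameter instead of an unstated default:

```python
from collections import Counter

def word_frequencies(texts, language):
    """Count space-delimited tokens, with the language stated up front.

    Note that whitespace tokenization is itself a language-specific
    assumption: it fails outright for unsegmented scripts such as
    Chinese or Japanese.
    """
    if not language:
        raise ValueError("State the language being studied, even if it's English.")
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

# The language must be named, even (especially) when it is English.
freqs = word_frequencies(["the cat sat on the mat"], language="en")
```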

Explicitly acknowledging English as the language of the texts we study (when what we are studying is, in fact, English-language text), as the basis of the corpora we draw from and train our models on (when we build on computer science and NLP methods developed using English), and as the reason for the default parameters of the tools we use (when they are developed by or for the Anglophone world) promotes a critical digital humanities at a fundamental level and from the outset of the research process. Challenging the “unearned advantage” afforded to native English speakers in learning to code, Gretchen McCulloch writes that “when we name the English default, it becomes more obvious that we can question it.” For digital humanities, it is not precisely a matter of advantage or disadvantage (although that does manifest itself). Rather, we have a humanistic responsibility to recognize, challenge, and overcome our assumptions, including those linguistic assumptions discussed above that too often pass without notice or debate. Only then can we avoid reinforcing structures of power and begin to take steps to undermine these structures by surfacing and centering hidden voices, regardless of whether they are obscured on account of race, gender, ability, or language.19

Notes

  1. See Klein, citing and building on previous scholarship from Moya Bailey, Miriam Posner, Lisa Rhody, Tanya Clement and Jessica Marie Johnson, and Laura Mandell.

    Return to note reference.

  2. See numerous critiques of the structural power of English within digital humanities over the last decade, including Cohen (236–37); Fiormonte; Clavert; Galina; Grandjean; Ortega (“Local and International Scalability” and “Multilingualism”); Spence; Dacos; Gil and Ortega; Mahony; Risam.

    Return to note reference.

  3. We have in mind here what Rockwell and Sinclair (Hermeneutica) refer to as “tools-as-methods”—that is, computer-assisted text-analysis environments, interfaces, toolboxes, and so on that do not require programming skills of their users.

    Return to note reference.

  4. On Turkish grammar, see “Longest Word in Turkish,” Wikipedia, https://en.wikipedia.org/wiki/Longest_word_in_Turkish. Knowles and Don (80) ask similar questions about “lemmas,” or dictionary headwords, and whether it is possible to apply this concept in a meaningful way across languages as different as English, Latin, Arabic, and Malay, concluding that a failure to acknowledge features specific to individual languages could unhelpfully lead to “the practice of looking at all languages as though they were varieties of English.”

    Return to note reference.

  5. See Galina: “Methods that have worked effectively in one cultural setting may fail spectacularly in another (and vice versa) and certain reasoning of how things should work does not apply similarly to other frameworks.”

    Return to note reference.

  6. There are also projects like Universal Dependencies (Zeman et al.), a collection of linguistic treebanks developed to enable cross-linguistic comparability of syntactic structures, that run the risk of flattening the more nuanced features of individual languages—features that may be among the more interesting things to track, particularly in a literary context. While projects of this sort provide value for some research questions, they can also suggest a kind of “specious universalism” (Risam, 40).

    Return to note reference.

  7. See, for instance, the work of the Ixa group on Basque at http://ixa.si.ehu.es/.

    Return to note reference.

  8. While much progress has been made in the last decade, a call in bold, red letters at the top of the Script Encoding Initiative’s website (http://www.linguistics.berkeley.edu/sei/index.html) reminds us that “over 100 scripts remain to be encoded.”

    Return to note reference.

  9. It must be noted that the particular prevalence of the Wall Street Journal, Wikipedia, and other modern web-based text corpora privilege modernity as much as English. Scholars of premodern English face the same scarcity of NLP resources as scholars of non-English languages.

    Return to note reference.

  10. Note, for example, the ratio of software to scientific papers in the list of NLP research and engineering for American Native/Indigenous Languages maintained by Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza-Ruiz at https://github.com/pywirrarika/naki.

    Return to note reference.

  11. The tools are available at https://voyant-tools.org/. The focal point for discussion of improving language support in Voyant is the Issues section of their GitHub repository; for Tibetan segmentation, see https://github.com/sgsinclair/Voyant/issues/342 and for ancient Greek and Latin stopwords, see https://github.com/sgsinclair/Voyant/issues/382. See also Chapter 3 in this book for a longer discussion of how the Arabic version of Voyant was developed.

    Return to note reference.

  12. For the African Language Dataset, led by Kathleen Siminyu, see https://ai4d.ai/african-language-dataset-challenge/; for the Masakhane Project, see https://www.masakhane.io/.

    Return to note reference.

  13. It should be expressly stated that Programming Historian en español, Programming Historian en français, and Programming Historian em português, as is made clear from the index page at https://programminghistorian.org, are not sections of the “initial English version” but online journals in their own right, each with a unique ISSN. See also Sichani: “These new full-language initiatives stand as huge milestones in our linguistic diversity strategy.”

    Return to note reference.

  14. See Isasi and Motilla: “Nos han motivado a abrir un convocatoria para la recepción de propuestas de tutoriales de contenidos originales en español, los cuales, una vez aprobados y publicados, pueden ser traducidos al inglés para su amplia difusión en el mundo no hispanohablante.” (“[Various factors in Spanish-language DH] have motivated us to open up a call for original-content proposals written in Spanish, which once accepted and published, can be translated into English for better distribution outside of the Spanish-speaking world.”)

    Return to note reference.

  15. According to Sichani: “As part of our internationalization strategy, we encourage authors to write for a Global Audience by making choices (methods, tools, primary sources, bibliography, standards) with multi-lingual readers in mind while also being aware of cultural differences.” Programming Historian’s “Author Guidelines” (https://programminghistorian.org/en/author-guidelines) include a section that similarly advises authors to take care in choosing methods and tools as they “may not support other character sets or may only provide intellectually robust results when used on English texts.”

    Return to note reference.

  16. The rule is best seen in action as the hashtag #BenderRule; see https://twitter.com/search?q=%23BenderRule.

    Return to note reference.

  17. Bender (“Linguistically Naïve”) notes here that at the 2008 Annual Meeting of the Association of Computational Linguistics, the most common language studied was English with eighty-one papers; the second most common languages were Chinese and German with five papers each.

    Return to note reference.

  18. The first formulation of the rule comes in a section on “prescriptions for typologically-informed NLP,” a ten-point list of best practices for computational language research (Bender, “On Achieving and Evaluating Language-Independence in NLP,” 18–19). The “prescription” reads: “Do state the name of the language that is being studied, even if it’s English. Acknowledging that we are working on a particular language foregrounds the possibility that the techniques may in fact be language-specific. Conversely, neglecting to state that the particular data used were in, say, English, gives false veneer of language-independence to the work.”

    Return to note reference.

  19. See Gorman: “We must not confuse the act of perceiving and naming the hegemon with the far more challenging act of actually combating it.”

    Return to note reference.

Bibliography

  1. Afanador-Llach, M. J. “¡Bienvenidos a The Programming Historian en español!” Programming Historian. March 5, 2017, https://programminghistorian.org/posts/lanzamiento-PH-espanol.

  2. Alves, D., and Isasi, J. “Publicação do Programming Historian em português.” Programming Historian. January 29, 2021, https://programminghistorian.org/posts/launch-portuguese.

  3. Bamman, D. “Natural Language Processing for the Long Tail.” Presentation at DH2017, Montreal, Quebec, Canada, 2017, http://people.ischool.berkeley.edu/~dbamman/pubs/pdf/dh2017.pdf.

  4. Bender, E. M. “Linguistically Naïve != Language Independent: Why NLP Needs Linguistic Typology.” In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous? edited by Timothy Baldwin and Valia Kordoni, 26–32. Stroudsburg, Pa.: Association for Computational Linguistics, 2009.

  5. Bender, E. M. “On Achieving and Evaluating Language-Independence in NLP.” Linguistic Issues in Language Technology 6, no. 3 (2011): 1–26.

  6. Bender, E. M. “The #BenderRule: On Naming the Languages We Study and Why It Matters.” The Gradient. September 14, 2019, https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/.

  7. Clavert, F. “The Digital Humanities Multicultural Revolution Did Not Happen Yet.” L’histoire contemporaine à l’ère numérique. 2013, https://histnum.hypotheses.org/1546.

  8. Cohen, M. “Design and Politics in Electronic American Literary Archives.” In The American Literature Scholar in the Digital Age, edited by A. E. Earhart and A. W. Jewell, 228–49. Ann Arbor: University of Michigan Press, 2011.

  9. Dacos, M. “La stratégie du sauna finlandais: Les frontières des Digital Humanities.” Digital Studies/Le Champ Numérique 6, no. 2 (2016), https://www.digitalstudies.org/articles/10.16995/dscn.41/.

  10. Dewar, T. “Datos tabulares en R.” Translated by Jennifer Isasi, Joseba Moreno, and Antonio Rojas Castro. Programming Historian. September 5, 2016, https://programminghistorian.org/es/lecciones/datos-tabulares-en-r.

  11. Dobson, J. E. Critical Digital Humanities: The Search for a Methodology. Champaign, Ill.: University of Illinois Press, 2019.

  12. Fiormonte, D. “Towards a Cultural Critique of the Digital Humanities.” Historical Social Research/Historische Sozialforschung 37, no. 3 (2012): 59–76.

  13. Froehlich, H. “Analyse de corpus avec AntConc.” Translated by Hugo Bonin and Sofia Papastamkou. Programming Historian. June 19, 2015, https://programminghistorian.org/fr/lecons/analyse-corpus-antconc.

  14. Galina, I. “Is There Anybody Out There? Building a Global Digital Humanities Community.” Humanidades Digitales. July 19, 2013, http://humanidadesdigitales.net/blog/2013/07/19/is-there-anybody-out-there-building-a-global-digital-humanities-community.

  15. Gil, A., and Ortega, É. “Global Outlooks in Digital Humanities.” In Doing Digital Humanities: Practice, Training, Research, edited by C. Crompton, R. J. Lane, and R. Siemens, 22–34. London: Routledge, 2016.

  16. Gorman, K. “Action, Not Ritual.” Wellformedness. 2019, http://www.wellformedness.com/blog/action-not-ritual/.

  17. Grandjean, M. “Le rêve du multilinguisme dans la science : l’exemple (malheureux) du colloque #DH2014.” Martin Grandjean. June 27, 2014, http://www.martingrandjean.ch/multilinguisme-dans-la-science-dh2014/.

  18. Isasi, J., and Motilla, J. A. “Convocatoria para lecciones en español en The Programming Historian.” Programming Historian. April 6, 2018, https://programminghistorian.org/posts/convocatoria-de-tutoriales.

  19. Klein, L. F. “Distant Reading after Moretti.” Arcade. 2018, https://arcade.stanford.edu/blogs/distant-reading-after-moretti.

  20. Knowles, G., and Don, Z. M. “The Notion of a ‘Lemma’: Headwords, Roots and Lexical Sets.” International Journal of Corpus Linguistics 9, no. 1 (2004): 69–81.

  21. Laramée, F. D. “Introduction à la stylométrie en Python.” Translated by François Dominic Laramée and Sofia Papastamkou. Programming Historian. April 21, 2018, https://programminghistorian.org/fr/lecons/introduction-a-la-stylometrie-avec-python.

  22. Mahony, S. “Cultural Diversity and the Digital Humanities.” Fudan Journal of the Humanities and Social Sciences 11, no. 3 (2018): 371–88, https://doi.org/10.1007/s40647-018-0216-0.

  23. McCulloch, G. “Coding Is for Everyone—as Long as You Speak English.” Wired.com. April 8, 2019, https://www.wired.com/story/coding-is-for-everyoneas-long-as-you-speak-english/.

  24. Ortega, É. “Local and International Scalability in DH.” Readers of Fiction (in Internet Archive Wayback Machine). July 14, 2014, https://web.archive.org/web/20140714103841/http://lectoresdeficcion.blogs.cultureplex.ca/2014/07/02/scalability/.

  25. Ortega, É. “Multilingualism in DH.” Disrupting the Digital Humanities: 2015 MLA Position Papers (in Internet Archive Wayback Machine). December 31, 2014, https://web.archive.org/web/20210424073656/https://www.disruptingdh.com/multilingualism-in-dh/.

  26. Osborn, D. Z., D. W. Anderson, and S. Kodama. “Support for Modern African Languages and Scripts in Unicode/ISO 10646: Where Are We Today?” Presentation at the 32nd Internationalization and Unicode Conference, San Jose, California, September 2008.

  27. Papastamkou, S. “Bienvenue au Programming Historian en français!” Programming Historian. April 8, 2019, https://programminghistorian.org/posts/bienvenue-ph-fr.

  28. Pitman, T., and Taylor, C. “Where’s the ML in DH? And Where’s the DH in ML? The Relationship between Modern Languages and Digital Humanities, and an Argument for a Critical DHML.” DHQ: Digital Humanities Quarterly 11, no. 1 (2017).

  29. Risam, R. New Digital Worlds: Postcolonial Digital Humanities in Theory, Praxis, and Pedagogy. Evanston, Ill.: Northwestern University Press, 2019.

  30. Rockwell, G., and Sinclair, S. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge, Mass.: MIT Press, 2016.

  31. Sichani, A.-M. “Linguistic Diversity and Ad-Hoc Translation of the Programming Historian’s Lessons.” Programming Historian. November 30, 2018, https://programminghistorian.org/posts/ad-hoc-translation.

  32. Smith, D. A., and R. Cordell. “A Research Agenda for Historical and Multilingual Optical Character Recognition.” 2018, https://ocr.northeastern.edu/report/.

  33. Spence, P. “Centros y fronteras: el panorama internacional.” Janus, Anexo 1, April 11, 2014, https://www.janusdigital.es/anexos/contribucion.htm?id=6.

  34. Turkel, W. J., and A. Crymble. “Contar frecuencias de palabras con Python.” Translated by Victor Gayol, Jairo A. Melo, Maria José Afanador-Llach, and Antonio Rojas Castro. Programming Historian. July 17, 2012, https://programminghistorian.org/es/lecciones/contar-frecuencias.

  35. Walsh, B. “The Programming Historian and Editorial Process in Digital Publishing.” Brandon Walsh (blog). January 15, 2021, http://walshbr.com/blog/the-programming-historian-and-editorial-process-in-digital-publishing/.

  36. Zeman, D., J. Nivre, M. Abrams, E. Ackermann, N. Aepli, Ž. Agić, L. Ahrenberg, et al. “Universal Dependencies 2.6.” LINDAT/CLARIAH-CZ Digital Library. Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University. 2020, http://hdl.handle.net/11234/1-3226.

Royalties from the sale of this book will be donated by the editors to the Ricky Dawkins Jr Memorial Scholarship.

Copyright 2023 by the Regents of the University of Minnesota