Bringing Languages into the DH Classroom
Quinn Dombrowski
Imagine scraping text from a website that appears perfectly readable in your browser, but your scraping tool transforms the text into a completely different alphabet, rendering it unusable for any computational purpose. Imagine if all the commonly used digital humanities (DH) text analysis methods that are based on some notion of “word counts”—from word clouds to collocation to topic modeling to word vectors—were unusable on your text unless you first transformed it into a grammatically incorrect derivative. What if named-entity recognition did not exist—or worked so poorly for your language that it may as well not exist—meaning you would need to resort to word-list searches or manual labor to identify people or places in large text corpora? For scholars who apply DH methods to languages other than English, some or all of these situations are simply the background to their work: challenges they must persist through, devoting time and labor to developing workarounds, creating new tools, or simply doing things manually, when their English-language colleagues benefit from a rich array of tools, tutorials, and support.
When I started doing DH in 2004, I was a student in a Slavic department. In that program, linguistic diversity was the norm. The department regularly held talks about most of the Slavic languages, as well as multiple non-Slavic languages from the region, such as Albanian, Romani, and Georgian. Looking at slides covered in text you could not read, sometimes in an alphabet you could not read, was normal. My earliest DH projects involved medieval East Slavic manuscripts, a corpus of handwritten documents from northern Russia dating as early as the tenth century, and Bulgarian dialectological atlases. Consequently, my view of how hard or easy it was to do DH was shaped by Slavic languages, and only later did I come to realize how much easier some methods are for English. After spending a decade working in university IT and building general-purpose DH infrastructure (including Project Bamboo, the DiRT tool directory, and the DHCommons project directory), I returned to the intersection of DH and languages in fall 2018 by becoming an academic technology specialist at Stanford University. Mine is a technically oriented DH staff position split between Stanford Libraries and the Division of Literatures, Cultures, and Languages—which includes all languages, cultures, and literatures except for English, East Asian, and classics. A few months after starting in this role, I taught a course, DLCL 204: Digital Humanities across Borders, which was centered on non-English languages. It was an all-consuming experiment, and the first of its kind; while there had been courses taught in U.S. institutions on DH in the context of a particular non-Anglophone region, such as Molly des Jardin’s East Asian Digital Humanities course at the University of Pennsylvania, as well as language courses that had incorporated DH methods, such as a French class at Kansas State University, there did not seem to be a precedent for a course on doing DH in any language students chose to work in.1 The course is also unlikely to be replicable in many other institutions, since excluding monolingual Anglophone students is perceived as being bad for enrollment. Nevertheless, it is worth noting here that while including a requirement that students be able to read a non-English language will exclude some students, it may also have the positive effect of attracting students who feel like the typical set of DH offerings based out of the English Department is not “for them”; the second iteration of the class, held virtually and asynchronously in fall 2020, had twenty-two students.
While I believe there is value in creating spaces that center non-Anglophone DH and that do not require participants to explain or justify how their process diverges from expectations set by English departments, a separate class is not the only way to provide better support for students working in other languages. This chapter summarizes some of the lessons learned in teaching a course that centers non-English languages, with the goal of supporting other instructors in running their courses in a more linguistically inclusive way, in order to better support the needs of students who work in other languages, who may already face systemic challenges compared with their Anglophone-oriented peers.2
Anglophone Assumptions in DH Pedagogy
Almost all hands-on digital humanities pedagogy in the United States is shaped by the assumption that all students will be applying these skills to English-language materials. The default of English is so pervasive that instructors are often not even aware of it. We see this in the naming strategies of workshops and institutes that promote themselves as practical introductions to “text analysis” or “natural-language processing” (NLP), despite the fact that the workflow taught would not work when applied to materials in a language other than English. Instructors put a good deal of thought into developing example datasets for these workshops, picking materials whose nature (e.g., literature, historical records) and structure (e.g., running text, tabular data) are comparable to what the students will encounter in their own data. And yet, language rarely factors into these considerations, with English treated as an unspoken default, even when the language of instruction is not English.3
Among the methods taught in hands-on workshops and general introductions to DH, GUI-based tools like Voyant (https://voyant-tools.org) are commonly used as a starting point for text analysis. These tools provide a quick and easy way for students to explore DH methods ranging from frequency analysis to collocation to topic modeling, without having to learn to write any code. However, out-of-the-box tools that implement methods based on word counts assume that “words” are groups of characters reliably separated by spaces and that a particular “word” does not have many variant forms. These assumptions work well for modern, standardized English, but they quickly break down for languages with more complex morphology (such as most Slavic languages, Finnish, Turkish, and various others), producing messy, unhelpful, or even nonsensical results (for example, if a student unwittingly uploads text not encoded in UTF-8). A limited degree of multilingual support (e.g., Voyant’s stopword lists in multiple languages, or the automatic segmentation for Chinese and Japanese texts) can convey a false sense of security that the tool “works” across languages, when in fact a preprocessing step is needed to get results comparable to English.4 While text analysis is used as the example here, similar issues arise with geographic information systems.5 Before the advent of recent machine learning methods that better support research on images at the pixel level, image-oriented DH work often leaned heavily on text-based metadata as a proxy, which reintroduces language as an issue.6 These concerns are still not entirely avoidable; while some properties used for training machine classification of images can be conveyed purely in numbers (hue, light/dark, etc.), these models are sometimes trained using labeled data, and labeled datasets most often use English for those labels.
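The two assumptions described above can be made concrete with a minimal sketch (using only the Python standard library, with invented example sentences): whitespace splitting works passably for English, treats an entire unspaced Japanese sentence as a single “word,” and scatters the counts of an inflected Russian noun across its variant forms.

```python
from collections import Counter

# English: whitespace splitting yields sensible word counts.
english = "the cat sat on the mat with the other cat"
print(Counter(english.split()).most_common(3))  # "the" appears 3 times

# Japanese: no spaces between words, so the same approach
# returns the entire sentence as one giant "word."
japanese = "猫がマットの上に座った"
print(len(japanese.split()))  # 1 token: the whole sentence

# Russian: the noun "дом" (house) appears in several inflected
# forms, so naive counting splits one word's frequency apart.
russian = "дом дома доме дома"
print(Counter(russian.split()))  # three distinct "words," not one
```

This is the failure mode hidden inside any tool whose notion of “word” is “characters between spaces”: the code runs without error in every case, but the counts it produces are only meaningful for languages that happen to resemble English.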
Longer workshops and courses in DH may expand from out-of-the-box tools to teaching students how to undertake text analysis using code. While teaching students to code can facilitate better preprocessing options and more nuanced analyses than is feasible through prebuilt tools, the overwhelming predominance of English has had a tremendous impact on computer science and computational linguistics. English NLP models are objects of ongoing research activity, with massive resources directed toward improving performance that is already superb by comparison with other languages. As a result, text analysis courses that default to English can draw on algorithms and libraries for working with modern English that are vastly more effective than comparable models for other major world languages, let alone less widely spoken languages. The poor quality or complete absence of usable NLP resources for the vast majority of the world’s languages is a significant impediment to computational research, and while scholars in computer science and DH have taken steps toward addressing these unmet needs, much work remains to be done before they are anywhere near as functional as the English models.7 And yet, these NLP limitations are not the first problem the student of non-English text analysis is likely to encounter in the Anglophone DH classroom. That problem, instead, is the invisibility of language: instructor obliviousness about just how thoroughly steeped in English their curriculum is.
When DH methods are taught by and for scholars who principally live and work in other languages, acknowledging the reality of the disproportionate resources for English comes as no surprise: the digital hegemony of English is unavoidable in daily life, from video games to keyboards to UI design.8 However, for most Anglophone scholars in primarily Anglophone countries, the discrepancy between English and every other language is all but invisible. These scholars may have some sense that some DH tools might not “work” for students who bring text in other languages, but unless they have been exposed to the rich morphological, syntactic, and orthographic variation of the world’s languages (e.g., through studying a language with a non-Latin writing system), it is nearly impossible to move from a general sense of something “not working” to remedying the situation to the extent possible.
Students working on modern English often walk away from pedagogical encounters with DH having acquired tools and skills that they can immediately put to use, and students working on other languages are left struggling, without a sense of whether or how they could achieve results comparable to their peers’. The following is a series of considerations, provocations, and suggestions to help DH instructors better support students who work on languages other than English.
Bringing Languages into Your Classroom
The following recommendations are a mix of the practical and philosophical, with implications both for what is taught in the digital humanities classroom and for how language is included as a factor that merits and even necessitates explicit discussion.
Name Language
Too often, words like literature and text are used in classroom settings to make reference exclusively to English literature and English-language text. In a North American context, any literature or text that is not in English must be marked with a modifier (“Spanish literature,” “classical Chinese text”), but English literature is simply “literature.” This is a problem that reaches beyond the classroom, and even beyond the humanities—the same tendencies in computational linguistics have meant that NLP most commonly means “English-language processing.” In response to this, linguist Emily Bender advises always naming the languages under discussion.9
On one level, naming language is a matter of accuracy. If you are teaching a workshop on named-entity recognition and are working with code or tools that only support English, it is important to state that in the workshop title or description so that students who want to learn that method in order to use it with other languages will not waste their time attending.
On another level, naming languages is about inclusivity. Treating English as an unmarked default, in contrast to every other language, alienates students who do not work on English. If “literature” always means “in English,” it frames literature in any other language as a deviation from the norm. There is nothing wrong with working with English-language material, but omitting “English” when you talk about it contributes to the erasure of the fact that English is itself a language, one of many, with its own peculiarities.
Learn about the Languages Your Students Want to Work With
Digital humanities courses are often interdisciplinary. Because not every department has the interest, resources, or expertise to teach a course specifically on how DH is practiced in that particular field, DH courses are often taught out of large departments like English or history, with the expectation that students from other departments will also attend. DH workshops can bring in students from throughout the humanities and even the social sciences. In these contexts, do not assume that your students are interested in English. Ask upfront (e.g., in a survey at the beginning of the course, or in a workshop registration form) what languages they work on.
Once you know what languages your students work with, you can consult the growing set of resources for doing DH work in non-English languages, such as those available at Multilingual DH (http://multilingualdh.org/). However, supporting your non-Anglophone students involves more than pointing the students to an index of tools for their language and wishing them good luck. Learn the basics of how the language works grammatically and use that to predict potential challenges. Does the language have a writing system that does not put spaces between the words (like Japanese or Chinese), so that the text will need to be segmented? Does the language have complex morphology (like Turkish or Finnish) where words will need to be lemmatized before they can be used for word counting? You can prepare by reviewing common issues you may encounter with different kinds of languages.10
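The lemmatization issue flagged above can be illustrated with a toy sketch (standard library only; the hand-made lookup table below is purely illustrative, not a real lemmatizer—in practice a student would use a morphological analyzer or an NLP library’s lemmatizer for their language). It maps a few inflected forms of the Finnish noun *talo* (“house”) to their shared lemma, showing why word-frequency methods need this step:

```python
from collections import Counter

# Illustrative hand-made lookup: four inflected forms of Finnish
# "talo" (house: in, out of, into the house) mapped to one lemma.
# A real workflow would use a proper lemmatizer, not a dictionary.
lemma_map = {"talo": "talo", "talossa": "talo",
             "talosta": "talo", "taloon": "talo"}

tokens = "talo talossa talosta taloon".split()

# Raw counting sees four unrelated one-off "words"...
raw = Counter(tokens)
# ...while counting lemmas recovers the single underlying word.
lemmatized = Counter(lemma_map.get(tok, tok) for tok in tokens)

print(raw)         # four distinct forms, each with count 1
print(lemmatized)  # Counter({'talo': 4})
```

Without the lemmatization step, a frequency-based method would rank *talo* far below where it belongs, since its occurrences are split across its case forms; for a richly inflected language, a single noun can have a dozen or more such forms.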
If you do not think you can help your students work through these issues yourself, at least make them aware that they may experience challenges applying the methods of the course to their language. If possible, reach out to the broader DH community involved with non-Anglophone DH to see if anyone with relevant linguistic experience is available to consult with your students.
Consider Multilingual Support When Picking Tools
Some methods necessitate language-specific tools. Depending on what you are trying to do and what languages your students are working in, give additional consideration to tools and software libraries that can support your non-English students as well as the ones working on English. For example, if you have students working on Portuguese, you might want to use the Python spaCy library instead of the older Java-based Stanford CoreNLP, which does not support Portuguese. At the same time, make space for students to talk about the quality of the results they get. Do they differ significantly across languages? Support students in responding to the tools they try; you may not be able to judge the result quality, but they often can, and their input is valuable for the next time you run the course or workshop. Having a single tool that can support everyone’s work is ideal, but if the results are abysmal for a particular language, look for an alternative next time.
Get Comfortable Looking at Text You Cannot Read
Supporting non-Anglophone DH involves looking at text in other languages, whether it is configuration settings on your student’s laptop, where the OS is set to use another language, or their research materials. Your student can read it, so treat it as another type of DH collaboration. While not a substitute for literary translation, machine translation tools like Google Translate are perfectly functional for practical uses such as distinguishing navigation, ads, and main content on a webpage you want to scrape.
Be Flexible about What “Success” Looks Like
Particularly for languages beyond the major languages of Western Europe and East Asia, there may be cases where tools and algorithms do not exist for students to complete the same tasks as their counterparts who work on better-resourced languages. Despite your efforts and theirs, they may not be able to successfully lemmatize their text to get a meaningful result from text-frequency methods, or their named-entity recognition assignment may be an exercise in hilarious errors committed by the algorithm. As Shawn Graham notes in “Horses to Water,” students’ anxiety about how they will be graded can often scuttle “creative” assignments—and the same can be true even for more traditional assignments that work well in English, for students who work in other languages.11 Whether or not you take a radically different approach to grading altogether, it is worth being clear with students that “success” does not always mean “getting the right answer” but instead trying something and reflecting on what they have learned.
Multilingual DH Is Collaborative
The work that goes into supporting multiple languages in the classroom is considerable, but you do not have to do it alone. A community of scholars who do DH work in languages other than English is engaged in discussion on Twitter under the hashtag #MultilingualDH. The group runs the aforementioned website with links to resources including tools and tutorials, along with a mailing list open to anyone. Whether you have a question about how to adapt the example data you work with in class to be more linguistically and culturally inclusive, about what resources are available to apply text analysis methods to a particular language, or even simply how to get started, consider reaching out to colleagues who are doing this work. Taking steps toward linguistic inclusivity in the classroom can have a meaningful impact on all your students, regardless of their language of focus.
Notes
1. Des Jardin, “East Asian DH Guides”; Cro and Kearns, “Developing.”
2. For a broader discussion of these issues, see Prado, “Academe’s Shameful.” For a DH-centric view, see Dombrowski, “Stakes.”
3. See, for instance, Schreibman and Bleier, “Text Encoding”; while the course is available in English, French (“L’encodage textuel et la Text Encoding Initiative”), and Hungarian (“Szövegkódolás és a Text Encoding Initiative”), the same hands-on activity presented in all three contexts is encoding “When You Are Old” by W. B. Yeats, in English. Similarly, the French version of the Programming Historian tutorial on stylometry (Laramée, “Introduction”) includes an extensive historical note explaining the context of the Federalist Papers, which is used as the example in both English and French.
4. Dombrowski, “Preparing Non-English Texts.”
5. McDonough, Moncla, and van de Camp, “Named Entity Recognition.”
6. Arnold and Tilton, “Distant Viewing.”
7. See, for instance, Fishkin, Jenn, and Fraisse, Rosetta, an attempt to develop a parallel corpus of literary translations; Center for Digital Humanities, New Languages, for examples of DH scholars improving NLP models; and Walsh, Introduction, for examples of how to incorporate the NLP models for multiple languages into a digital textbook.
8. Gibson, “Thinking in ⅃TЯ.”
9. For further discussion, see Dombrowski and Burns, “Language Is Not.”
10. Dombrowski, “Preparing Non-English Texts.”
11. Graham, “Horses to Water.”
Bibliography
- Arnold, Taylor, and Lauren Tilton. “Distant Viewing: Analyzing Large Visual Corpora.” Digital Scholarship in the Humanities 34, Supplement 1 (December 2019): i3–i16. https://doi.org/10.1093/llc/fqz013.
- Center for Digital Humanities. New Languages for NLP: Building Linguistic Diversity in the Digital Humanities. Princeton University. https://newnlp.princeton.edu.
- Cro, Melinda A., and Sara K. Kearns. “Developing a Process-Oriented, Inclusive Pedagogy: At the Intersection of Digital Humanities, Second Language Acquisition, and New Literacies.” Digital Humanities Quarterly 14, no. 1 (2020). http://digitalhumanities.org/dhq/.
- Des Jardin, Molly. “East Asian DH Guides.” https://mollydesjardin.com/eadh.
- Dombrowski, Quinn. “Preparing Non-English Texts for Computational Analysis.” Modern Languages Open 1 (2020). https://doi.org/10.3828/mlo.v0i0.294.
- Dombrowski, Quinn. “The Stakes of Multilingual DH in the United States.” Quinn Dombrowski (blog), March 13, 2020. https://quinndombrowski.com/blog/2020/03/13/stakes-multilingual-dh-united-states.
- Dombrowski, Quinn, and Patrick Burns. “Language Is Not a Default Setting.” In Debates in the Digital Humanities 2023, edited by Matthew K. Gold and Lauren F. Klein, 295–304. Minneapolis: University of Minnesota Press, 2023. https://dhdebates.gc.cuny.edu.
- Fishkin, Shelley Fisher, Ronald Jenn, and Amel Fraisse. Rosetta: Resources for Endangered Languages through Translated Texts. 2018–19. https://francestanford.stanford.edu.
- Gibson, Nathan P. “Thinking in ⅃TЯ: Reorienting the Directional Assumptions of Global Digital Scholarship.” Paper presented at the Right2Left Workshop, Victoria, Canada, June 2019. https://tinyurl.com/gibson190608.
- Graham, Shawn. “Horses to Water.” In Failing Gloriously and Other Essays, 87–88. Grand Forks, N.Dak.: Digital Press at the University of North Dakota, 2019. https://doi.org/10.31356/dpb015.
- Laramée, François Dominic. “Introduction à la stylométrie en Python.” Programming Historian. 2018, 2019. https://doi.org/10.46430/phfr0003.
- McDonough, Katherine, Ludovic Moncla, and Matje van de Camp. “Named Entity Recognition Goes to Old Regime France: Geographic Text Analysis for Early Modern French Corpora.” International Journal of Geographical Information Science 33, no. 12 (2019): 2498–522. https://doi.org/10.1080/13658816.2019.1620235.
- Prado, Ignacio Sánchez. “Academe’s Shameful Neglect of Spanish.” Chronicle of Higher Education Review, March 13, 2020. https://www.chronicle.com/article/academe-s-shameful-neglect/248236.
- Schreibman, Susan, and Roman Bleier. “Text Encoding and the TEI Initiative.” #dariahTeach, March 21, 2017. https://teach.dariah.eu/mod/page/view.php?id=197.
- Walsh, Melanie. Introduction to Cultural Analytics and Python. 2021. https://melaniewalsh.github.io/Intro-Cultural-Analytics.