Chapter 30
Computer Science Research and Digital Humanities Questions
Benjamin Charles Germain Lee
Computer science research and pedagogy occupy essential roles in the digital humanities, informing the methodologies and tools utilized by researchers and the digital systems and interfaces built by them. Although the digital humanities are canonically situated within humanities departments in the context of graduate study, I explore the possibilities for the digital humanities as a rich discipline for graduate research within computer science. In this chapter, I also draw on my experiences of pursuing digital humanities research in the context of graduate study as a PhD student in computer science and engineering at the University of Washington, a large research-intensive public university in the United States.
Methodological advances in computer science in their many manifestations continually shape the contours of research within the digital humanities, as well as our understanding of the discipline itself. Consider, for example, the deep learning revolution of the past decade within computer science and how it has reverberated through the landscape of digital humanities research. Deep learning has reframed how digital humanities scholars study modalities as far ranging as text corpora, photograph collections, television shows, audio recordings, and born-digital artifacts (Arnold and Tilton, “Distant Viewing”; Underwood et al., “Transformation of Gender”; Wevers and Smits, “Visual Digital Turn”). Within the libraries, archives, and museums community, optical character recognition has already transformed how scholars interact with digitized textual sources, and applications of machine learning and artificial intelligence continue to show great promise for digital content stewardship and content discovery (LC Labs and Digital Strategy Directorate, “Machine Learning + Libraries”; Padilla, “Responsible Operations”; Lorang et al., “Digital Libraries”; Cordell, “Machine Learning + Libraries”). Likewise, novel affordances from research in human-computer interaction (HCI), data visualization, and human-in-the-loop computing—considered subfields of computer science at my home institution—inform how digital humanities practitioners design and implement projects as far ranging as volunteer crowdsourcing platforms, public humanities exhibits, and interactive visualizations. User testing methodologies from HCI provide roadmaps for iteratively refining these digital humanities systems and interfaces. Lastly, computer science research at the intersection of artificial intelligence and HCI informs how exploratory search interfaces and recommender systems can better support digital content discovery. 
The digital humanities as a discipline is thus in constant conversation with computer science.
How, then, do the digital humanities relate to graduate study in computer science? I posit that grounding computer science graduate study in the digital humanities benefits not only the graduate students but also the fields of scholarship writ large. For graduate students in computer science, the digital humanities present the opportunity to study novel computer science ideas, algorithms, and affordances in practice, as well as to foster interdisciplinary collaborations and critically engage with the ethical implications of their work as demanded by proper digital humanities research. Conversely, graduate study in computer science has the potential to contribute emerging computational methodologies to digital humanities research, thereby widening the possibilities for humanistic inquiry with digital sources in both character and scale.
Let me unpack both of these provocations, beginning with the benefits to graduate students in computer science. Graduate research within machine learning, artificial intelligence, computer vision, natural language processing, human-in-the-loop computing, HCI, and visualization often involves deploying systems with novel algorithms, interfaces, or affordances and studying user activity via in-person and online user evaluations. For example, when studying a new recommendation algorithm or interface for content discovery, a computer science researcher may conduct a user study to answer questions that include the following: Are users able to find more content of interest with the new algorithm or interface? Does the new algorithm or interface lead to increased user satisfaction? Does the user have more control over the new system? To deploy systems with large user bases, graduate students in computer science often partner with for-profit technology companies or test systems with crowd workers using platforms such as Amazon’s Mechanical Turk. Even when deployed with the intent of improving user experience and adhering to the principles of user-centered design, these large-scale deployments come at the expense of enmeshing computer science graduate research within the profit motives, invasive tracking, and exploitative labor practices of surveillance capitalism (Hara et al., “Data-Driven Analysis”; Zuboff, Age of Surveillance Capitalism).
Digital humanities projects have the potential to provide similar benefits for computer science graduate students while freeing the research itself from the profit structures and ethical complications inherent to computer science research in the context of for-profit industry. Indeed, concrete questions surrounding searching, visualizing, and semantifying digital collections, coupled with dedicated user groups, make collaborations with digital humanities practitioners and cultural heritage institutions a fruitful path for computer science graduate research. For example, volunteer crowdsourcing initiatives being launched across the world by digital humanities and cultural heritage practitioners have been overwhelmingly successful, routinely engaging many thousands of volunteers. These initiatives produce significant amounts of metadata, from transcriptions of entire collections to many thousands of bounding box annotations of visual content on historic newspaper pages (Ferriter, “Introducing Beyond Words”).1 However, as Trevor Owens argues, “Far better than being an instrument for generating data that we can use to get our collections more used, [crowdsourcing] is actually the single greatest advancement in getting people using and interacting with our collections” (“Crowdsourcing Cultural Heritage”). For computer science graduate students studying human-in-the-loop machine learning, partnering with these crowdsourcing initiatives thus represents a fantastic opportunity to study computer science questions of interest—from improving crowd workflows to understanding feedback mechanisms between humans and algorithms—while contributing to projects that play essential roles in engaging the public with cultural heritage collections. 
In addition, the datasets derived from these crowdsourcing initiatives provide high-quality sources of ethically collected training and evaluation data for graduate students researching topics as far-ranging as handwriting recognition and speech-to-text conversion.2 For graduate students studying search user interfaces, large-scale digital collections are ideal for studying how diverse user bases with authentic motivations explore and make sense of large volumes of data (Muralidharan, “Designing an Exploratory Text”). For graduate students seeking hands-on experience with project management, the digital humanities have much to offer, as described in chapter 23 in this volume, “Graduate Students and Project Management: A Humanities Perspective” by Meredith Martin, Natalia Ermolaev, and Rebecca Munson. The digital humanities as a field is grappling with its own systemic shortcomings surrounding the datafication of people, neocolonial thought, and the perpetuation of inequality, as well as its dependence on big tech for computing infrastructure (Noble, “Toward a Critical Black Digital Humanities”; Bartley, “Executing the Crisis: The University beyond Austerity,” chapter 4 in this volume). However, graduate research with the digital humanities and cultural heritage is nonetheless better positioned than industry-tied computer science research to foreground the interests of end users and communities.
Indeed, there is already a strong precedent of digital humanities research being carried out by computer scientists at the graduate level and beyond. Consider the work of Aditi Muralidharan, who developed WordSeer, an exploratory text analysis tool with application to literary analysis in collaboration with Marti Hearst (“Designing an Exploratory Text”); Laure Thompson, Xanda Schofield, and David Mimno, who have all advanced research on topic models and latent variable models while focusing on cultural heritage data (Thompson and Mimno, “Authorless Topic Models”; Schofield et al., “Quantifying the Effects”; Mimno and Blei, “Bayesian Checking for Topic Models”); David Bamman, who has pursued a number of empirical questions surrounding natural language processing in relation to the humanities with a specific focus on literature (Bamman et al., “Bayesian Mixed Effects Model”); and David Smith, whose research in natural language processing has spanned a range of directions, from OCR correction to modeling text reuse (Dong and Smith, “Multi-Input Attention”; Smith et al., “Detecting and Modeling”). All of these computer scientists incorporated the digital humanities into their graduate studies, and their highly varied research reveals how capacious this liminal space can be.
Collaborations between computer science graduate students and digital humanities practitioners not only foster interdisciplinary modes of thinking but also present manifold opportunities to marry graduate study in computer science with critical inquiry into the sociotechnical implications of research in the field. How do machine learning systems and classification taxonomies perpetuate racist and colonial oppression (Bowker and Star, Sorting Things Out; Noble, Algorithms of Oppression)? What are the consequences of reducing individuals and their experiences to data points (D’Ignazio and Klein, Data Feminism)? How might visualizations aestheticize and oversimplify difficult histories (Presner, “Ethics of the Algorithm”)? In accordance with the rich tradition of privacy-preserving librarianship, how might computationally informed digital humanities projects illustrate the capacity for large-scale user evaluations that emphasize privacy and autonomy? Because questions such as these are intrinsic to the modes of thinking encouraged by research in the digital humanities, the digital humanities also have the potential to occupy a rich pedagogical role within computer science graduate study, which often includes little or no formal training in ethics, critical data studies, or science and technology studies.
Let me now turn to how graduate study in computer science can benefit the digital humanities. As articulated earlier in this chapter, the methods of humanistic inquiry employed by digital humanities practitioners are inevitably informed by emerging methodologies in computer science. Research questions in the digital humanities are often formulated with a specific computational methodology in mind. An alternative to this paradigm is one in which digital humanities research questions are formulated in conjunction with computer scientists who can offer insight into methodological approaches otherwise inaccessible to those without graduate education in computer science or an adjacent field.3 Computer science graduate study thus presents a capacious opportunity for projecting emerging computer science methodologies back into the digital humanities via collaboration.
In the context of my research as a PhD student in computer science and engineering at the University of Washington, I partnered with the Library of Congress as part of the Library’s Innovator in Residence program to carry out a project that I named Newspaper Navigator. The goal of the project is to reimagine how the American public explores the Chronicling America database of sixteen million pages of digitized historic American newspapers (Lee, “Compounded Mediation”).4 In particular, the Newspaper Navigator project consists of two phases: extracting visual content including photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements using machine learning techniques to produce the Newspaper Navigator dataset and subsequently building a search platform for users to explore the extracted visual content in the dataset (Lee et al., “Newspaper Navigator Dataset”; Lee and Weld, “Newspaper Navigator”). Within the capacity of my dissertation, Newspaper Navigator enabled me to study a range of computer science research questions in machine learning, artificial intelligence, and human-computer interaction, from information extraction to human-AI interaction. Within its capacity as a digital humanities project, Newspaper Navigator facilitates scholarship in the humanities by not only providing new affordances for searching and discovering visual content in historic newspapers but also widening the methodological possibilities for studying the visual content and exploring the sociotechnical implications of applying machine learning to cultural heritage data. Such possibilities include analyzing newspaper reproduction patterns of photographs, studying editorial practices as inferred from newspaper layout structures, and analyzing newspaper titles for which no reliable optical character recognition exists.
Partnering with the Library of Congress afforded me the opportunity to bridge my computer science graduate studies with my interest in the digital humanities and explore interdisciplinary questions as they relate to both computer science and the humanities. Indeed, Newspaper Navigator is only possible due to the collaboration with and input from a range of digital humanities and cultural heritage practitioners, including the LC Labs team at the Library of Congress; the National Digital Newspaper Program; IT design and development personnel at the Library of Congress; and my PhD advisor at the University of Washington, Professor Daniel Weld. Moreover, as a computer science project, Newspaper Navigator would not exist without the rich genealogy of public domain digital humanities projects at the Library of Congress, including not only Chronicling America but also the Beyond Words crowdsourcing initiative, which I utilized as a training dataset for the visual content recognition model.
I offer my personal experience with Newspaper Navigator as one example of how the digital humanities and graduate study in computer science can exist in symbiosis. Indeed, as digital humanities research and pedagogy continue to evolve within the university, computer science is well positioned to play an increasingly central role in graduate study across programs. Foregrounding computer science in this context fosters interdisciplinarity and benefits graduate students not only in humanities departments but also in computer science departments.
Notes
1. For examples of transcription projects, see the Library of Congress’s By the People, Zooniverse, Smithsonian’s Digital Volunteers, and the New York Public Library’s What’s on the Menu? and Emigrant Cities projects.
2. For an example, see the New York Public Library’s Oral History Project and Transcript Editor.
3. An exemplary project that has explored this liminal space for collaboration is the Viral Texts project led by Cordell and Smith at Northeastern University; see Cordell and Smith, Viral Texts.
4. The Chronicling America database is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic American newspapers.
Bibliography
- Arnold, Taylor, and Lauren Tilton. “Distant Viewing: Analyzing Large Visual Corpora.” Digital Scholarship in the Humanities 34, no. 1 (2019): i3–i16. https://doi.org/10.1093/digitalsh/fqz013.
- Bamman, David, Ted Underwood, and Noah Smith. “A Bayesian Mixed Effects Model of Literary Character.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 370–79. Association for Computational Linguistics, 2014. http://acl2014.org/acl2014/P14-1/pdf/P14-1035.pdf.
- Bowker, Geoffrey, and Susan Star. Sorting Things Out: Classification and Its Consequences. Cambridge, Mass.: The MIT Press, 2000.
- Cordell, Ryan. “Machine Learning + Libraries: A Report on the State of the Field.” LC Labs, Library of Congress. July 14, 2020. https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf.
- Cordell, Ryan, and David Smith. “Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines.” Accessed January 12, 2022. http://viraltexts.org.
- D’Ignazio, Catherine, and Lauren Klein. Data Feminism. Cambridge, Mass.: The MIT Press, 2020.
- Dong, Rui, and David Smith. “Multi-Input Attention for Unsupervised OCR Correction.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2363–72. Melbourne: Association for Computational Linguistics, 2018. https://www.aclweb.org/anthology/P18-1220.
- Ferriter, Meghan. “Introducing Beyond Words.” The Signal (blog), September 28, 2017. https://blogs.loc.gov/thesignal/2017/09/introducing-beyond-words/.
- Hara, Kotaro, Abigail Adams, Kristy Milland, Saiph Savage, Chris Callison-Burch, and Jeffrey Bigham. “A Data-Driven Analysis of Workers’ Earnings on Amazon Mechanical Turk.” In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–14. New York: Association for Computing Machinery, 2018. https://doi.org/10.1145/3173574.3174023.
- Jakeway, Eileen, Lauren Algee, Laurie Allen, Meghan Ferriter, Jaime Mears, Abigail Potter, and Kate Zwaard. “Machine Learning + Libraries Summit Event Summary.” LC Labs and Digital Strategy Directorate, Library of Congress. February 13, 2020. https://labs.loc.gov/static/labs/meta/ML-Event-Summary-Final-2020-02-13.pdf.
- Lee, Benjamin. “Compounded Mediation: A Data Archaeology of the Newspaper Navigator Dataset.” Digital Humanities Quarterly 15, no. 4 (2021). http://digitalhumanities.org/dhq/vol/15/4/000578/000578.html.
- Lee, Benjamin, Jaime Mears, Eileen Jakeway, Meghan Ferriter, Chris Adams, Nathan Yarasavage, Deborah Thomas, Kate Zwaard, and Daniel Weld. “The Newspaper Navigator Dataset: Extracting and Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America.” In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, 3055–62. New York: Association for Computing Machinery, 2020. https://doi.org/10.1145/3340531.3412767.
- Lee, Benjamin, and Daniel Weld. “Newspaper Navigator: Open Faceted Search for 1.5 Million Images.” In Adjunct Publication of the 33rd Annual ACM Symposium on User Interface Software and Technology, 120–22. New York: Association for Computing Machinery, 2020. https://dl.acm.org/doi/10.1145/3379350.3416143.
- Lorang, Elizabeth, Leen-Kiat Soh, Yi Liu, and Chulwoo Pack. “Digital Libraries, Intelligent Data Analytics, and Augmented Description: A Demonstration Project.” University of Nebraska-Lincoln Faculty Publications. Digital Commons. January 10, 2020. https://digitalcommons.unl.edu/libraryscience/396/.
- Mimno, David, and David Blei. “Bayesian Checking for Topic Models.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 227–37. New York: Association for Computational Linguistics, 2011. https://dl.acm.org/doi/10.5555/2145432.2145459.
- Muralidharan, Aditi. “Designing an Exploratory Text Analysis Tool for Humanities and Social Sciences Research.” PhD diss., University of California, Berkeley, 2013. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-203.pdf.
- Noble, Safiya. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: New York University Press, 2018.
- Noble, Safiya. “Toward a Critical Black Digital Humanities.” In Debates in the Digital Humanities 2019, edited by Matthew Gold and Lauren Klein. Minneapolis: University of Minnesota Press, 2019. https://doi.org/10.5749/j.ctvg251hk.
- Owens, Trevor. “Crowdsourcing Cultural Heritage: The Objectives Are Upside Down.” March 10, 2012. Accessed July 10, 2020. http://www.trevorowens.org/2012/03/crowdsourcing-cultural-heritage-the-objectives-are-upside-down/.
- Padilla, Thomas. Responsible Operations: Data Science, Machine Learning, and AI in Libraries. Dublin: OCLC Research, 2019. https://doi.org/10.25333/xk7z-9g97.
- Presner, Todd. “The Ethics of the Algorithm: Close and Distant Listening to the Shoah Foundation Visual History Archive.” In Probing the Ethics of Holocaust Culture, edited by Claudio Fogu, Wolf Kansteiner, and Todd Presner, 175–202. Cambridge, Mass.: Harvard University Press, 2016.
- Schofield, Alexandra, Laure Thompson, and David Mimno. “Quantifying the Effects of Text Duplication on Semantic Models.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, edited by Martha Palmer, Rebecca Hwa, and Sebastian Riedel, 2737–47. Stroudsburg, Pa.: Association for Computational Linguistics, 2017. http://dx.doi.org/10.18653/v1/D17-1290.
- Smith, David, Ryan Cordell, Elizabeth Dillon, Nick Stramp, and John Wilkerson. “Detecting and Modeling Local Text Reuse.” In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, 183–92. Piscataway, N.J.: IEEE Press, 2014. https://dl.acm.org/doi/10.5555/2740769.2740800.
- Thompson, Laure, and David Mimno. “Authorless Topic Models: Biasing Models Away from Known Structure.” In Proceedings of the 27th International Conference on Computational Linguistics, edited by Emily M. Bender, Leon Derczynski, and Pierre Isabelle, 3903–14. Santa Fe, N.Mex.: Association for Computational Linguistics, 2018. https://www.aclweb.org/anthology/C18-1329/.
- Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Cultural Analytics 3, no. 2 (2018): 1–25. https://culturalanalytics.org/article/11035.
- Wevers, Melvin, and Thomas Smits. “The Visual Digital Turn: Using Neural Networks to Study Historical Images.” Digital Scholarship in the Humanities 35, no. 1 (2019): 194–207. https://doi.org/10.1093/llc/fqy085.
- Zuboff, Shoshana. The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. New York: PublicAffairs, 2019.