Messy Data and Faulty Tools
Joanna Swafford
With our newfound access to unprecedented levels of data, we can ask questions we could not have dreamed of twenty years ago and better answer questions that would previously have taken scholars a lifetime to address: by examining thousands or millions of texts, we can learn about the Great Unread (Cohen, 23), look for changes in periodicals through topic modeling (Nelson), and examine changes in poetic meter over the centuries (Algee-Hewitt et al.). However, unless we focus more on creating quality-control systems for our work, we run the risk of drawing erroneous conclusions based on messy data and faulty tools.
Humanities scholars are used to knowing the details about their data: in literature, for example, we work with poems, essays, plays, and periodicals as well as publishing records, census data, and other number-driven documents. We often gather the information ourselves, so we usually know the origin and quality of our materials. This is not always true for large-scale text analysis projects: HathiTrust, Project Gutenberg, and the Internet Archive have a plethora of works in plain-text format, but the optical character recognition (OCR) behind those texts is often unreliable.1 No individual scholar can read and proofread each text, so the texts we use will have errors, from small typos to missing chapters, which may cause problems in the aggregate.2 Ideally, to address this issue, scholars could create a large, collaboratively edited, open-access collection of plain-text versions of literary works. The Eighteenth-Century Collections Online Text Creation Partnership,3 the Early Modern OCR Project,4 and the Corpus of Historical American English5 have helpfully created repositories of texts, through both manual entry and automated OCR correction, but they still represent a comparatively small portion of all texts online.
In addition to clean data, we also need robust, well-tested tools. Traditional scholarship relies on peer review as a means of quality control, and groups like the Advanced Research Consortium6 do peer review digital archives. Unfortunately, digital humanities does not currently have a system for peer-reviewing tools. Although digital humanities scholars occasionally post their code online, members of our field are still learning to embrace the open-source philosophy of reviewing each other’s code and making suggestions for improvement.7 As a result, scholars either consult the DiRT directory8 or informally recommend tools on Twitter.9 As useful as these systems are, they lack the rigor that peer review should provide. Certainly a peer-review system for tools presents serious challenges: we may not have enough scholars with programming expertise in the field a given tool supports to form a peer-review board; the variety of programming languages and documentation styles people use may also present a problem; and we lack a model for peer-reviewing projects that change over time. Nevertheless, we need to address these challenges if our data and tools are to meet the quality standards necessary to ensure the continued strength of digital humanities research.10
The software package Syuzhet demonstrates this necessity. Syuzhet uses sentiment analysis to graph the emotional trajectory of novels. It was released on GitHub to instant acclaim, both in digital humanities circles and in the popular press (Clancy). Unfortunately, the package incorporated an overly simplified version of sentiment analysis and a poorly chosen signal-processing filter; the latter problem in particular led to distortions that, in extreme cases, could actually invert the results, such that the tool reported emotional highs at a novel’s emotional lows (Swafford). These errors initially escaped the notice of those conducting an informal peer review over Twitter, who accepted the maker’s claims without interrogating the tool’s methodology or code; the errors were acknowledged only after a drawn-out blog exchange, the public nature of which encouraged the tool’s supporters to double down on their claims. If the tool had been peer-reviewed by experts in the fields of sentiment analysis, signal processing, and English literature, these problems would have been caught at an earlier stage, producing a more reliable tool and, ultimately, better scholarship.
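To make the filtering problem concrete, the following Python sketch, which is not Syuzhet’s code, applies an aggressive low-pass (Fourier) filter to an invented sentiment series. The series, the cutoff, and the choice of filter are all assumptions made for illustration; the point is only that heavy-handed smoothing of this general kind can manufacture an apparent emotional high where the raw data contain none.

```python
# Illustrative only: a synthetic sentiment series and a generic low-pass
# Fourier filter, not Syuzhet's actual implementation.
import numpy as np

def lowpass_fourier(values, keep_components=3):
    """Keep only the lowest-frequency Fourier components of a series."""
    spectrum = np.fft.rfft(values)
    spectrum[keep_components:] = 0          # discard the higher frequencies
    return np.fft.irfft(spectrum, n=len(values))

# Hypothetical sentiment trajectory: neutral throughout, with one sharp,
# concentrated emotional low late in the "novel."
sentiment = np.zeros(300)
sentiment[250:280] = -1.0

smoothed = lowpass_fourier(sentiment, keep_components=3)

# The raw series never rises above zero, yet the smoothed curve does:
# the filter has created an emotional "high" that exists nowhere in the text.
print("raw maximum:     ", sentiment.max())           # 0.0
print("smoothed maximum:", round(smoothed.max(), 3))  # a positive value
```

A reviewer with signal-processing expertise would recognize this kind of artifact immediately, which is precisely the sort of error a peer-review process for tools is meant to catch.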
In the interim, we can make our tools for large-scale text analysis more accurate by collaborating with programmers and experts in other fields. For example, when using sentiment analysis, we could work with specialists in natural language processing and in marketing to create a human-annotated corpus of texts, from which we could estimate how an “average” reader evaluates documents and measure how well our tools’ algorithms approximate that “average” reader. Ultimately, then, while conceptions of words and analytical goals may differ among programmers, humanists, and marketing executives, our scholarship would benefit from a closer partnership.
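As a rough illustration of that validation step, the following Python sketch compares a hypothetical tool’s sentiment scores against the average of several human annotators’ ratings using a simple correlation measure. The sentences, the ratings, and the tool scores are all invented for the example.

```python
# Illustrative only: invented sentences, annotator ratings, and tool scores.
from statistics import mean

# Each sentence rated by three hypothetical annotators on a -1 to 1 scale.
human_ratings = {
    "The reunion filled her with quiet joy.":        [0.8, 0.6, 0.9],
    "The letter brought nothing but bad news.":      [-0.7, -0.9, -0.6],
    "He walked to the station and bought a ticket.": [0.0, 0.1, -0.1],
}

# Scores that the tool under review produced for the same sentences.
tool_scores = {
    "The reunion filled her with quiet joy.":        0.5,
    "The letter brought nothing but bad news.":      -0.8,
    "He walked to the station and bought a ticket.": 0.3,
}

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

gold = [mean(r) for r in human_ratings.values()]      # the "average" reader
predicted = [tool_scores[s] for s in human_ratings]   # the tool's readings

print("agreement with the average reader:", round(pearson(gold, predicted), 3))
```

A tool whose scores correlated poorly with the annotators’ averages on such a test could be revised before publication rather than after.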
Notes
1. For information on the challenges of using OCR to study nineteenth-century newspapers, see Strange et al.
2. Tesseract is the most prominent open-source OCR option, and Ted Underwood has written some enhancements to improve OCR quality of eighteenth-century texts in HathiTrust, but both of these projects, by their own admission, still require human corrections.
7. According to Peter Rigby, peer-reviewing open-source software “involves . . . early, frequent reviews . . . of small, independent, complete contributions . . . that are broadcast to a large group of stakeholders, but only reviewed by a small set of self-selected experts” (26).
9. The Journal of Digital Humanities has a handful of tool reviews, but these reviews, although helpful, do not actually address the code itself.
10. As an added bonus, having a tool peer-reviewed would help programmer-scholars count their project as scholarship rather than service for tenure and promotion.
Bibliography
Algee-Hewitt, Mark, Ryan Heuser, Maria Kraxenberger, J. D. Porter, Jonny Sensenbaugh, and Justin Tackett. “The Stanford Literary Lab Transhistorical Poetry Project Phase II: Metrical Form.” Paper prepared for the Digital Humanities Conference, Lausanne, Switzerland, July 11, 2014. http://dharchive.org/paper/DH2014/Paper-788.xml.
Clancy, Eileen. “A Fabula of Syuzhet: A Contretemps of Digital Humanities (with Tweets).” Storify. Accessed July 17, 2015. https://storify.com/clancynewyork/contretemps-a-syuzhet.
Cohen, Margaret. The Sentimental Education of the Novel. Princeton, N.J.: Princeton University Press, 1999.
Nelson, Robert K. “Mining the Dispatch.” Digital Scholarship Lab, University of Richmond. Accessed April 30, 2015. http://dsl.richmond.edu/dispatch/.
Rigby, Peter. “Peer Review on Open-Source Software Projects: Parameters, Statistical Models, and Theory.” ACM Transactions on Software Engineering and Methodology 23, no. 4 (August 2014): 1–33.
Strange, Carolyn, Daniel McNamara, Josh Wodak, and Ian Wood. “Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers.” Digital Humanities Quarterly 8, no. 1 (2014). http://www.digitalhumanities.org/dhq/vol/8/1/000168/000168.html.
Swafford, Joanna. “Problems with the Syuzhet Package.” Anglophile in Academia: Annie Swafford’s Blog, March 2, 2015. https://annieswafford.wordpress.com/2015/03/02/syuzhet/.