Debates in the Digital Humanities 2016


49

Messy Data and Faulty Tools

Joanna Swafford

With our newfound access to unprecedented levels of data, we can ask questions we could not have dreamed of twenty years ago and better answer questions that would previously have taken scholars a lifetime to address: by examining thousands or millions of texts, we can learn about the Great Unread (Cohen, 23), look for changes in periodicals through topic modeling (Nelson), and examine changes in poetic meter over the centuries (Algee-Hewitt et al.). However, unless we focus more on creating quality-control systems for our work, we run the risk of drawing erroneous conclusions based on messy data and faulty tools.

Humanities scholars are used to knowing the details about their data: in literature, for example, we work with poems, essays, plays, and periodicals, as well as publishing records, census data, and other number-driven documents. We often gather the information ourselves, so we usually know the origin and quality of our materials. This is not always true for large-scale text analysis projects: HathiTrust, Project Gutenberg, and the Internet Archive have a plethora of works in plain-text format, but the quality of the optical character recognition (OCR) can be unreliable.1 No individual scholar can read and proofread each text, so the texts we use will have errors, from small typos to missing chapters, which may cause problems in the aggregate.2 Ideally, to address this issue, scholars could create a large, open-access, collaboratively edited collection of plain-text versions of literary works. The Eighteenth-Century Collections Online Text Creation Partnership,3 the Early Modern OCR Project,4 and the Corpus of Historical American English5 have helpfully created repositories of texts, through both manual entry and automated OCR correction, but these repositories still represent a comparatively small portion of all texts online.
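
One pragmatic stopgap, until such open-access corpora cover more of the record, is to screen texts for likely OCR damage before analyzing them in bulk. The sketch below is a minimal illustration of that idea, not a production tool: the tiny vocabulary, the sample text, and the 0.15 cutoff are all assumptions, and would need to be replaced with a full word list and a threshold tuned against hand-checked samples from the same collection.

```python
import re

def oov_rate(text, vocabulary):
    """Fraction of alphabetic tokens not found in a reference vocabulary.

    A high rate often signals OCR damage (e.g., 'tbe' for 'the'),
    though archaic spellings and proper nouns also inflate it.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)

# Hypothetical usage: in practice 'vocab' would come from a large word
# list, and the 0.15 cutoff is an arbitrary placeholder.
vocab = {"the", "quality", "of", "mercy", "is", "not", "strained"}
sample = "Tbe qualitv of mercv is not str-ained"
if oov_rate(sample, vocab) > 0.15:
    print("flag for manual review or OCR correction")
```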

In addition to clean data, we also need robust, well-tested tools. Traditional scholarship relies on peer review as a means of quality control, and groups like the Advanced Research Consortium6 do peer review digital archives. Unfortunately, digital humanities does not currently have a system for peer-reviewing tools. Although digital humanities scholars occasionally post their code online, members of our field are still learning to embrace the open-source philosophy of reviewing each other’s code and making suggestions for improvement.7 As a result, scholars either consult the DiRT directory8 or informally recommend tools on Twitter.9 As useful as these systems are, they lack the rigor that peer review should provide. Certainly a peer-review system for tools presents serious challenges: we may not have enough scholars with programming expertise in the field a given tool supports to constitute a peer-review board; the variety of programming languages and documentation styles people use may also present a problem; and we lack a model for peer-reviewing projects that continue to change after release. Nevertheless, we must address these challenges if our data and tools are to meet the quality standards necessary to ensure the continued strength of digital humanities research.10

The software package Syuzhet demonstrates this necessity. Syuzhet uses sentiment analysis to graph the emotional trajectory of novels. It was released on GitHub to instant acclaim, both in digital humanities circles and in the popular press (Clancy). Unfortunately, the package incorporated an overly simplified version of sentiment analysis and a poorly chosen signal-processing filter; the latter problem in particular led to distortions that, in extreme cases, could actually invert the results, such that the tool reported emotional highs at a novel’s emotional lows (Swafford). These errors initially escaped the notice of those conducting a more informal peer-review process over Twitter, who accepted the maker’s claims without interrogating the tool’s methodology or code, and the errors were acknowledged only after a drawn-out blog exchange, the public nature of which encouraged the tool’s supporters to double down on their claims. If the tool had been peer-reviewed by experts in the fields of sentiment analysis, signal processing, and English literature, it would have been corrected at an earlier stage, producing a more reliable tool and, ultimately, better scholarship.
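
The filter problem is straightforward to reproduce. The following sketch is a schematic reconstruction of the critique, not Syuzhet’s actual code: it applies a low-pass Fourier filter of the kind at issue to a synthetic sentiment trajectory. Because the Fourier basis treats the signal as periodic, the filtered curve is dragged back toward its starting value, so a novel that climbs steadily to a happy ending is smoothed into an apparently neutral one.

```python
import numpy as np

# Synthetic "sentiment trajectory": a steady climb from an unhappy
# opening to a happy ending.
sentiment = np.linspace(-1.0, 1.0, 200)

# Low-pass filter of the kind criticized in the blog exchange: keep
# only the lowest-frequency Fourier components and zero out the rest.
# (The cutoff of three harmonics is an arbitrary assumption.)
spectrum = np.fft.rfft(sentiment)
spectrum[4:] = 0
smoothed = np.fft.irfft(spectrum, n=len(sentiment))

# The periodicity assumption erases the novel's final emotional high.
print(f"raw ending:      {sentiment[-1]:+.2f}")   # +1.00
print(f"filtered ending: {smoothed[-1]:+.2f}")    # near zero
```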

In the interim, we can make our tools for large-scale text analysis more accurate by collaborating with programmers and experts in other fields. For example, when using sentiment analysis, we could work with specialists in natural language processing and in marketing to create a human-annotated corpus of texts, from which we could estimate how an “average” reader evaluates documents and measure how well our algorithms approximate that “average” reader, thereby verifying that our tools work as intended. Ultimately, then, while conceptions of words and analytical goals may differ among programmers, humanists, and marketing executives, our scholarship would benefit from a closer partnership.
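
As a sketch of what such a collaboration could measure, the snippet below compares a tool’s sentiment scores against averaged human annotations using Pearson correlation. The scores here are invented placeholders: in practice, the human column would come from the annotated corpus described above, and one would also report agreement among the annotators themselves.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical scores on a -1..1 scale: each passage rated by several
# human annotators (averaged) and by the tool under review.
human = [mean(r) for r in [(-0.8, -0.6), (0.1, 0.3), (0.9, 0.7), (-0.2, 0.0)]]
tool = [-0.5, 0.4, 0.6, -0.3]

print(f"agreement with 'average' reader: r = {pearson(human, tool):.2f}")
```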

Notes

1. For information on the challenges of using OCR to study nineteenth-century newspapers, see Strange et al.

2. Tesseract is the most prominent open-source OCR option, and Ted Underwood has written some enhancements to improve OCR quality of eighteenth-century texts in HathiTrust, but both of these projects, by their own admission, still require human corrections.

3. http://www.textcreationpartnership.org/tcp-ecco.

4. http://emop.tamu.edu.

5. http://corpus.byu.edu/coha.

6. http://idhmc.tamu.edu/arcgrant/nodes.

7. According to Peter Rigby, peer-reviewing open-source software “involves . . . early, frequent reviews . . . of small, independent, complete contributions . . . that are broadcast to a large group of stakeholders, but only reviewed by a small set of self-selected experts” (26).

8. http://dirtdirectory.org.

9. The Journal of Digital Humanities has a handful of tool reviews, but these reviews, although helpful, do not actually address the code itself.

10. As an added bonus, having a tool peer-reviewed would help programmer-scholars count their project as scholarship rather than service for tenure and promotion.

Bibliography

Algee-Hewitt, Mark, Ryan Heuser, Maria Kraxenberger, J. D. Porter, Jonny Sensenbaugh, and Justin Tackett. “The Stanford Literary Lab Transhistorical Poetry Project Phase II: Metrical Form.” Paper prepared for the Digital Humanities Conference, Lausanne, Switzerland, July 11, 2014. http://dharchive.org/paper/DH2014/Paper-788.xml.

Clancy, Eileen. “A Fabula of Syuzhet: A Contretemps of Digital Humanities (with Tweets).” Storify. Accessed July 17, 2015. https://storify.com/clancynewyork/contretemps-a-syuzhet.

Cohen, Margaret. The Sentimental Education of the Novel. Princeton, N.J.: Princeton University Press, 1999.

Nelson, Robert K. “Mining the Dispatch.” Digital Scholarship Lab, University of Richmond. Accessed April 30, 2015. http://dsl.richmond.edu/dispatch/.

Rigby, Peter. “Peer Review on Open-Source Software Projects: Parameters, Statistical Models, and Theory.” ACM Transactions on Software Engineering and Methodology 23, no. 4 (August 2014): 1–33.

Strange, Carolyn, Daniel McNamara, Josh Wodak, and Ian Wood. “Mining for the Meanings of a Murder: The Impact of OCR Quality on the Use of Digitized Historical Newspapers.” Digital Humanities Quarterly 8, no. 1 (2014). http://www.digitalhumanities.org/dhq/vol/8/1/000168/000168.html.

Swafford, Joanna. “Problems with the Syuzhet Package.” Anglophile in Academia: Annie Swafford’s Blog, March 2, 2015. https://annieswafford.wordpress.com/2015/03/02/syuzhet/.
