N + 1: A Plea for Cross-Domain Data in the Digital Humanities
Alan Liu
“Subtract the unique from the multiplicity,” Deleuze and Guattari write. That is one of their programs for remaking modernity’s repressive psychic and political-economic state into rhizomes and bodies-without-organs. Emancipate ourselves from totalizing forces, in other words, and see what happens. Their algorithm for that: “write at n – 1 dimensions” (6).
“Forms are the abstract of social relationships,” Franco Moretti writes in “Conjectures on World Literature,” introducing his program of distant reading (66). Later, in Graphs, Maps, Trees, he adds, “form [is revealed] as a diagram of forces; or perhaps, even, as nothing but force” (64). Expose the historical forces underlying the shapes of sociocultural experience, in other words, and conjecture what else might have happened. Moretti’s algorithm for that is forms (plural) or, put another way, n + 1.
Though opposite in expression, Moretti’s n + 1 abstract diagram and Deleuze and Guattari’s n – 1 abstract machine (another of their neologisms) compile to the same criticism of the modern state of being. Both are in form and at root a critique—including but not limited to socioeconomic and political critique—of the notion of one form stemming from one root. Both are analyses of modernity’s paradoxical power at once to compress people into apparently unitary forms (the state, the masses, the market, the zeitgeist) and to reveal—through the very transformations of modernity—the radical principle of the multiplicity of forms (Marx: modes of production; Bakhtin: heteroglossia; Williams: structures of feeling; Foucault: epistemes; Derrida: différance; Barthes: galaxy of signifiers; Cixous: woman’s body; McGann: radiant textuality; Schumpeter: creative destruction).
Among the forms at stake are discursive ones of the kind that Moretti and the Stanford Literary Lab so brilliantly study. A compact example is the “Maps” chapter in Moretti’s Graphs, Maps, Trees, which shows how the modern state distorted the form of spatial imagination in nineteenth-century “village stories” from circular, local idylls into ever more attenuated, transected geonetworks reoriented toward state power over the horizon. Moretti’s point is not just to add more points on the map. It is to discern the plural, and differential, patterns bracketing the epistemological before and after of historical change. Before manifests in patterns knowable in stabilized forms of human expression, especially middle-level forms such as discursive genres, narrative plots, and geospatial maps mediating between macro-social and individual experience. After emerges in new patterns whose understanding requires new forms. At the moving terminator between epistemological day and night hides the grail of Moretti’s distant-reading quest: the social, political, economic, and cultural forces that compel the churn of forms.
In the digital humanities, n – 1 is sometimes called deformance (in new media studies, glitch): a deliberate subtraction in unity of form designed to surface differences and rifts of force in known forms.
N + 1, by contrast, is scale, big data, etc. Typically, big data is understood to mean addition in units of form—for example, more documents or images in a corpus. Tacitly, though, incrementing units is acknowledged to corrode unity of form. Big data is messy, requiring cleaning, scrubbing, wrangling, munging, etc. In other words, it’s glitchy. At a low level, noise from problems in OCR, character encoding, formatting, and extraneous material is often uninteresting. But the picture is different at the higher level of the analytics brought to bear on big data, which draw on the mathematics of informational “noise” (stochastic process, statistical variance, and probability theory). It is thus profoundly interesting what human observers find meaningful versus noisy, for example, in topic models, cluster analyses, data visualizations, and so on. Some patterns of noisy interference between correlation and difference (e.g., a topic in a topic model that only partly seems to make sense) come to attention as dramatic, meaning that we feel there is some humanly significant force behind the interference.
N – 1 and n + 1 are thus a circuit: n ± 1. Really, anything to escape regimes of unity and their “monotonous nights” (one of Foucault’s favorite zingers).1
Let me focus here on n + 1, which in my view is not yet robust enough in the digital humanities. How big is digital humanities data anyway, where the crucial measure is not just terabytes or unit-items but, to begin with, facets (as in the information-studies sense of “faceted search”)?2 Let’s count:
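To make the counting literal, here is a minimal sketch of what measuring a corpus by facets (rather than by unit-items) might look like. The items, facet names, and values are hypothetical; the point is only that “size” in this sense is the number of independent dimensions along which the materials vary.

```python
# Illustrative sketch: counting a corpus's facets, not its unit-items.
# Items and facet labels are invented for this example.
from collections import defaultdict

corpus = [
    {"genre": "novel",  "period": "1850s", "nation": "UK"},
    {"genre": "novel",  "period": "1860s", "nation": "UK"},
    {"genre": "sermon", "period": "1850s", "nation": "US"},
]

# Collect the distinct values observed along each facet.
facet_values = defaultdict(set)
for item in corpus:
    for facet, value in item.items():
        facet_values[facet].add(value)

print(f"{len(facet_values)} facets")  # genre, period, nation
for facet, values in sorted(facet_values.items()):
    print(facet, sorted(values))
```

A corpus of three items can thus already be three-faceted, while a corpus of a million items of one kind remains, in the sense used here, unifacial.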
One: My guess is that among branches of the digital humanities, unifacial corpora (e.g., collections of just one kind of material such as novels) occur most regularly in digital literary studies.3 After the great formalist intervention of the twentieth century, literary scholars still largely focus on the differentia specifica of the authors, works, genres (in Russian Formalism, also system of genres), styles, movements, and so on of literature in its modern rather than historically more catholic sense. Even when adventuring outside literature, digital literary scholars often like single-malt brews—for instance, a corpus just of specific kinds of journalistic, historical, popular culture, and other sources.
Two or three: Unifacial corpora, of course, are a Platonic ideal. In reality, two or three other facets are needed. For example, a corpus can be a genre of documents (first facet) in a chronological period (second facet) and nation (third facet). However, added facets are often unifacial in spirit because, whether due to focus or missing data, they narrow inquiry—just to the intersection of a genre, century, and nation, for example.
Further facets could be instanced, but they are not typical. Digital history has a broader remit because its materials are often (though they need not be) more diverse in genre and provenance. Consider, for instance, the amplitude of The Old Bailey Proceedings Online (Hitchcock et al.), whose corpus includes the Old Bailey Proceedings from 1674 to 1913, the Ordinary of Newgate’s Accounts, advertisements in the Proceedings (brought forward for attention rather than filtered out), and guides and links to “Associated Records” (one descriptive section of which ventures into the literary and artistic realm of “Novels and Satirical Prints”).4 Of course, projects like The Old Bailey are relatively big, well-funded, or old enough to have extended their coverage over the years. But there is no necessary relation between being big-$-old and multifaceted. A solo digital humanist with little or no funding, for instance, could assemble a corpus of a few hundred literary works plus a few hundred sermons, ballads, political speeches, legislative acts, or newspaper articles from the same era. The crucial multiplier is not big-$-old, but number of facets.5
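The low-cost multifaceted corpus imagined above can be sketched in a few lines. This is a hypothetical illustration (the texts and facet labels are stand-ins): a solo scholar combines a handful of literary works with contemporaneous sermons, tagging each document with its source domain so that later analyses can compare across that facet.

```python
# Hypothetical sketch of a solo-humanist multifaceted corpus: literary works
# plus sermons from the same era, tagged by domain. Texts are placeholders.
novels = ["It was the best of times ...", "Call me Ishmael ..."]
sermons = ["Brethren, consider the lilies ...", "Repent, for the hour ..."]

corpus = (
    [{"text": t, "domain": "novel",  "period": "19c"} for t in novels] +
    [{"text": t, "domain": "sermon", "period": "19c"} for t in sermons]
)

# Group by the domain facet so cross-domain comparison becomes possible.
by_domain = {}
for doc in corpus:
    by_domain.setdefault(doc["domain"], []).append(doc)

print({d: len(docs) for d, docs in by_domain.items()})
```

Nothing here requires funding or infrastructure; the multiplier is the second facet itself.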
In this regard, it can be predicted that an important next-generation step for the digital humanities will be to scrape and aggregate content from the many smaller digital humanities projects whose mixed-media or multifaceted materials were conceived first of all as collections, editions, exhibitions, teaching materials, single-author or single-artist corpora, timelines, Neatline maps, story maps, and so on. Beyond their original missions, these can be tributaries into large-scale, multifaceted corpora designed for text analysis, visual analysis, social-network and prosopographical study, and other data mining at scale. An example is the Social Networks and Archival Contexts project (SNAC), which repurposes archival and other collection finding aids to release their prosopographical data for multifaceted views of relations between collections, persons, families, organizations, and so on. Or consider what might be done in the future with the proliferating story maps created in such platforms as ArcGIS, Odyssey.js, and StoryMapJS by instructors and students (and professionals in other fields such as journalism). The mixed texts, images, links, and so on in story maps might be scraped and aggregated so that their original narrative form is opened up to big-data analytics and other follow-on forms of understanding—for example, cluster analyses and topic models that, as in Benjamin M. Schmidt’s innovative work on historical ship logs, include map coordinates.
The goal of multiplying facets in digital humanities corpora, whether by designing projects for the purpose or aggregating existing projects, is to bring such corpora (not the same as corpus linguistics corpora6) closer to what may well be their implicit, if rarely attainable, n + 1 ideal: archives. Pushing back against the loose appropriation of the term, research archivists have recently reminded digital humanists what the concept actually means. In an article in Journal of Digital Humanities, for example, Kate Theimer quotes the first definition of archives endorsed by the Society of American Archivists:
Materials created or received by a person, family, or organization, public or private, in the conduct of their affairs and preserved because of the enduring value contained in the information they contain or as evidence of the functions and responsibilities of their creator, especially those materials maintained using the principles of provenance, original order, and collective control.
Then, repudiating the premise of many putative archives in the digital humanities, she comments: “There is nothing in this meaning of ‘archives’ that references a selection activity on the part of the archivist” (Theimer). In other words, while archives are indeed rigorous about the materials they ingest over their “archival threshold” (as Luciana Duranti calls it, with allusion to the architecture of the Roman Tabularium), their selection rigor is not keyed to purity of materials for a researcher’s purpose. Archives are instead often n + 1 mixes of materials witnessing the identity (and other needs) of the persons, organizations, cities, states, etc. that are their raison d’être.
Sparseness of facets, however, only gets us to the threshold of the more general lack in the digital humanities. That lack is data domains, a term I use here to mean the ontological, epistemological, formal, and social-political-economic provenances—put more generally, contexts—in which datasets arise no matter how richly or poorly faceted. Most digital humanities projects do not work with enough such domains at any one time. That means that they do not functionally encounter enough exogamous otherness. After all, corpora from exogamous domains challenge scholars’ assumed or trained “feel” for their data, including not just for their dataset’s explicit metadata but, more challenging, for the implicit assumptions governing how their dataset was originally constituted—what was not collected, what was overrepresented, what was ingested only as ephemera in bins labeled something like “materials from the Seattle 1962 World’s Fair,” etc. On this point, Noortje Marres and Esther Weltevrede’s “Scraping the Social?” is insightful. There is never any scraping of unstructured data, they observe. (As Lisa Gitelman puts it in the title of her edited volume, “‘raw data’ is an oxymoron.”) Instead, there is only the scraping of “already formatted data” whose pre-encoded structures, even when seemingly unstructured, inject “‘alien’ assumptions” (Marres and Weltevrede, 315).
For me, alien assumptions are what it’s all about. That is the big promise of big data. I think playing it safe in endogamous domains will always get the digital humanities just to the point of producing interesting demos. Few game-changing discoveries will arise to satisfy the Idiot Questioners, as William Blake might have called them (Milton a Poem, object 43),7 who keep asking doggedly: do the digital humanities really discover anything new? That is a rigged question because it is implicitly completed by a phrase such as “about literature” or “about history.” Domains of knowledge such as literature and history are pretuned to the predigital scholarly methods that arose over the past two-plus centuries (roughly since Blake’s time, in fact) to define those domains as objects of inquiry. In such a feedback loop, other kinds of questions premised on other definitions of the object of inquiry are ruled out of scope—elided, marginalized, or at best, as in the case of the New Historicism’s attempt at cross-domain inquiry without adequate technical means, “anecdotalized.”8
But the etymological root of demo—monstrāre (to show, point out)—forks off rhizomatically to suggest something else corpora can demonstrate: monsters. The monsters of the digital humanities are the alien assumptions living in the cracks between familiar and unfamiliar domains of big data. A data domain is like a house in a gothic novel or film. It is haunted by something as near, yet far, as the attic. Who lives in the attic? Messy archives of ancestors, yesteryears, and other ghostly revenants of what today’s domains try to unknow. Really, the digital humanities should go bump in the night. The field is currently not scary enough.
N + 1 should be the equation for transacting between a scholar’s familiar domain and scary alien ontologies, epistemologies, forms, sociologies, economics, politics, and (perhaps scariest for humanists) sciences, including information science. Such a transaction would not just demonstrate but “operationalize” the messiness of humanities data as something more than accident or inconvenience (borrowing the powerful “operationalizing” concept from Moretti). After all, data mess witnesses fundamental collisions of logic and sociologic occurring in the background—for example, between different media, classificatory ontologies, and social-political-economic views of what should be recorded, how, and with what openness to other communities. Messiness is thus too valuable just to be “cleaned” from the corpus (scrubbed or filtered) in the interest of pure gene lines of genres, periods, nations, and languages.
So here is my program for the digital humanities. I hope digital humanists can build more experimental, cross-domain corpora designed on purpose to be other than tidy. Digital humanists should make corpora that mix disciplines, provenances, formats, metadata structures, and so on to remix the evidence of the human.
This need not mean a “culturomics” surfacing apparent trends from googols of documents with uncontrolled metadata and provenances.9 It can, and should, mean highly controlled experiments with precisely defined cross-domain corpora—for example, novels plus just one or two of the following at a time gathered from controlled provenances: newspapers, advertisements, song lyrics, legal papers, political speeches, architectural documents, etc. A fascinating attempt at something like such an experiment is Tim Hitchcock’s Voices of Authority project, using materials from the Old Bailey Proceedings mixed with other sources (Hitchcock).
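One simple form such a controlled experiment could take is a comparison of relative word frequencies across two precisely defined corpora. The sketch below is my own illustration, not a description of the Voices of Authority method: the toy “novel” and “newspaper” texts, the add-one smoothing, and the log-ratio measure are all assumptions, chosen only to show how domain-marking vocabulary might be surfaced.

```python
# Hedged sketch of a controlled cross-domain comparison: rank words by a
# smoothed log-ratio of their relative frequencies in two tiny corpora.
# Texts and smoothing scheme are illustrative assumptions.
from collections import Counter
import math

novels = "the heart of the young heroine trembled with love and sorrow".split()
news = "the market report states that prices of grain rose sharply".split()

c_nov, c_news = Counter(novels), Counter(news)
vocab = set(c_nov) | set(c_news)

def log_ratio(word):
    # Add-one smoothed relative frequency in each domain, then the log ratio:
    # positive values mark the novel domain, negative the newspaper domain.
    p = (c_nov[word] + 1) / (sum(c_nov.values()) + len(vocab))
    q = (c_news[word] + 1) / (sum(c_news.values()) + len(vocab))
    return math.log(p / q)

ranked = sorted(vocab, key=log_ratio, reverse=True)
print("most novel-marked:", ranked[:3])
print("most news-marked:", ranked[-3:])
```

Scaled up to a few hundred documents per domain from controlled provenances, even so crude a measure begins to expose where the two domains’ vocabularies collide or interpenetrate.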
Exploring cross-domains of data in this way would advance the digital humanities both technically and on a broader methodological front. Technically, cross-domain corpora pose intriguing next-frontier challenges. For instance, how do we use computation to negotiate between the metadata ontologies of one knowledge domain and another along the lines of the “fluid ontologies” theorized by Ramesh Srinivasan and his collaborators? Their work focuses on combined ethnographical and computational ways to bridge the knowledge protocols of indigenous peoples and the museums that collect their artifacts (Srinivasan and Huang; Srinivasan, Pepe, and Rodriguez; Srinivasan et al.).
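At its most rudimentary, negotiating between two metadata ontologies is a crosswalk problem. The following sketch is a crude, hypothetical stand-in for the richer “fluid ontologies” Srinivasan and his collaborators theorize: the field names, values, and mapping are all invented for illustration, and a real negotiation would of course be ethnographic and computational, not a dictionary lookup.

```python
# Illustrative crosswalk between two hypothetical metadata schemas, a crude
# stand-in for the "fluid ontologies" problem. All fields and values are
# invented for this sketch.
museum_record = {
    "object_name": "water jar",
    "maker_culture": "Zuni",
    "date_made": "c. 1900",
}

# Museum-side field -> community-facing field (a hypothetical mapping).
crosswalk = {
    "object_name": "item",
    "maker_culture": "community",
    "date_made": "when",
}

community_view = {crosswalk[k]: v for k, v in museum_record.items() if k in crosswalk}
print(community_view)
```

The interesting research problem begins exactly where this sketch fails: when the two schemas do not map one-to-one, and the residue on either side encodes a genuinely different way of knowing.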
But it is the broader methodological (and cultural-critical) challenge of cross-domain corpora—motivating Srinivasan and his collaborators as well—that is the research sweet spot. Sorting through facets of genre, period, and nation in a single domain, digital humanists hunt for the unknown in such evidence as frequencies of collocated words that reveal collective, larger, and to some degree, alien mentalities. Such mentalities are like what Timothy Morton calls hyperobjects exceeding the established interpretive lens frames of the author, work, generation, movement, etc. We think our diagrams, topic models, social network graphs, maps, and so on are the grainy photographs of those great aliens among us—a scholarly version of a Yeti or UFO or, perhaps better, an ant’s guess at a “human” walking on its nest.
But that is just a 2D silhouette or at best 3D maquette. What if we could collaborate across fields to create other views from multiple disciplinary, generic, period, national, and other domain angles—like an MRI resonating not just in the single torpedo tube down which a patient today is inserted for data slicing, but in some fantastic multi-axial scanning array showing the human body, the literal corpus, from multiple simultaneous unexpected angles?10 And what if those multi-angled slices revealed in composite the true many-splendored profile of the alien: the forms of ourselves caught in, but shaking against, the nets of modernity, neoliberalism, and everything else trying to aggregate capital out of n = 1 profiles of our selves, our precious n ± 1 human selves?
What if the digital humanities did text analysis, which would also be cultural criticism, like that?
Notes
My thanks to Lisa Marie Rhody for her astute, extremely helpful open peer review of this essay. I have added ideas and examples, and altered others, in response to her suggestions.
1. For example, “Not so long ago, [madness] had floundered about in broad daylight. . . . But in less than a half-century, it had been sequestered and, in the fortress of confinement, bound to Reason, to the rules of morality and to their monotonous nights” (Foucault, Madness, 64); and “But twilight soon fell upon this bright day, followed by the monotonous nights of the Victorian bourgeoisie” (Foucault, History of Sexuality, 3).
2. “Faceted classification decomposes compound subjects into foci in component facets, offering expressive power and flexibility through the independence of the facets” (Tunkelang, 9).
3. Some points about my nomenclature: digital humanities here means primarily digital work created by or in relation to fields traditionally recognized as “the humanities,” especially such older disciplines as history, literature, classics, the languages, corpus linguistics, etc. that were associated with so-called humanities computing before it evolved and widened into the now self-aware and professionalized field of “digital humanities.” I thus do not encompass work in “new media studies and arts” or digital work in the social sciences and information science. The conversation is ongoing about how the digital humanities narrowly defined should engage more fully with other areas of the digital research space. After all, there are overlaps of interests, people, and projects among all these areas even if their convergence has not yet been realized at the programmatic and institutional level. But my intervention in this essay is addressed to the digital humanities in its present, institutionally recognized compass.
In regard to corpus and corpora: I use these terms in the vernacular digital-humanities sense of a focused collection of materials—for example, nineteenth-century British novels, historical newspaper articles, etc.—assembled for the study of a particular subject area using digital methods. Except as indicated in note 6, I thus do not mean corpora created in the corpus linguistics field—for example, samples of texts from a nation and era chosen to be broadly representative of language usage (e.g., Davies, Corpus of Historical American English).
4. I omit here consideration of digital work in near-humanities fields such as archaeology, which may be multifaceted by definition because the organizing principle for data collection and analysis (for example) is what is found at a site as opposed to any isolated kind of artifact or document.
5. Katie Trumpener’s criticism of the digital humanities for lacking a robust comparative literature viewpoint and Domenico Fiormonte’s parallel criticism of the field’s Anglophone-centrism suggest that even the minimal threshold of a cross-national or cross-language n + 1 is difficult to achieve for a variety of reasons.
6. I focus here on the contrast between digital humanities corpora and archives. But there is at least one other important contrast to be considered: between digital humanities corpora and corpora created for corpus linguistics research. Putting corpus linguistics in the picture opens up a fascinating triangular conceptual space. Corpora in corpus linguistics (e.g., Davies, Corpus of Historical American English) are unlike the usual digital humanities corpora; they include more than a particular author, genre, etc. But they are also unlike archives; they select representative language from multiple, mixed provenances. Nevertheless, like digital humanities corpora, they do focus on facets (e.g., linguistic usage of a particular nation in a specific period). And, like archives, they do mix materials, albeit for the nonarchival purpose of representing broad linguistic usage. The triangulation between the digital humanities, corpus linguistics, and archival notions of corpus seems not well understood at present and may be a fertile zone for future theoretical and practical research.
7. See Blake, http://www.blakearchive.org/exist/blake/archive/object.xq?objectid=milton.a.illbk.43&java=no.
8. My mention of the New Historicism’s method of “anecdotes” is not pejorative. See my Local Transcendence for fuller discussion, and appreciation, of anecdotes and the New Historicism. Having migrated from the New Historicism in my early career to the digital humanities now, I am highly aware of convergences and divergences between the approaches. The New Historicist anecdote is in some ways the missing link between the New Critical verbal icon and the digital humanist model (e.g., a diagram or graph). It was close reading’s attempt to do distant reading.
9. See Reece Samuels for a cautionary critique of using Google Ngram Viewer to study “culturomics.” Samuels’s post includes summaries of earlier critiques of the metadata problems and other issues afflicting the Google Books corpora underlying the Ngram Viewer. For the concept of culturomics as it was introduced by the researchers behind the Google Ngram Viewer, see Michel et al.
10. While the usual perception of MRI (magnetic resonance imaging) is that it requires human bodies to be constrained for scanning in a single-axis tube and creates image slices on one plane at a time, in fact MRI technology is capable of simultaneous multiplane (“multi-angle oblique”) imaging and also Position Imaging™ (scanning in several bodily positions and configurations, including standing). The Fonar company, which originated commercial MRI, holds the patents for these techniques. See Fonar, “Our History.”
Bibliography
Blake, William. Milton a Poem. Copy A (c. 1811). The William Blake Archive, ed. Morris Eaves, Robert Essick, and Joseph Viscomi. http://www.blakearchive.org/.
Davies, Mark. The Corpus of Historical American English: 400 million words, 1810–2009. Brigham Young University, 2010. http://corpus.byu.edu/coha/.
Deleuze, Gilles, and Félix Guattari. A Thousand Plateaus: Capitalism and Schizophrenia. Translated by Brian Massumi. Minneapolis: University of Minnesota Press, 1987.
Duranti, Luciana. “Archives as a Place.” Archives and Manuscripts 24, no. 2 (1996): 242–55.
Fiormonte, Domenico. “Towards a Cultural Critique of the Digital Humanities.” Historical Social Research 37, no. 3 (2012): 59–76.
Fonar, Inc. “Our History.” n.d. http://fonar.com/history.htm.
Foucault, Michel. The History of Sexuality—Volume 1: An Introduction. Translated by Robert Hurley. New York: Vintage, 1980.
—. Madness and Civilization: A History of Insanity in the Age of Reason. Translated by Richard Howard. New York: Vintage, 1965.
Gitelman, Lisa, ed. “Raw Data” Is an Oxymoron. Cambridge, Mass.: MIT Press, 2013.
Hitchcock, Tim. “Voices of Authority: Towards a History from Below in Patchwork.” Historyonics (blog), April 27, 2015. http://historyonics.blogspot.com/2015/04/voices-of-authority-towards-history.html.
Hitchcock, Tim, Robert Shoemaker, Clive Emsley, Sharon Howard, and Jamie McLaughlin et al. The Old Bailey Proceedings Online, 1674–1913. Version 7.2, March 2015. http://www.oldbaileyonline.org/.
Liu, Alan. Local Transcendence: Essays on Postmodern Historicism and the Database. Chicago: University of Chicago Press, 2008.
Marres, Noortje, and Esther Weltevrede. “Scraping the Social? Issues in Live Social Research.” Journal of Cultural Economy 6, no. 3 (2013): 313–35. http://research.gold.ac.uk/6768/.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331, no. 6014 (2011): 176–82.
Moretti, Franco. “Conjectures on World Literature.” New Left Review 1 (2000): 54–68.
—. Graphs, Maps, Trees: Abstract Models for a Literary History. London; New York: Verso, 2005.
—. “‘Operationalizing’: or, the Function of Measurement in Modern Literary Theory.” Stanford Literary Lab, Pamphlet 6, December 2013. http://litlab.stanford.edu/LiteraryLabPamphlet6.pdf.
Morton, Timothy. Hyperobjects: Philosophy and Ecology after the End of the World. Minneapolis: University of Minnesota Press, 2013.
Samuels, Reece. “A Critique of Google NGram Viewer.” Reece_Digital_History (blog), March 4, 2014. https://reecesamuels7.wordpress.com/2014/03/04/a-critique-of-google-ngram-viewer/.
Schmidt, Benjamin M. “When You Have a MALLET, Everything Looks Like a Nail.” Sapping Attention (blog), November 2, 2012. http://sappingattention.blogspot.com/2012/11/when-you-have-mallet-everything-looks.html.
Social Networks and Archival Context Project (SNAC). Institute for Advanced Technology in the Humanities. University of Virginia, 2013. http://socialarchive.iath.virginia.edu/.
Srinivasan, Ramesh, Katherine M. Becvar, Robin Boast, and Jim Enote. “Diverse Knowledges and Contact Zones within the Digital Museum.” Science, Technology, and Human Values 35, no. 5 (2010): 735–68.
Srinivasan, Ramesh, and Jeffrey Huang. “Fluid Ontologies for Digital Museums.” International Journal on Digital Libraries 5, no. 3 (2005): 193–204.
Srinivasan, Ramesh, Alberto Pepe, and Marko A. Rodriguez. “A Clustering-based Semi-automated Technique to Build Cultural Ontologies.” Journal of the American Society of Information Science and Technology 60, no. 3 (2009): 608–20.
Stanford Literary Lab. Stanford University. http://litlab.stanford.edu/.
Theimer, Kate. “Archives in Context and as Context.” Journal of Digital Humanities 1, no. 2 (Spring 2012). http://journalofdigitalhumanities.org/1%E2%80%932/archives-in-context-and-as-context-by-kate-theimer/.
Trumpener, Katie. “Paratext and Genre System: A Response to Franco Moretti.” Critical Inquiry 36, no. 1 (2009): 159–71.
Tunkelang, Daniel. Faceted Search. San Rafael, Calif.: Morgan & Claypool, 2009.