45
The Ground Truth of DH Text Mining
Tanya E. Clement
In the digital humanities, text mining is a logocentric practice. That is, text mining in digital humanities usually begins with The Word. We extract The Word; we count The Word; we stem The Word to its root; we parse The Word; we name The Word; we disambiguate The Word; we collocate The Word; we count The Word again; we apply an algorithm that allows us to reconstruct the world of The Word as one we can visualize as a list, as a line graph, as a histogram in small multiples, or on big screens. We use the view this new world provides us to interpret The Word.
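The sequence of operations named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular project's pipeline: the sample sentence, the function names, and the suffix rule in `crude_stem` are all invented for the example, and a real stemmer is far more elaborate.

```python
import re
from collections import Counter

# Hypothetical sample text, invented for this sketch.
TEXT = "The whalers sailed. The whaler sails again, and the whaling ends."

def extract(text):
    """Extract lowercase word tokens (a deliberately naive definition of a word)."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip a few common suffixes; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def collocate(tokens):
    """Count adjacent word pairs (bigrams)."""
    return Counter(zip(tokens, tokens[1:]))

tokens = extract(TEXT)                                   # extract The Word
counts = Counter(tokens)                                 # count The Word
stem_counts = Counter(crude_stem(t) for t in tokens)     # stem, then count again
pairs = collocate(tokens)                                # collocate The Word
```

In this sketch "whalers," "whaler," and "whaling" all collapse into the single stem "whal," so three distinct words become one count before any interpretation begins.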
This practice of text mining presupposes a binary logic: there is meaning in the results or there is not. It begins with a “ground truth,” or labels that signify the presence of meaning. Sometimes we determine ground truth through annotations for machine classification: “Here, machine, are the love letters that Susan Dickinson wrote to Emily Dickinson. Please, find more like these.” Sometimes we determine ground truth after we receive clustering results: “Ah, machine, I see you have done your stemming and your parsing and your counting and you have given me a pile of words. I read them and will label them ‘whaling’” (though someone else might have said “indigenous economy”). “Ah, machine, I see you have clustered novels written by ‘women’ here and novels written by ‘men’ there. You are very clever. You must understand gender, just as I do.”
When engaged in this kind of text mining, we are reinscribing the simplest meaning of The Word. The authors of a text-mining textbook write that the results of text mining are easier to understand than numerical results because analysts “all have some expertise. The document is text” (Weiss et al., 51–52). Likewise, even when we are humanists and feminists and should know better, we think we understand the machine’s results when they are words or when they cluster books according to an author of an “always already” gender. We see a pattern we think we can interpret, because we think we know what The Word means, and gender, which we have worked so hard to complicate, is suddenly reduced to “female author” or “male author.” The Word has been proved to serve as ground truth. The Word is apodictic.
Sound, by comparison, is aporetic. Mining audio spoken word collections means extracting acoustic features for classification, clustering, and visualization. Choosing features is complicated. The Word seems to be interpretable at a determined length. What length of a sound is meaningful? The Word seems to have typical patterns of characters, seems to perform regularly as a part of speech even in the context of complex sentences, seems to have a root that grows, more usually than not, in prescribed ways. Hearing sound as digital audio means listening through filter banks, sampling rates, and compression scenarios that are meant to mimic the human ear (Salthouse and Sarpeshkar). To mine these acoustic features is to understand that ground truth must always be indeterminate. Which features you choose and how you label that cluster of acoustic features that is sound will often be different from the features and labels I might choose. We must ask: Whose ear are we mimicking? What is audible, and to whom? Playback means choosing the damping ratios and frequency ranges that include overlapping and audible signals. We must ask: What signal is noise? What signal is meaningful, and to whom? Extracting meaningful features for mining sound always means interpreting not only what sound means, but how sound creates meaning. Mining sound reminds us that we have constructed an analysis according to our own experiences with how sound is meaningful.
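The question of what length of sound is meaningful can be made concrete with a small sketch. Everything here is an assumption chosen for illustration: the sample rate, the two synthetic tones, and the use of zero-crossing rate (a simple, commonly used acoustic feature) in place of the filter banks discussed above. It is not a model of the ear or of any particular sound-mining tool.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumed value for this sketch

def synthetic_signal():
    """One second of sound: a 200 Hz tone followed by a 1000 Hz tone (invented data)."""
    half = SAMPLE_RATE // 2
    low = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(half)]
    high = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(half)]
    return low + high

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def frame_features(signal, frame_length):
    """Cut the signal into non-overlapping frames and compute one feature per frame."""
    return [
        zero_crossing_rate(signal[start:start + frame_length])
        for start in range(0, len(signal) - frame_length + 1, frame_length)
    ]

signal = synthetic_signal()
half_second = frame_features(signal, SAMPLE_RATE // 2)  # two frames, two values
full_second = frame_features(signal, SAMPLE_RATE)       # one frame, one value
```

With half-second frames the two tones yield two clearly different values. With a one-second frame the same sound yields a single intermediate value that matches neither tone. The feature is produced by the choice of frame length as much as by the sound itself.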
As humanists, we seek questions, not solutions. Practicing sound mining alerts us to the fact that in text mining, The Word should also be aporetic. Instead of The Word, we are working with a word that is always indeterminate—meaning is both present and absent at once. We construct text-mining analyses according to our own experiences with how words make meaning. We must use text mining as a hermeneutic of The Word or as a hermeneutic of text mining or as a hermeneutic of hermeneutics. The Word does not provide evidence of meaning, of identity. A word in text mining is a foil to ground truth, not its proof.
Bibliography
Salthouse, Christopher D., and Rahul Sarpeshkar. “A Practical Micropower Programmable Bandpass Filter for Use in Bionic Ears.” IEEE Journal of Solid-State Circuits 38, no. 1 (January 2003): 63–70.
Weiss, Sholom M., Nitin Indurkhya, Tong Zhang, and Fred J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer, 2005.