Chapter 2

Computing Criticism

Humanities Concepts and Digital Methods

Mark Algee-Hewitt

There is a tension present when we use computational methods to explore questions that originate in the disciplines of the humanities, one felt equally by practitioners of the field and by those who are resistant to it. This is not simply a function of the distance between the methods involved, from computation to statistics to hermeneutics and interpretation, but something more fundamental to the different epistemes of the fields in which the methods originate. Too often, when we seek to bridge the difference between computation and interpretation, we privilege one side over the other. Either computation is presented as a method that can act as a corrective to humanities practices that are unable to take a larger context into consideration, or more often, it is suggested that hermeneutics and interpretation can help us understand where our computation goes wrong. At best, we can imagine the interplay between these poles as a set of separate but mutually informative models: we use phenomena familiar to, for example, literary theory (genre, character, theme, or setting) in order to train computational models that can recognize these phenomena, and then, in turn, we use these models to explore the phenomena within a context traditionally unavailable to humanities analysis in order to help us expand the theory that we used to train the model in the first place.1 In this feedback loop, the models (literary and computational) are mutually informative, but the distance between them is maintained. In this chapter, I argue that there are other, perhaps more productive ways of managing this tension. If we bring these fields into contact with each other in the space of our analysis, if we treat the transformation that the theoretical object must undergo to become computationally tractable not simply as a proxy for the phenomenon but instead allow it to persist as a radical reconceptualization of the object, then we open up a whole new set of possibilities for computational analysis.

Consider the following simplified schematic for how computational text analysis often takes place in humanities contexts. First, researchers assemble a digital corpus through which they will pose the questions of humanities interest that motivate their project. Next, the specifics of the question posed by the researchers are either operationalized through a computational model, which can be as simple as a word-frequency table or as complex as a deep learning encoder, or they are assessed through a supervised model (and the resulting patterns are associated post hoc with disciplinary concepts).2 Finally, the results are collected and interpreted in light of the original question and the specific features of the texts that are being modeled. Regardless of whether we understand the modeling process as a mechanism for producing empirical results in and of themselves (Underwood, 34–67; So, Long, and Zhu), or whether we intend to create a hermeneutics in which the results of a computational analysis serve to redirect our interpretive lens toward specific parts of a text (Piper, 66–93; Froehlich), these steps often serve as a three-part framework for the process of text analysis in the humanities. At the center of this schema, the interface between the critical and the digital or between the hermeneutic and the computational maintains the theoretical distance between the model and the object of its operationalization.
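
In code, this three-part schema can be made concrete with a deliberately minimal sketch. The corpus path, the library choices, and the word-frequency table standing in for the "model" below are illustrative assumptions rather than a description of any particular study.

```python
# A minimal sketch of the three-part workflow, assuming a hypothetical
# directory of plain-text files; the "model" here is the simplest
# operationalization named above, a word-frequency table.
from pathlib import Path
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# 1. Assemble a digital corpus (the path is a placeholder).
paths = sorted(Path("corpus/").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in paths]

# 2. Operationalize the question as a computational model: here, a
#    document-term matrix of raw word counts.
vectorizer = CountVectorizer(lowercase=True)
counts = vectorizer.fit_transform(texts)  # shape: (n_documents, n_terms)

# 3. Collect results for interpretation: for example, the corpus-wide
#    most frequent terms, read against the original question.
totals = np.asarray(counts.sum(axis=0)).ravel()
vocab = np.array(vectorizer.get_feature_names_out())
print(vocab[np.argsort(totals)[::-1][:20]])
```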

The effects of this distance impose artificial limits on how we can combine computation and interpretation in our research. When the results of our analysis go awry, when they provide unexpected answers that contradict aspects of the corpus that we know to be true, we seek fault at exactly this junction. We assume that either the corpus that we have created to investigate our question is insufficient (in size or in focus) to represent the textual field that we seek to investigate or, more commonly, that the methods that we have selected through which to operationalize our question are not capturing the specific phenomena that we want to explore. The iterative work of text analysis (returning to the model to adjust our parameters, reconfiguring our corpus, changing the underlying methodology of our analysis, and then running the experiment again) depends on our ability to trace the faulty model to its root in order to create a better fit between what we want to investigate and the computational methods through which we seek to represent it. If the results of an authorship attribution assign differently authored books to a single author (or if the corpus of a single author is unpredictably bifurcated into multiple authors), if a topic model is full of “junk” topics, or if a named entity recognition (NER) parse collapses or divides character mentions into unexpected configurations, then we return to the model to correct the fault and run the analysis again.3

Such a unidirectional approach toward operationalizing textual phenomena presumes that we know what we are looking for. In other words, it assumes that when we model author or theme or character, we understand what those objects are, such that any discrepancy between our understanding of these phenomena and our computational model lies in an error of how the model represents the real-world object. But this approach misses the fact that none of these phenomena are real-world objects: rather, they are the apparatus of a textual theory whose parameters are no less approximate to the complexities of the text itself than those of the computational model. By assuming that our interpretive or theoretical models are better representations of the text than our computational ones, we overlook one of the most potentially productive uses of quantitative textual analysis: that by modeling one of these theoretical phenomena through computation, we can actually gain a new understanding of the textual phenomenon that we model as an author, a theme, or a character. That is, rather than interpret flaws in the analysis as an underlying fault in our ability to represent real-world objects digitally, it is entirely possible to use these discrepancies to reevaluate what we think we know about the object in question.

A new strategy of this kind requires a turn in our thinking. Instead of assuming that the quantitative model represents an imperfect proxy for a real-world object (i.e., the text), it encourages us to take the quantitative results at face value and ask what they can tell us about our theories of text. This is a process that would be familiar from much early work in the digital humanities: it shares, for example, the defamiliarizing process that different textual transformations (through quantification or even just through digitization) make possible.4 Throughout this chapter, I want to recover this process in order to understand what changes, both in our analyses and in our theories, when we elect to take the quantitative results that produce an authorship analysis, a named entity recognition parse, or a topic model at face value rather than as imperfect measurements. By putting quantitative textual analysis back in contact with the textual, often literary theory that destabilized and called into question these supposedly stable categories, we can return some of the revolutionary potential of early digital humanities work to our current practice.

Distributions of Authorship

The goal of stylometry is often understood to be particularly straightforward, at least when it is applied to questions of author attribution. Given a set of texts with known authors, can we (a) derive a quantitative approximation of an author’s style, (b) group known authors together so that individual works cluster (or classify) with their respective author, and (c) create a model robust enough that an unknown work can be correctly associated with a known author? Due to the strength of the so-called author signal within a collection of texts, stylometry operates most effectively on the most frequent words in a corpus, whose minute variation at large scales creates the distributions that can most easily differentiate between authors. Although the exact number of features is a subject of some debate in the literature, it remains an astonishing effect that relatively few function words, often no more than the 300 most frequent words (MFWs), can so effectively sort a corpus into authorial groups.5 Take, for example, a plot of 518 works of detective fiction (written between 1881 and 2000) by twenty-five authors (Figure 2.1).6 Here, the icons on the graph represent individual novels; their shapes, which generally group together neatly, represent the known authors of each text. The effect of this analysis is startling: across over a century of difference, authors’ corpora still remain tightly bound in distinct, recognizable clusters (for the most part), whether those groups are found at the center of the diagram (like the works of Ellery Queen) or in a cluster on the periphery (like the corpus of Erle Stanley Gardner).

[Figure 2.1: a t-SNE scatterplot in which each symbol is a detective novel and each symbol shape an author; five groups of symbols are circled and labeled A through E.]

Figure 2.1. Authorship similarity between twenty-five twentieth-century authors of detective fiction based on the 200 most frequent words, visualized through a t-distributed stochastic neighbor embedding (t-SNE) plot, which uses a modeling algorithm to position each text so as to best represent its similarity to the other texts in the corpus (texts closer to each other in the plot have a greater similarity).
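
The recipe behind a plot like Figure 2.1 can be sketched in a few lines of code. This is an illustration of the general MFW approach rather than the pipeline used to produce the figure; the feature count, the z-score normalization (familiar from Burrows-style stylometry), and the library choices are all assumptions.

```python
# Sketch: relative frequencies of the most frequent words, z-scored across the
# corpus, projected into two dimensions with t-SNE. Nearby points are
# stylistically similar under this measure.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

def mfw_tsne(texts, n_mfw=200, random_state=0):
    # Keep only the n_mfw most frequent words across the whole corpus.
    counts = CountVectorizer(lowercase=True, max_features=n_mfw).fit_transform(texts)
    counts = counts.toarray().astype(float)

    # Relative frequencies, so that text length does not dominate the measure.
    rel = counts / counts.sum(axis=1, keepdims=True)

    # Z-score each word column (the normalization behind Burrows's Delta).
    z = (rel - rel.mean(axis=0)) / rel.std(axis=0)

    # Two-dimensional projection for plotting.
    return TSNE(n_components=2, random_state=random_state).fit_transform(z)

# coords = mfw_tsne(texts)  # "texts" is a list of strings, one per novel
```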

Such an analysis might be disorienting to students or scholars without experience in the digital humanities. The notion that quantifying such a small and nominally content-free word set can capture something as meaningful, as deeply human, as style or authorship seems at once miraculous and terrifying. But it is worth being specific about what a plot such as this actually captures. Authorship, unlike many other literary or textual phenomena, has a real-world extension in the persons of the authors that we credit with writing the texts. When we say that an analysis such as this captures the style of Ed McBain, we are implicitly making an argument that connects the words (the high-frequency MFWs) to the embodied personhood of the author such that there seems to be something connecting the physical author to their style. Practitioners of authorship attribution have devoted a significant amount of text to explaining this phenomenon in terms of an often rather naive cognitive psychology. The variations between authors, we are told, can be resolved into differences in “background and training,” where “function words are used in a largely unconscious manner” and can therefore capture “pure stylistic choices” that are “somehow affected by an unconscious personal stamp” (Baayen et al., 1; Stamatatos, 540; Eder, 103). Some even go further and use the same statistical processes to ascertain “basic psychological processes” or even infer personality and mental health, including trauma and depression (Argamon et al.; Juola, 74). Confounding factors in the analysis are ascribed to the chosen model features or are perhaps “due to personal development of the author over time” (Diederich et al., 110). These claims have two key aspects in common: first, the authors of these studies supply little evidence linking the cognitive state of a real-world human mind to variations in the statistical patterns that their attributional findings use to associate groups of texts; second, aside from the studies that specifically try to link psychological processes to textual patterns, there is no reason to engage cognitive or psychological theory at all.7

After all, when authorship attribution is treated purely as a text classification problem, the actual persona or even embodiment of the author is often immaterial to the analysis. If the texts cluster together such that texts with the same author’s name are associated, the link between that name and the physical human, or their mental state, is arbitrary at best.8 In fact, relating commonalities in word frequency distributions back to a unique unconscious is frequently complicated by the slippery nature of authorship itself. Ed McBain, clustered somewhat reliably by the graph in Figure 2.1, is the name under which Evan Hunter wrote his best-selling genre fiction, and Evan Hunter is the adopted pen name of Salvatore Lombino, who also wrote under the pseudonyms Hunt Collins and Richard Marsten (MacDonald). While, on the surface, this seems like the ideal case for an authorial attribution study (whose goal would be to reassociate the various pen names used by Lombino under a single category), such a study would heavily weight the biographical theory of authorship while, at the same time, undercutting the ways in which Lombino’s books were marketed and sold to the public. In fact, in Figure 2.1, McBain’s novels are actually located in three distinctive clusters: (A) toward the left-hand side of the graph, (B) to the right, separated by the clusters of Elizabeth Daly and Frederick Schiller Faust, and (C) on the right periphery of the central cluster. In classic authorship attribution, this would be read as an error: one unique individual whose works are wrongly differentiated into three distinct authors. But given the complexities of his own authorial identity, the fractured nature of McBain/Lombino/Marsten/Collins is partially captured by the apparent “flaws” in the analysis.9

This complexity of authorial layering—the relationship between the author as human and the author as signatory of a body of work—is deeply familiar in a literary studies context. It is precisely the nexus of bibliographic, performative, and fictional identities that drives Michel Foucault’s “What Is an Author?” (Foucault).10 Foucault’s essay remains valuable in its reconfiguration of the author into the author-function: a recognition of the purely classificatory work of the authorial signature. “A name can group together a number of texts and thus differentiate them from others. A name also establishes different forms of relationships among texts” (123). In this grouping function, the author becomes a convenient shorthand to describe similarities among a body of work. In discussing the Nancy Drew series as the work of Carolyn Keene or the Baby-Sitters Club as the work of Ann M. Martin, we are typically less interested in the presence of ghostwriters or an editorial house style than we are in understanding each collection of texts as belonging to a meaningful classificatory group. This is not to say that differentiating Ann M. Martin from her ghostwriters might not have value within the context of a particular authorship study; rather, understanding the group as “by” a single or multiple author-function(s) is equally valid, depending on the perspective of the analysis.

It is in Foucault’s theory that we can find an unexpected power in stylometric authorship attribution if we let go of the implicit connection between author and author-function (that is, between the lived experience of an individual and the signatory on a collection of texts) that lives in the pseudo-psychological explanations for the connection between style and (un)consciousness. In its statistical model, stylometry provides an unexpectedly simple yet powerful answer to the question animating (and titling) Foucault’s work: What is an author? The answer, according to the computational model, is that it is a set of word frequencies that cluster texts into recognizable groups with high probability. Such an explanation of authorship does not need the apparatus of cognitive psychology, not merely because, in the contexts of the articles that I discuss above, such connections are worse than arbitrary, but more importantly because defining authorship in this way offers a greater explanatory power for associations between texts that do not rest on the mere accident of a single individual authoring a cluster of books. The other kinds of work that the author-function performs—conceptual, theoretical, social, and classificatory—can actually be better observed if we allow authors to recede from stylometry and rest our theories specifically on the new ways in which quantitative analysis allows us to reconfigure the classification of texts. What I am suggesting here is a radicalization of what is already central to stylometry: the substitution of word patterns for the figure of the author as a unique consciousness. Our theories, unprepared to grapple with this reconfiguration of authorship, seek the comfort of offering a cognitive explanation for a statistical phenomenon: by preventing or resisting this, we are able to leverage stylometry to understand different kinds of author-functions.

Such an approach is particularly relevant in cases wherein a stylometric analysis either groups together authors known to be different (for example, in the presence of a strong editorial or house style) or breaks apart clusters known to be by a single signatory author. In Figure 2.1, for example, Sue Grafton, author of the “alphabet series” of detective novels, is neatly divided into two distinct clusters (D and E). With some refinement to the graph (changing the feature set, adjusting the number of authors, or moving to a different clustering method), it is difficult but possible to resolve this split and cluster Grafton’s works together; however, doing so again reifies the a priori author over the potential information that the quantitative methods reveal.11 The split in Grafton’s alphabet series, as it turns out, is not random: books A Is for Alibi through F Is for Fugitive lie on the left side of the Mary Roberts Rinehart cluster (D), while books G Is for Gumshoe through N Is for Noose are on the right side (E). Such a split marks out a clear change in Grafton’s writing over time: the fact that the split persists over different feature sets and models as two distinct clusters (as opposed to a drawn-out gradient) suggests that this division represents something fundamental about Grafton’s writing. While it is tempting to use this division to explore the possibilities of a so-called late style, there is another, equally important historical dimension to this change.12 In Grafton’s case, F Is for Fugitive marked the first entry of one of her novels into the New York Times Best Seller list (Cowles). What the analysis naively divides, then, is the group of novels written before the prestige of a bestseller with an eye toward creating a popular readership and the group of novels belonging to a series that has already achieved bestseller status. The changes in publishing, marketing, editorial practices, and Grafton’s own attention to the series that occur alongside this shift in popularity create a gulf in the novels that, although imperceptible to readerly attention, is still captured by the minute changes registered by stylometry. What this suggests is that we are right to understand the identity of “Sue Grafton,” author, as bifurcated: there are two Sue Graftons that are fundamentally irreducible within a social understanding of the publishing and bookselling industry.

This is not to say that the data indicates the presence of a ghostwriter or even necessarily a heavy editorial hand: the data, I want to argue, does not indicate anything about the embodied author at all. Rather, it suggests that among the different ways that the author-function can be understood, there is a detectable difference between texts written before and after they belong to a set that has come to prominence through such a mechanism of popularity as the New York Times Best Seller list. We can call this set either the “alphabet mystery” novels or “the novels of Sue Grafton”: the label is largely immaterial. What is material is that the set can be reliably differentiated based on a very specific historical circumstance. Of course, the interpretive steps that follow rest on how we want to understand the division. If we do want to speak to the figure of the author as traditionally understood, we might seek to interpret the difference as that between an author before and after they are aware of their success. Or, if we want to analyze the series, we could examine the difference in editorial practices in a publishing house before and after they know that they have a hit. What the quantitative analysis gives us (along with the data that reveals this bifurcation) is the choice between these interpretive avenues. Neither of these interpretations was a hypothesis driving the analysis at the outset or fundamental to the model: the choice between them is driven by scholarly interest rather than the parameters of the data.

This, I want to suggest, is the power of quantitative analysis: despite our desire to describe it with reference to our intuitive understanding of authors as the origin points of texts, the data itself describes similarity through the semantics alone. Splitting traditionally one-author corpora or grouping multiple authors together within a single unitary cluster adds dimensions to our understanding of similarity and difference within a corpus. By letting go of the connection between pattern similarity and authorial embodiment, we can gain a fuller understanding of the different ways that texts might be understood as similar or different based on social, political, conceptual, or even authorial pressures. If we allow a computational model of authorship to guide our exploration of the corpus, we can surface new patterns of relationship between texts that may not conform to our assumptions about “authors” but that can nevertheless powerfully alter our understanding of the texts themselves.

Character Patterns

The works of Jane Austen are central not only to understanding character but also to understanding the history of how character has been theorized and analyzed, both critically and computationally. Just as studies of character centralize Austen’s creation of the first modern character-system, so too do Austen’s works provide a key testing and pedagogical ground for natural language processing (NLP) methods.13 In particular, research into the application of named entity recognition (NER) as a potentially reliable way to extract (and resolve) character coreferences has depended on Austen’s corpus to a surprising extent.14 Characters in Austen occupy a unique inflection point. No longer the parodic or archetypal figures of eighteenth-century fiction, they have not yet assumed the more experimental aspects that characterize the novels of the late nineteenth or twentieth centuries. Austen’s protagonists, in particular, appear to the reader as “rounded” or “deep” and well situated within a system of major and minor characters that offers a fully psychologized world familiar to her readers. It is for this reason that Austen’s texts, more than any others, serve as the primary example of NER. Her ability to give the impression of an inner psychology to Elizabeth Bennet or Mr. Darcy of Pride and Prejudice suggests the possibility of a psychological permanence within the text: an inner life of a character whose various names, actions, and pronouns can be resolved into a single individual being that takes on the attributes of personhood.15

The question of the inner life of literary characters resides at the roots of the debates that surround them. It is clear that characters are one of the most recognizable and transportable phenomena of literature: when describing books, students leap to characters much more quickly than they do plot, theme, or even setting (Vermeule, 49). For all of their recognizability, however, they are notoriously hard to define. Even when they form the basis of a formalist theory of text, they remain slippery entities to pin down. To Vladimir Propp, the formalist critic, for example, characters are the foundational unit of the story, the medium through which action propagates. Propp is a crucial figure in this history, as his reading of character can be traced through the work of poststructuralist theory in which (much like the death of the author) characters either become psychoanalytic archetypes or are reduced to mere rhetorical effects—the words on the page (Lacan, 6–50; Barthes). Recent work has refashioned the character into a nexus of the interactions between the text and the expectations, ideas, and desires that readers bring to it (Frow; Vermeule; Lynch). Yet once again, a hard look at the specifics of the closest analogue to character that computation has to offer reveals not only a key strategy for extracting and tracing the appearance of characters across a corpus of texts but also a new understanding of the material of character itself.

A computational approach to NER and, equally important, coreference resolution is, by definition, pattern-based.16 From early coreference resolution systems that used decision trees and complex syntactical rules, to contemporary models that use neural solutions, coreference resolution—the operation of deciding object permanence in characters—has always relied, like readers, on patterns of reference (Elango; J. Wu and W. Ma). This method, however, converges with contemporary literary theories of characters within texts. The complex assemblage of rhetorical effects (Barthes), narrative necessities (Propp), and readerly expectations (Lynch) creates a system of reference through which readers can come to understand the individual identities and roles of characters within a particular novelistic world. Austen’s work serves as a particularly cogent example of this process at work: Alex Woloch’s theory of major and minor characters rests on what he calls the “character-system” of Pride and Prejudice that weaves an inter-referentiality among the Bennet sisters according to their narrative (and economic) movements throughout the text (Woloch, 61). Where Woloch’s theory and contemporary NER align is in their use of patterns of character relationships to establish the rules of a system through which each character can be understood as part of the whole. NER, however, takes the rules of the system seriously and, in doing so, once again suggests a more radical reading of the character-system than is offered in contemporary literary theories. As with the author, the seemingly straightforward answer to what a character is that is supplied by NER—a referential set of terms that exists in a structured relationship to the syntactical universe of the text—opens up a new set of possibilities for understanding exactly how characterization happens within a novel. The naive reading of the NLP model, its successes, and more importantly, its failures, in tandem offer us a way to understand the actual operation of character within a text without the baggage of our own assumptions about the permanence of identity and the identifiability of individual characters.17

Table 2.1. Counts of Tokens Resolved as Coreferences to Mr. Darcy (Character 2) in Pride and Prejudice Using BookNLP

Character    Token    Count
2            darcy      374
2            he         318
2            her         10
2            him        189
2            his        299
2            mr.        273
2            she          9

A key breakdown of NLP that gives us insight into complex modes of characterization in Pride and Prejudice itself occurs midway through the novel.18 Woloch places the binary of Elizabeth Bennet and Mr. Darcy at the center of his theory, arguing that the sharpness with which this pair is defined and juxtaposed creates a dialectic through which the minor characters of the novel are resolved into the same world (Woloch, 50). The differentiation between Elizabeth and Darcy seems key to the character system of the text: the gendered nature of their relationship situates the individual identities of the characters within both an intersubjective and social system that rests on the separation of the two as individuals fulfilling the necessary roles in the romance plot. From an NLP perspective, the “he” and “she” of the romance narrative should be among the most concretely identifiable based on the patterns of the novel. A BookNLP parse of the novel, however, reveals critical moments where this binary breaks down. As Table 2.1 shows, Mr. Darcy, identified by the model as character 2, is resolved into 1,472 coreferential tokens, where 647 are a variation of “Mr. Darcy” and 806 are some variation of a male pronoun. That leaves 19 instances when Mr. Darcy is referred to as “she” or “her.” If we assume that each of these instances is an error on the part of the parse, then the overall correctness is 98.7 percent, which is an acceptably high rate of precision. When we examine the individual points at which Darcy is misgendered with female pronouns, however, many of these so-called errors reveal something much more complex in the text.
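
For readers who want to reproduce tallies like those in Table 2.1, the following sketch counts the tokens that a BookNLP parse assigns to a single character. The file name and the column labels ("COREF," "text") reflect the tab-separated entities output of recent BookNLP releases and should be treated as assumptions that may differ across versions.

```python
# Sketch: tally the coreferential tokens assigned to one character in a
# BookNLP entities file (tab-separated; column names are assumed here).
from collections import Counter
import csv

def coref_token_counts(entities_path, character_id):
    counts = Counter()
    with open(entities_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if int(row["COREF"]) == character_id:
                counts[row["text"].lower()] += 1
    return counts

# counts = coref_token_counts("pride_and_prejudice.entities", character_id=2)
# total = sum(counts.values())                        # 1,472 for character 2
# errors = counts.get("she", 0) + counts.get("her", 0)
# print(1 - errors / total)                           # ~0.987, the 98.7 percent above
```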

At the beginning of chapter 10 of the second volume of the novel, Elizabeth, residing in Kent, has begun to encounter Mr. Darcy during her walks in a seeming coincidence that indicates, to the reader, Darcy’s interest in a romantic attachment:

He never said a great deal, nor did she give herself the trouble of talking or of listening much; but it struck her in the course of their third rencontre that he was asking some odd unconnected questions—about her pleasure in being at Hunsford, her love of solitary walks, and her opinion of Mr. and Mrs. Collins’s happiness; and that in speaking of Rosings and her not perfectly understanding the house, he seemed to expect that whenever she came into Kent again she would be staying there too. His words seemed to imply it. Could he have Colonel Fitzwilliam in his thoughts? She supposed, if he meant any thing, he must mean an allusion to what might arise in that quarter. It distressed her a little, and she was quite glad to find herself at the gate in the pales opposite the Parsonage. (Austen, emphasis mine)

Here, the highlighted pronouns indicate all of the entities resolved into character 2—everything that the model guessed was Mr. Darcy. Far from being separate, Darcy and Elizabeth are collapsed into the same entity within the first sentence (“He never said a great deal, nor did she give herself the trouble”). The reference passes back and forth between them until it is entirely resolved into Elizabeth by the end of the passage. The syntax here complicates the ability of the model to easily resolve the pronouns into a character, despite their gendered nature. The movement of the first sentence as the clauses trade between the two halves of the dialogic exchange (or lack thereof) runs the two characters together as, narratively, both converge on the same interpersonal tactic: as co-participants in the social act of walking and not talking, they act effectively as one. Moreover, as the passage continues, it becomes clear that this is Austen using a form of free indirect discourse to probe Elizabeth’s thoughts and suppositions about Darcy. Her lack of certainty over his goals in their interactions is relayed in a passage focalized through Elizabeth as she seeks to understand Darcy’s motivations, effectively creating her own version of his character to herself: “She supposed, if he meant any thing, he must mean an allusion to what might arise in that quarter.” By this point, all references to character 2 are inhabited exclusively by Elizabeth. Or, more accurately, they are inhabited by her suppositions about his character. As Elizabeth begins to cognitively perform Darcy, she takes over his referentiality: she effectively becomes Mr. Darcy from the point of view of the narrative.

What the coreference resolution captures as it “mistakes” one character for the other is the convergence of the two in Elizabeth’s mind, which becomes an act of virtual puppetry as her mental model of Darcy escapes the boundaries of the “real” Darcy and becomes a version of the character created for and by her, fully inhabited by her subjectivity. In its naivete regarding the characters’ gendered roles, the coreference resolution model has uncovered something profound about Austen’s characterization. The interpenetration of the two characters at this point disturbs the boundary that separates them. Temporarily, their back-and-forth intersubjectivity followed by the performance of Darcy within Elizabeth’s focalization blends the two characters together, dissolving the space between them and creating an amalgam of Darcy-as-Elizabeth-as-Darcy that is crucial to a pivotal moment in the romance plot (immediately following this chapter, Elizabeth discovers Darcy’s agency in separating her sister from her fiancé, which again drives a wedge between the two characters). The indistinguishability of the two central characters is not a flaw in the NLP model, nor an effect of the general similarity between all characters, but rather a crucial, carefully plotted set of effects that demonstrates the interplay between the various psychological states that Austen has established for her characters. Once again, this insight is made available not through a careful close reading but rather through the transformation of character into a syntactic and semantic pattern through the NLP parse. Like an authorship attribution, reducing the characters to the language patterns that animate them on the page reveals the extent to which they occupy the same set of spaces and identities. When we, as readers, bring the assumptions of psychological interiority and individuality to our encounter with the characters, we miss the extent to which the boundaries between them are permeable in ways that are crucial to both the plot and the effect of characterization itself.

What would it mean, then, to improve a natural language parse to the point where this slippage disappears? An updated version of BookNLP built on the LitBank work that Bamman describes in chapter 3 of this volume requires, in Bamman’s words, “a stable entity for each character [given through coreference] that such counts can apply to.” Such stability, however, imposes a specific reading on the text—namely, one that requires characters to remain stable entities rather than shifting, interpenetrated, and contingent figures that always exist in a complex negotiation between writer and reader. If we do understand the misgendered Darcy as a function of Elizabeth’s performance of the character in her own act of reading, then the model that Bamman proposes privileges the reader’s reading performance over Elizabeth’s, just as the current BookNLP model makes Elizabeth visible as a reader of Darcy. In arguing for a born-literary NLP—that is, an NLP model that treats literary phenomena, such as character, as a referent to be modeled rather than as an alternate model of the text—Bamman privileges an understanding of character as a persistent, stable entity over the textual complexities of Austen’s diegesis. To be clear, such a model would not be wrong: my point is that it is not objectively right, nor should we understand the model as improved by the loss of this complex nuance in the text.

Just as, in the previous section, my intent was not to remove the author from consideration but rather to open up a manifold space of possibility for what an author could be (not Barthes’s death but Foucault’s function), my goal here is not to return to the flatness of the poststructural character-as-rhetorical-effect but rather to understand the kinds of malleability afforded to the assemblage of words, syntax, and impressions (both external and internal) that we call character. Literary characters are not persons: we as readers understand this. Yet on the page and outside of it, we imbue them with a spectrum of possibilities that exceeds mere personhood and rests on our dual understanding that these are rhetorical effects that we can treat as objects, archetypes, or categories of being. Darcy’s ability to transcend the page for the reader begins with his ability to transcend the boundaries of his own characterization: in a very real sense, Elizabeth is the first reader of Darcy and the first to imbue him with a psychological interiority, setting the pattern for generations of readers to come (Vermeule, 51). Although this model of nested character interaction, in which characterization becomes a performance of a psychologized interiority, has precedents in literary scholarship, particularly that on Austen, a computational model can aid us in radicalizing our understanding of what character is, thereby deepening our understanding of how it works in the text itself.19 Characters modeled “incorrectly” by NLP software therefore do not necessarily result from errors in the fitting of object to model: rather, they can direct our attention toward the ways in which characters themselves are hybrid and contingent proxies, less real-world objects than literary critical models of being.

Topic and Theme

In the case of authors and characters, there is a clear line between the statistical approximation and the object it seeks to represent. After all, the challenge in thinking through the relationships revealed by authorship analysis is in moving beyond the embodied author; even if characters are slightly more ephemeral, few readers would dispute that Elizabeth or Mr. Darcy are characters. When it comes to topic models, however, the object of analysis is much more opaque. Simply by calling the algorithm a “topic” model, we are already making an implicit case that what we are modeling are topics; however, this just displaces the question from the computational to the textual node. In other words, we are still left with the question of what a “topic” is in the context of a text.

Much of the critique of the use of topic modeling in humanities contexts centers on the tendency of readers to overinterpret the individual topics themselves, investing the wordlists that stand in for topics with thematic, discursive, or relational meanings that they may not carry (this is particularly the case for low-coherence topics).20 In fact, it is still unclear how best to represent a topic (which is, in itself, a distribution of posterior probabilities over a vocabulary). While scholars most often relay topics as a wordlist of top terms per topic (replicating the output of the most popular topic modeling software), it is also possible to describe topics with a series of meaningful names assigned by the analyst or as a network of semantic connections (McCallum; Lau et al.; Goldstone and Underwood). All suffer from particular weaknesses: lists of top terms may overrepresent extremely high-frequency or low-frequency terms at the expense of more meaningful words; labeling topics often overgeneralizes their specificity; and networks of topics are still based on a limited list of top terms. As such, the status of topic models in the humanities remains in flux: it is a popular method, but one with significant drawbacks in interpretability or meaningfulness.
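
For reference, the sense in which a topic “is a distribution of posterior probabilities over a vocabulary” can be written compactly in standard LDA notation (the notation below is mine, not the chapter’s): each topic is a probability distribution over the vocabulary, and each document is modeled as a mixture of topics.

```latex
% Standard LDA notation (not the chapter's own): topic k is a distribution
% over the vocabulary V, and document d mixes the K topics.
\[
  \phi_{k,w} = p(w \mid z = k), \qquad \sum_{w \in V} \phi_{k,w} = 1,
\]
\[
  \theta_{d,k} = p(z = k \mid d), \qquad
  p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w}.
\]
```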

Many early critiques of topic modeling stemmed from the disjunction between the design of the algorithm and its use by humanists: as an information-retrieval tool, the models would generate lists of co-occurring terms as a mechanism for labeling clusters of documents, not for the kind of close reading that humanities scholars try to bring to bear on them (Schmidt; Mimno et al.).21 A key sign of the success of these critiques lies in the caution with which humanists now deploy these models: scholars are encouraged to think critically about the kind of information that a topic model produces and how that might be read as meaningful (or not) within humanities contexts.22 Even as we think through the best ways to represent topics within the work of the computational humanities, however, the problem of their referentiality remains a significant obstacle to understanding the results that they produce.

Much as in authorship attribution, where the assumed connection between the distribution of frequencies and the embodied author leads scholars to gesture toward causal cognitive explanations for the clusters, scholars working with topic models will similarly gesture toward “discourses,” “aboutness,” or above all, “theme.”23 The problem here is that even if topics could be resolved into themes with some rigor, the use of “theme” in literary studies is even less rigorous than that of “topic.” Propp’s turn to character in his classification of folktales is predicated on the impossibility of classifying texts according to a higher-order taxonomy, particularly that of theme: “If a division into categories is unsuccessful, a division into theme leads to total chaos” (Propp, 7). In Erich Auerbach’s Mimesis, theme appears to be an important consideration, and yet theme and motif slide into each other with such regularity that both lose a degree of rigor (Auerbach, 470–71).24 In other words, theme is no more helpful than topic for understanding what is being revealed by the unsupervised probabilistic model. Without this conceptual apparatus, relating each topic to its textual equivalent becomes doubly fraught. If we understood topics as representing themes and we had a rigorous definition of “theme,” then we would be able to hermeneutically read topics in the ways that so many scholars of the digital humanities seek to do. When Andrew Piper describes a topic as “a heterogeneity of statements under a semantic field,” he is, in part, speaking to the multiplicity of things that a topic can carry within its distribution (Piper, 74, emphasis in original). But it might be equally meaningful to apply this heterogeneity to what we can mean by both topic and theme.

The problem is one of scale: a theme or motif, in Auerbach’s sense, may be as sprawling and complex as romance or as limited and singular as the garden in which the lovers meet. This is why Propp found it impossible to create a taxonomy based on such differences. Likewise, a computational topic might easily be as expansive or specific, depending on the frequency of the words it contains and its probability distribution across the texts of a corpus. For a taxonomy, such heterogeneity makes comparison (and thus differentiation) impossible. In even the best topic models, we still remove a significant percentage of topics, not because they fall into one of the categories of low-quality topics described by David Mimno and colleagues, but because the phenomenon they capture is incommensurate with those captured by the rest of the topics within that model.25 Reading two topics with this degree of difference, not just of subject, but of scale, is as much an exercise in futility as the Borgesian categorization that Propp describes in which tales about unjust persecution are differentiated from tales with three brothers (Propp, 8).26

Once again, the solution that I want to suggest lies in taking the methodology literally. As an information-retrieval tool, topic models excel at organizing documents into clusters of texts based on shared clusters of terms that themselves occur together with high probability. In the instructional literature on topic models, the list of terms associated with each topic (as distinct from the posterior probabilities of terms in topics) is generated as a heuristic to help analysts give labels that assist in indexing the clusters generated by the posterior probabilities in the model, not as an end of the analysis in and of itself. If a topic model is applied to a large, undifferentiated corpus of texts and generates recognizable clusters of documents in ways that answer questions about difference and similarity, then the actual topical composition that creates the clusters may be, in many cases, immaterial, given that we have much better information with which to diagnose cluster membership, once those clusters are identified. In fact, I want to put this in even stronger terms: the desire to “read” the topic can be, in many cases, detrimental to understanding the principle behind the distribution of texts generated by the model because we, as human readers, are trained to treat any set of word clusters as if they all belonged to the same hermeneutic space.

Take, for example, a topic plot of the Gale American Fiction corpus, which contains 18,040 novels.27 To generate this plot, I applied a topic model with 150 topics in Mallet and reduced the posterior probabilities of the topics in texts to a two-dimensional plot using t-SNE.28 Each dot represents a novel whose relative position is a function of the topical similarities and differences from the texts that surround it: the posterior probabilities of topics in documents index the similarities and differences between texts. The individual points are colored according to the most distinctive topic in each text, allowing us to explore the graph both for the clusters that reveal the underlying structure of the data and the topics whose probabilities determine these patterns. What is immediately apparent is that a clear set of clusters does emerge from the analysis. Texts are predominately grouped by shared topics (particularly in the well-defined clusters at the periphery of the graph), and adjacent clusters typically share topics among their top three most distinctive topics.
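
The shape of this analysis can be sketched as follows. The chapter’s model was fit in Mallet; the scikit-learn LDA below is a stand-in rather than the author’s pipeline, and “most distinctive topic” is simplified here to the highest-proportion topic per document.

```python
# Sketch: fit a topic model, project the document-topic proportions with t-SNE,
# and color each novel by its dominant topic. Library choices and parameters
# are illustrative, not those used for the plot described above.
import matplotlib.pyplot as plt
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

def topic_plot(texts, n_topics=150, random_state=0):
    dtm = CountVectorizer(stop_words="english", min_df=5).fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=random_state)
    doc_topics = lda.fit_transform(dtm)          # documents x topics proportions

    coords = TSNE(n_components=2, random_state=random_state).fit_transform(doc_topics)
    top_topic = doc_topics.argmax(axis=1)        # dominant topic per novel

    plt.scatter(coords[:, 0], coords[:, 1], c=top_topic, s=4, cmap="tab20")
    plt.title("Novels positioned by topic proportions (t-SNE)")
    plt.show()
    return doc_topics, coords
```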

But while the data reveals that there is an underlying structure to this corpus of novels, the nature of that structure is surprisingly difficult to decipher from the topics alone. The cluster of novels at the very bottom of the graph (A) predominately shares a topic (topic number 6) whose ten top terms are captain, ship, deck, men, board, vessel, crew, sea, cabin, mate. Conversely, the top of the graph features a small but significant cluster (B) whose top topic (topic number 42) contains professor, college, class, one, said, university, room, year, student, study. Together these two topics point to two key genres of nineteenth- and early-twentieth-century American literature: sea tales and college novels. Genre, therefore, seems to be a key element of the structure of the graph; and genre, from these two topics, seems largely concerned with setting: at opposite poles of the graph, we find novels set at sea and novels set at college campuses. But what, then, are we to make of the sizable cluster (C) on the top left of the graph? It appears to be as equally distinct as either the sea or college novels, but its top topic (topic number 137) contains a much different set of words: man, door, roof, house, one, found, two, night, woman, find. While topics 6 and 42 can explain the settings/genres of their respective clusters (ships and cabins versus professors and colleges), topic 137 offers very few clues as to the relationality of the texts that it groups. Most novels, after all, contain men and women, or doors and houses.

If, however, we move beyond the topic to the fact that these texts form a coherent cluster and actually look at its members, the logic behind the cluster immediately becomes clear. Among others, the cluster contains A Tragic Mystery, The Trevor Case, and The Silver Blade: A True Chronicle of a Double Mystery. What the model has revealed, then, is the mystery/detective genre. Working backward from this knowledge to the topic, we can certainly imagine the ways in which the terms night, woman, house, and above all, found and find, work within this genre; but, crucially, this reconstruction can only happen iteratively, in retrospect. The process of finding the cluster and identifying its commonality can only be done without any reference to the topics themselves. The same can be said for the other two clusters. Our sea tales cluster contains such novels as Thirty Years at Sea, The Adventures of a Naval Officer, and of course, Moby Dick. Likewise, with titles such as A Princetonian, College Girls, and When Patty Went to College, the connecting theme of the college cluster is also apparent from the metadata. Again, the topic model reveals crucial information about how texts cluster without reference to the composition of the topics at all; what it reveals is a set of clusters that seem to follow the same logic as Vólkov’s catalog of themes that Propp believes approaches chaos. We understand sea tales and college tales primarily in terms of their setting, but mysteries are predominately plot driven.29 Exploring the clusters further, we find religious novels (both realist novels of faith and historical novels set in the biblical past), fiction about women in the workforce, and a cluster of temperance novels. To catalog these as “themes” is to make the same mistake as trying to decipher the clusters from the lists of words that are the topics themselves.30

By resisting the urge to begin by reading the topics and (over)interpreting the wordlists that they contain, we can explore the large- (and small-) scale clusters through the metadata that we already have. The incommensurability of the topics (much like that of themes) resolves into a set of distinctive groupings whose identity offers a complex but interpretable set of phenomena or discourses around which this set of American novels of the nineteenth and early twentieth centuries are structured. The topics, of course, remain a crucial part of the downstream analysis, but rather than being led by the naive interpretation of the method itself as a way of revealing structure within a collection of documents, we can gain a new understanding of the variety of things that “theme” or “motif” (or even “plot” or “setting”) might represent within the context of a corpus. Much in the same way that a literal approach to stylometry allows us to rethink what we mean when we talk about an author, or how coreference resolution helps us to understand where character lives in the space between the text and the reader, topic models can help us understand the heterogeneous ways in which texts can group together outside of, or beyond, a comprehensive definition of theme, field, or discourse. Topic models call our attention to the ways in which these categories themselves are technologies of information retrieval: all represent higher order ways of organizing, relating, and categorizing texts. When a topic model sorts a corpus into an unexpected arrangement, then, it can defamiliarize our understanding of its organization and surface new connections that are equally as meaningful for that corpus as any of the critical schemas that we may seek to impose on it.

Reading the Model

What unites all of the brief experiments that I have offered here is an argument in favor of complicating the standard practices of operationalization and modeling in the computational humanities by taking the results of these practices at face value. Our traditional ways of understanding both operationalizing and modeling begin with the presumption that we are using computational methods to approximate something “out there”: a phenomenon that exists independently of our analysis.31 Taking this approach, however, mischaracterizes the relationship between literary theory and the computational model that seeks to approximate it: both are, in fact, models of the text, and both approximate the complexity and multivalenced nature of the object itself. It is only because we have become so long habituated to the ways that we read and talk about text that our critical models of literature (author, character, theme) appear to be the “real” object that our computational methods simulate. We can therefore understand the analytic distance between the two, the literary and the computational (or the humanities and the computational), as an effect of this habituation. By recognizing both approximations as models (the author as a unique, singular genius solely responsible for the text and the author as a collection of word probabilities), we can finally bring both halves into contact with each other and create a mutually informative account of the object that can reveal surprising new aspects. Doing this within the space of humanities research alone involves pushing against a surprising amount of inertia: for example, our understanding of an author is freighted with a set of social, cultural, psychological, and economic concerns whose viability requires that there be a single authorial source for a text. By complicating our understanding of a phenomenon like authorship through a computational model, we give ourselves the opportunity to revise our understanding of the underlying phenomena outside the constraints of habituation and practice such that we can radically alter our understanding of the object itself.32

A turn such as this, however, is not without its risks. For example, Hannah Ringler argues for a strong hermeneutic approach to computational analysis, contending that methods alone are unable to interpret the data that they produce. In the new framework that she proposes, however, the process of “asking with” returns us again to our habituated forms of humanities knowledge: “in this process, an analyst starts with artifacts that they know well, asks humanistic questions of them, and uses tools to help them answer those questions.”33 The methodological intervention that I propose in this chapter asks us to let go of the idea that we know our artifacts well. Interpretation is still a crucial part of the research process, but rather than simply bringing humanities interpretive methods to the results of a digital analysis, we allow the results to guide our analysis, taking seriously and literally the sometimes radical transformations that they enact on the artifacts that we only thought we knew. Interpretation, then, ceases to be a second-order operation on the results of the analysis and becomes a negotiation between computational and humanities models of the phenomenon that we seek to study. The second pillar of Ringler’s framework, “asking about,” comes much closer to what I propose here. Interrogating our methods, both computational and hermeneutic, for their underlying assumptions allows us to understand their interaction and the mutually constitutive ways through which they alter our understanding of the objects of our analysis. Taking the results of our computational analysis seriously and using them as a platform for our interpretive practices demands that we understand what these analytic methods actually do.

Herein lies the risk of the analytic transformation that I am proposing. If we give our computational results the same interpretive weight as our humanities theories, then we need to fully understand how they function. If we do not, then we risk resting our hermeneutics on a set of faulty assumptions or a contingent, nonreplicable set of results. The methods of the computational humanities often include statistical validation practices: these can aid us in differentiating between a model that has gone awry and returned arbitrary, random, or otherwise mathematically invalid results and a working model that returns an unexpected yet productive set of results that do not conform to our habituated hypotheses about the object that we are modeling. Ringler’s implicit critique of tools (black boxes that obfuscate the actual computational work to facilitate ease of use) as opposed to methods can help point us toward the ways in which we can validate our models without recourse to a ground truth that, in the humanities, is frequently only another model itself. Only by fully engaging with the methods that inform our analysis, both those that come from the humanities and those that are computational or statistical in nature, can we take full advantage of the radical nature of the insights that computation can bring to our research.

When we seek to operationalize a literary or textual concept, or when we train a supervised model, we presume (either tacitly or explicitly) that we already understand the object of our analysis. These types of analyses are valuable in the field of computational humanities because they help test our assumptions about how certain phenomena work at very large or small scales, or they can redirect our reading toward particularly meaningful examples within a corpus of text. But all of these approaches overlook the radical reconfigurations that the proxies for the objects of our analysis must undergo in order to be computationally tractable and, in turn, the ways in which these reconfigurations can help us understand the objects themselves as they are differently represented in the quantitative analysis. By approximating authors as distributions of word frequencies and not insisting on an explanatory connection between these frequencies and the mental state of a real, embodied author, we can expand our definition of what an author might be and reveal the multiplicity of factors that influence a text’s origin. By modeling characters as patterns of referentiality, we can begin to question how they are being referred to and who is doing the referring, allowing us to reconfigure a character as a nexus of textual possibility that exists between the rhetoric of the text and the reader within the matrix of a character-system. And by forgoing the a priori close reading of topics in an attempt to model genre or theme within a corpus of texts, we can use the resulting probabilities to uncover clusters of texts that demonstrate the malleability of these terms and the ways in which classifications of certain kinds of text bring together multiple systems of referentiality simultaneously.

This approach not only gives us new purchase on the ways in which computational text analysis can be of use in the humanities, but it also restores a crucial potential of the computational humanities and the digital humanities more broadly. By being guided by the methods that we adopt, by paying attention to how we approximate textual phenomena through quantification, and by taking the results of our clusters and models as reflexively interpretable objects themselves (rather than simply as proxies for the true objects that we seek to study), we can use the methods and strategies of computation to put pressure on our received theoretical and historical objects and uncover alternative ways of thinking about the material that we study.

Notes

  1. This is the premise behind, for example, Ted Underwood’s study of “the life spans of genres” in Distant Horizons or even David Bamman’s promise of a “born-literary natural language processing” in chapter 3 of this volume.

  2. Ted Underwood, in Distant Horizons, has argued that we should focus our efforts on using supervised modeling to reveal effect sizes of variables that differentiate pre-classified groups rather than creating “arbitrary measurements” to operationalize concepts (Underwood, 181). In practice, both equally involve the fitting of measurement to theory. When adopting a modeling approach, the selection of training groups and feature sets does the work of creating a quantitative proxy for a textual phenomenon: training a model on a group of texts divided by genre whose features are word frequencies still requires an operative theory of genre. A schematic sketch of such a setup follows these notes.

  3. The wealth of articles on “improving” results through these three methods demonstrates the centrality of this approach. See, for example, Rybicki and Eder; Hoover, “Testing”; Elango; J. Wu and W. Ma; Mimno et al. Lisa Rhody, in “Topic Modeling and Figurative Language,” offers an approach more in line with what I suggest here; however, as we will see in the discussion of topic modeling below, her own discourse on topic modeling falls into a slightly different binary.

  4. This deformation/defamiliarization of the text through computational means is a hallmark of Stephen Ramsay’s Reading Machines: Toward an Algorithmic Criticism (Ramsay).

  5. The precise number of terms varies between practitioners. John Burrows, in his foundational article, tests a range of most frequent words (MFWs), from 200 down through 40, settling on 150 words as the optimal number (Burrows, “‘Delta,’” 274). David Hoover argues that expanding the wordlist to 800 MFW (while filtering the list for contractions and personal pronouns) leads to improvements in author classification, while others, such as Jan Rybicki and Maciej Eder, have demonstrated that while success varies with language, word sets with even the top 2,000 words removed still show some success in author classification tasks (Hoover, “Testing”; Rybicki and Eder). Because the debate over the specific feature set used in authorship attribution is not germane to my argument in this chapter, for my work here I adopt the optimizing improvements to stylometry suggested by Peter Smith and W. Aldridge, who argue that limiting the feature set to between 200 and 300 MFWs and using cosine similarity measurements (as opposed to Manhattan or Euclidean distances) greatly improve classification (Smith and Aldridge).

  6. My analysis here follows Peter Smith and W. Aldridge’s suggestion of comparing the normalized z-scores of the 200 MFW in a corpus of texts using cosine similarity, with very few modifications or refinements to the initial clustering. The plot was generated by a t-distributed stochastic neighbor embedding (t-SNE) of the z-scores of the feature set and is largely in agreement with a clustering solution using the cosine similarity between all texts (Smith and Aldridge). A schematic sketch of this pipeline follows these notes.

  7. This is true in the textual analysis settings that I describe here, although it is worth noting that there is a robust psychological literature on the meaning of function word usage, for example (Chung and Pennebaker; Simmons, Gordon, and Chambless). The gestures toward cognitive theory in the examples that I provide above, however, are typically used as a way of explaining the relationship between feature and author without recourse to an actual engagement with psychological theory.

  8. In literary contexts, the most famous case of using stylometric features to detect authorial mental states is the claim that the late works of Agatha Christie feature a dramatically reduced vocabulary and syntactic complexity, indicating an onset of dementia. This work has been extended to the late writings of other authors (Le et al.). These metrics, however, are not the same as the feature set used in authorship attribution and are often only meaningful, in this sense, in intra-corpus comparisons. In Figure 2.1, the works of Christie, both early and late, form one coherent cluster.

  9. It is worth noting that the three clusters do not separate neatly into the three different pen names under which McBain published; however, there does seem to be a temporal distinction separating the groups, with most of the earlier works in the left-hand group. It is also notable that Walter Mosley’s work separates into three clusters as well, with his earliest novel in the “Easy Rawlins” series and the first novels in the “Socrates Fortlow” series sharing the far-left cluster.

  10. The traditional concept of the author, as a unitary, embodied individual with authority over the work, such as I describe here, is currently facing challenges from a number of directions. Postcolonial critiques have revealed the extent to which this concept of the “author” is originally a European construct, and many are working to explore alternate modalities of assigning origins to text—for example, the In the Same Boats project by Kaiama Glover and Alex Gil. Other researchers are putting pressure on the author as a legal fiction, particularly in the face of new technologies that complicate the concept of the author and, by extension, the equally Western idea of copyright law (Tay, Sik, and Chen).

  11. While this split is echoed in a dendrogram cluster based on the cosine similarity (again following Smith and Aldridge), a linear discriminant analysis (with a reduced variable set) clusters all of Grafton’s works as by the same author with high probability.

  12. When the computational misclustering of authors through stylometry has been taken seriously, it has mostly happened under the rubric of the late style of authors, such as Hoover’s work on Henry James or Andrew Piper’s exploration of Said’s theory (Hoover, “Corpus Stylistics”; Piper, 167–77).

  13. It is no coincidence that Alex Woloch centralizes Pride and Prejudice in his seminal study of character-system, The One vs. the Many: works on literary character from a humanities perspective reliably reference Austen’s corpus (Woloch; Frow; Lynch). In NLP research, Austen’s corpus has become a “gold standard” of testable material, partly because of its availability as an out-of-copyright work but also because of its ubiquity and the various heavily annotated copies held by prominent NLP practitioners (Pereira and Paraboni; Dekker, Kuhn, and van Erp; Bird, Klein, and Loper; Manning and Schutze).

  14. For example, the foundational paper on David Bamman’s BookNLP, the method that I employ here, not only uses Austen as one of the sample groups but also makes frequent reference to Austen’s characters as exemplars of the ways in which characters behave in texts (Bamman, Underwood, and Smith). Similarly, in his work on LitBank in this volume, Bamman again uses Austen’s Pride and Prejudice as a starting point for explaining coreference resolution, describing the difficulty in assigning the label of “Miss Bennet” to the “correct” Bennet sister.

  15. Even Burrows’s early study of authorship attribution sought to resolve Austen’s characters in the same manner as authors, with mixed results (Burrows).

  16. Named entity recognition describes the task of extracting proper nouns from a text and resolving these to the kind of entity that they represent (most often persons or places, but sometimes organizations, corporations, or even ideologies). Coreference resolution applies primarily to person entities and describes the task of resolving the individual mentions of a single person (or character) into a single entity. This can include both resolving variations of names (“Elizabeth,” “Miss Bennet,” or “Elizabeth Bennet”) as well as the pronouns (she, her, hers) that refer to the character throughout the text. A schematic sketch of both tasks follows these notes.

  17. Andrew Piper explores the lack of individuality among characters as a path toward identifying the specifically gendered social functions that they maintain in the novel, using, of course, Pride and Prejudice (Piper, 123–24).

  18. I am indebted for this reading to my collaborators on the Literary Lab Project “The Grammar of Gender,” particularly Regina Ta, who first noticed the discrepancies that led to this insight.

  19. Elizabeth’s doubling of Darcy’s character here is very much in line with Mary Poovey’s description of her “psychological economy” as this is a key moment in the novel at which “she directs her intelligence toward defending herself against emotional vulnerability” (Poovey, 198). In its narratological implications, it is also deeply resonant with Austen’s use of free indirect discourse, which Daniel Gunn reads as “primarily an imitation of figural speech or thought, in which the narrator echoes or mimics the idiom of the character” (Gunn, 37, emphasis in original). Although here it is Elizabeth who stands in for the narrator mimicking the idioms of Darcy, it appears that the NER is picking up on what Gunn calls “stylistic contagion” in Austen’s characterization.

  20. Ben Schmidt’s early critique of this tendency is particularly cogent (Schmidt). David Mimno and colleagues’ metric of topic coherence can be used to assess topics for their relative meaningfulness to experts in the field on which the corpus is based (Mimno et al.). A schematic sketch of this measure follows these notes.

  21. Much of David Blei’s early work on topic models focused specifically on the relevance of topic modeling to information retrieval (Blei and Lafferty).

  22. It is notable, for example, that Piper finds more meaning in a cluster of words around art and water that emerges in passages identified by an entirely different topic in his exploration of “reading topologically” in Enumerations (Piper, 75–83). Underwood’s Distant Horizons only mentions topic models once, on the final pages of his chapter on the “risks of distant reading” (Underwood, 168).

  23. According to Blei, a topic model “discovers a set of ‘topics’—recurring themes that are discussed in the collection” (Blei). The relationship between topics and themes was also explored in the eighth pamphlet of the Stanford Literary Lab, “On Paragraphs.” Here, the key finding, that paragraphs are the unit of topical coherence in fiction, is linked to the idea that topics represent themes or motifs. Although compelling, this connection is purely interpretive and is based on the observed relationship between paragraphs and topics (Algee-Hewitt, Heuser, and Moretti).

  24. More contemporary approaches are no more helpful in understanding themes, which remain significantly under-theorized in literary study. For example, Shlomith Rimmon-Kenan’s contribution to the volume Thematics: New Approaches offers what seems to be a prescient version of topic modeling as she connects theme to “aboutness” and then to “topic,” which is a “construct put together from discontinuous elements of the text” (Rimmon-Kenan, 14). In the end, however, she is only able to offer a partial and incomplete catalog that points to the indeterminacy of the term, as are her fellow critics in the volume.

  25. The four categories are chained, intruded, random, and unbalanced (Mimno et al., 264). Of the four, the unbalanced topic (which contains a mix of general and specific words) might come closest to what I describe here, but closer still would be a model that contained a number of topics with specific words and one with general terms, or vice versa.

  26. Lisa Rhody uses Latent Dirichlet Allocation topic modeling to explore the figurative language of poetry; in “Topic Modeling and Figurative Language,” she focuses on topics that do not, in her words, exhibit “thematic clarity” (Rhody). Here too, however, “thematic” remains deeply undertheorized, as she contrasts a “thematic” topic such as “genetics” in a science corpus with a topic from Anne Sexton’s “The Starry Night” that she says “draws from the language of elegy” with a relation to “death, loss and internal turmoil.” Such a topic, Rhody argues, is not “thematic”; however, when compared with Rimmon-Kenan’s discussion of the thematics of Tolstoy’s The Death of Ivan Ilyich, which she reads as “the life-giving, because insight providing, power of death,” Rhody’s reasoning behind differentiating topics on the basis of whether or not they mark “theme” becomes, at best, opaque, and at worst, arbitrary (Rimmon-Kenan, 16).

  27. See American Fiction, 1774–1920 (Farmington Hills, Mich.: Gale, 2020), https://www.gale.com/c/american-fiction-1774-1920. This corpus, based on the Lyle H. Wright bibliography of American fiction, is one of the largest bibliographically based corpora of American fiction. It purports to include all novels by American authors published by major publishing houses between 1774 and 1920. It is important to note, then, that this, by definition, excludes crucial categories of American novels, particularly those published by smaller, bespoke publishing houses or those circulated in magazines, newspapers, or by manuscript. Considering that these missing categories were largely peopled by women authors and authors of color, who systematically lacked access to the publishing and bookselling industry, the bibliography and therefore the corpus skews both white and male in its representation of American fiction. While there are novels by authors of color and women authors in the corpus, they are underrepresented. My demonstration here should therefore be taken to represent a slice of fiction written in America during this period, not “American fiction” writ large.

  28. My choice of 150 topics in this analysis was based on a sampling of other models with different numbers of topics (75, 100, 150, 175, and 200). All demonstrated some resolution around genre, although the clusters in the 150-topic model above were more distinct. A schematic sketch of this kind of comparison follows these notes.

  29. This is not to say that there are no concerns specific to the sea tale or the college novel, but those concerns are a product of the setting (the ocean or a campus), which sets the stage for the specific types of actions that can take place within the genre.

  30. Contemporary religious fiction, for example, groups around topic 103: church, god, Christian, faith, minister; while historical religious fiction is clustered by topic 91: god, shall, lord, great. Again, the difference here can be interpreted once we know what the clusters represent, but this interpretation rests on the metadata of the novels themselves and not on the topics, which, to a naive reader, are mostly indistinguishable.

  31. Although my examples here are drawn from literary criticism and computational text analysis, my conclusions are equally extensible to other humanities fields that bring computational methods to bear on supposedly “real world” objects. The use of Geographic Information Systems in the spatial humanities, for example, similarly creates models of complex relationships between society, geography, history, and space that frequently rest on ideas of a literal “ground truth” but whose “errors” and deformed geographies point to critical, and often overlooked, assumptions in the model itself. Ideas of “space” and “history” are just as abstract as “theme” or “topic.”

  32. Given the origin of the single genius author of the literary work as a function of white, male authorial practices in the Romantic period, a computational approach that calls this version of authorship into question can also create room for other (non-white, non-male) authorial traditions to surface.

  33. See chapter 1 in this volume.
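
Method Sketches

The sketches below illustrate, in deliberately simplified form, several of the methods referenced in the notes above. They are not the chapter’s actual code: every corpus, file name, library choice, and parameter value is a hypothetical placeholder unless a note specifies it. The first, keyed to note 2, shows how a supervised model of genre carries its operative theory in the pre-assigned labels and the word-frequency feature set rather than in any explicit “measurement” of genre.

```python
# Sketch for note 2: a supervised genre model in which the "operationalization"
# lives in the pre-assigned labels and the word-frequency feature set.
# The texts and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "the abbey was dark and the wind howled along the ruined cloister",
    "the inspector examined the revolver and questioned the butler at length",
    "the ghost appeared nightly in the gallery of the ancient castle",
    "the detective traced the stolen letters to a lodging house in town",
]
labels = ["gothic", "detective", "gothic", "detective"]

# Feature set: counts of the most frequent words -- itself a theory of what
# counts as evidence for genre.
vectorizer = CountVectorizer(max_features=500)
X = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000).fit(X, labels)

# The fitted coefficients play the role of "effect sizes" for individual words.
for word, weight in sorted(zip(vectorizer.get_feature_names_out(), model.coef_[0]),
                           key=lambda pair: pair[1]):
    print(f"{word:>12s} {weight:+.3f}")
```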
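Keyed to notes 5 and 6, the next sketch runs the stylometric pipeline described there: relative frequencies of the most frequent words, standardized as z-scores, compared by cosine similarity, and embedded in two dimensions with t-SNE. The miniature corpus, the small perplexity value, and the fact that such short texts cannot fill a 200-word feature set are artifacts of the illustration, not of the analysis reported in the chapter.

```python
# Sketch for notes 5-6: z-scored most-frequent-word frequencies, cosine
# similarity between texts, and a two-dimensional t-SNE embedding.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

texts = {  # hypothetical miniature corpus standing in for the novels
    "author_a_novel_1": "the carriage rolled along the lane toward the great house",
    "author_a_novel_2": "the house stood at the end of the lane beyond the carriage drive",
    "author_b_novel_1": "he checked the revolver and walked out into the night and the rain",
    "author_b_novel_2": "the rain fell on the street as he loaded the revolver in the dark",
}

# Relative frequencies of (up to) the 200 most frequent words in the corpus.
vec = CountVectorizer(max_features=200)
counts = vec.fit_transform(list(texts.values())).toarray().astype(float)
freqs = counts / counts.sum(axis=1, keepdims=True)

# Burrows-style z-scores: standardize each word's frequency across the texts.
z = (freqs - freqs.mean(axis=0)) / (freqs.std(axis=0) + 1e-9)

# Cosine similarity between texts is the basis for the author clustering.
print(np.round(cosine_similarity(z), 2))

# A two-dimensional embedding of the same matrix, analogous to the t-SNE plot
# described in note 6 (the perplexity here suits only this toy corpus).
embedding = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(z)
for name, (x, y) in zip(texts, embedding):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```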
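Keyed to note 16, the following sketch separates the two tasks defined there. The chapter’s own pipeline is BookNLP; spaCy stands in here for the named entity recognition step (and assumes its small English model has been installed), while the “coreference” pass is a deliberately naive name-overlap heuristic that, unlike real coreference resolution, cannot attach pronouns to entities.

```python
# Sketch for note 16: named entity recognition plus a toy stand-in for
# coreference resolution. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elizabeth Bennet smiled at Mr. Darcy. Miss Bennet had not heard him. She turned away.")

# Step 1: NER extracts proper-noun spans and types them (PERSON, GPE, ORG, ...).
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

# Step 2: naively group name variants that share a token. Real coreference
# resolution would also resolve the pronouns ("him," "She") to these entities.
clusters = []
for name in people:
    for cluster in clusters:
        if set(name.split()) & {tok for mention in cluster for tok in mention.split()}:
            cluster.append(name)
            break
    else:
        clusters.append([name])
print(clusters)
```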
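The coherence measure cited in note 20 can be sketched as a co-document-frequency score: every pair drawn from a topic’s top words is scored by how often the two words appear in the same document, relative to how often the lower-ranked word appears at all. This is a minimal paraphrase of the metric Mimno and colleagues describe, not their code, and it assumes that every top word occurs in at least one document.

```python
# Sketch for note 20: the co-document-frequency coherence of Mimno et al.
# Less negative scores indicate topics whose top words co-occur more often.
import math

def umass_coherence(top_words, documents):
    """top_words: a topic's most probable words, best first.
    documents: an iterable of collections of word types (one per document)."""
    docs = [set(d) for d in documents]

    def doc_freq(*words):
        return sum(1 for d in docs if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1)
                              / doc_freq(top_words[l]))
    return score

# Hypothetical example: a coherent sea-tale topic scores higher than a mixed one.
docs = [{"ship", "sea", "captain", "storm"}, {"sea", "sail", "ship", "wave"},
        {"church", "faith", "minister", "god"}]
print(umass_coherence(["sea", "ship", "captain"], docs))
print(umass_coherence(["sea", "church", "captain"], docs))
```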
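Finally, keyed to note 28, the comparison across candidate topic counts can be sketched with gensim and a document co-occurrence (“u_mass”) coherence score. The chapter’s own 150-topic model was built with MALLET (see the bibliography), so the library, the toy corpus, and the pass count below are assumptions made only for illustration.

```python
# Sketch for note 28: fit models at several topic counts and compare them with
# a u_mass coherence score. The tokenized "novels" here are toy placeholders.
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

tokenized_docs = [
    "the ship rolled in the storm and the captain held the wheel".split(),
    "the minister spoke of faith and god to the small church".split(),
    "the captain and the crew sailed the ship across the sea".split(),
    "her faith in god carried her through the long winter".split(),
]

dictionary = corpora.Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

for k in (75, 100, 150, 175, 200):
    lda = LdaModel(bow_corpus, num_topics=k, id2word=dictionary,
                   passes=5, random_state=0)
    coherence = CoherenceModel(model=lda, corpus=bow_corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    print(k, round(coherence, 3))
```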

Bibliography

  1. Algee-Hewitt, Mark, Ryan Heuser, and Franco Moretti. On Paragraphs: Scale, Themes and Narrative Form. Stanford, Calif.: Stanford Literary Lab, 2015.
  2. Argamon, Shlomo, Sushant Dhawle, Moshe Koppel, and James W. Pennebaker. “Lexical Predictors of Personality Type.” Proceedings of the 2005 Joint Annual Meeting of the Interface and the Classification Society of North America, 1–16. St. Louis: Interface, 2005.
  3. Auerbach, Erich. Mimesis: The Representation of Reality in Western Literature–New and Expanded Edition. Translated by Willard R. Trask. Princeton, N.J.: Princeton University Press, 2013.
  4. Austen, Jane. Pride and Prejudice. Edited by Vivian Jones. London: Penguin, 2006.
  5. Baayen, Harald, Hans van Halteren, Anneke Neijt, and Fiona Tweedie. “An Experiment in Authorship Attribution.” 6th JADT (2002).
  6. Bamman, David, Ted Underwood, and Noah A. Smith. “A Bayesian Mixed Effects Model of Literary Character.” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Volume 1: Long Papers, 370–79. Baltimore: Association for Computational Linguistics, 2014.
  7. Barthes, Roland. S/Z. Translated by Richard Miller. Vol. 76. New York: Hill and Wang, 1974.
  8. Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, Calif.: O’Reilly Media, 2009.
  9. Blei, David M. “Topic Modeling and Digital Humanities.” Journal of Digital Humanities 2, no. 1 (2012): 8–11.
  10. Blei, David M., and John D. Lafferty. “Topic Models.” Text Mining: Classification, Clustering, and Applications 10, no. 71 (2009): 34.
  11. Burrows, John. “‘Delta’: A Measure of Stylistic Difference and a Guide to Likely Authorship.” Literary and Linguistic Computing 17, no. 3 (2002): 267–87.
  12. Burrows, John Frederick. Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press, 1987.
  13. Chung, Cindy, and James W. Pennebaker. “The Psychological Functions of Function Words.” In Social Communication, edited by Klaus Fiedler, 343–59. Vol. 1. Sussex, UK: Psychology Press, 2007.
  14. Cowles, Gregory. “Before Sue Grafton Was a Star: Inside the List.” New York Times Books, January 5, 2018.
  15. Dekker, Niels, Tobias Kuhn, and Marieke van Erp. “Evaluating Named Entity Recognition Tools for Extracting Social Networks from Novels.” PeerJ Computer Science 5 (2019): e189.
  16. Diederich, Joachim, Jörg Kindermann, Edda Leopold, and Gerhard Paass. “Authorship Attribution with Support Vector Machines.” Applied Intelligence 19, no. 1–2 (2003): 109–23.
  17. Eder, Maciej. “Style-Markers in Authorship Attribution: A Cross-Language Study of the Authorial Fingerprint.” Studies in Polish Linguistics 6, no. 1 (2011).
  18. Elango, Pradheep. “Coreference Resolution: A Survey.” Technical Report, University of Wisconsin, Madison, 2005.
  19. Foucault, Michel. Language, Counter-Memory, Practice: Selected Essays and Interviews. Ithaca, N.Y.: Cornell University Press, 1980.
  20. Froehlich, Heather. “Dramatic Structure and Social Status in Shakespeare’s Plays.” Journal of Cultural Analytics 5, no. 1 (2020).
  21. Frow, John. Character and Person. Oxford: Oxford University Press, 2014.
  22. Glover, Kaiama, and Alex Gil. In the Same Boats, accessed February 2022. www.sameboats.org.
  23. Goldstone, Andrew, and Ted Underwood. “What Can Topic Models of PMLA Teach Us about the History of Literary Scholarship.” Journal of Digital Humanities 2, no. 1 (2012): 39–48.
  24. Gunn, Daniel P. “Free Indirect Discourse and Narrative Authority in ‘Emma.’” Narrative 12, no. 1 (2004): 35–54.
  25. Hoover, David L. “Corpus Stylistics, Stylometry, and the Styles of Henry James.” Style 41, no. 2 (2007): 174–203.
  26. Hoover, David L. “Testing Burrows’s Delta.” Literary and Linguistic Computing 19, no. 4 (2004): 453–75.
  27. Juola, Patrick. Authorship Attribution. Vol. 3. Boston: Now Publishers, 2008.
  28. Lacan, Jacques. Écrits. Translated by Bruce Fink. New York: W. W. Norton, 2006.
  29. Lau, Jey Han, Karl Grieser, David Newman, and Timothy Baldwin. “Automatic Labelling of Topic Models.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 1536–45. Portland, Ore.: Association for Computational Linguistics, 2011.
  30. Le, Xuan, Ian Lancashire, Graeme Hirst, and Regina Jokel. “Longitudinal Detection of Dementia through Lexical and Syntactic Changes in Writing: A Case Study of Three British Novelists.” Literary and Linguistic Computing 26, no. 4 (2011): 435–61.
  31. Lynch, Deidre Shauna. The Economy of Character: Novels, Market Culture, and the Business of Inner Meaning. Chicago: University of Chicago Press, 1998.
  32. MacDonald, Erin E. Ed McBain/Evan Hunter: A Literary Companion. Jefferson, N.C.: McFarland & Company, 2012.
  33. Manning, Christopher, and Hinrich Schutze. Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press, 1999.
  34. McCallum, Andrew Kachites. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu. 2002.
  35. Mimno, David, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. “Optimizing Semantic Coherence in Topic Models.” Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 262–72. Edinburgh, UK: Association for Computational Linguistics, 2011.
  36. Pereira, Daniel Bastos, and Ivandré Paraboni. “A Language Modelling Tool for Statistical NLP.” Anais do V Workshop em Tecnologia da Informação e da Linguagem Humana–TIL, 1679–88. Rio de Janeiro: Congresso da SBC, 2007.
  37. Piper, Andrew. Enumerations: Data and Literary Study. Chicago: University of Chicago Press, 2018.
  38. Poovey, Mary. The Proper Lady and the Woman Writer: Ideology as Style in the Works of Mary Wollstonecraft, Mary Shelley, and Jane Austen. Chicago: University of Chicago Press, 1985.
  39. Propp, Vladimir. Morphology of the Folktale. Vol. 9. Austin: University of Texas Press, 2010.
  40. Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Champaign: University of Illinois Press, 2011.
  41. Rhody, Lisa M. “Topic Modeling and Figurative Language.” Journal of Digital Humanities 2, no. 1 (2012).
  42. Rimmon-Kenan, Shlomith. “What Is Theme and How Do We Get at It?” In Thematics: New Approaches, edited by Claude Bremond, Joshua Landy, and Thomas Pavel, 9–20. Albany: State University of New York Press, 1995.
  43. Rybicki, Jan, and Maciej Eder. “Deeper Delta across Genres and Languages: Do We Really Need the Most Frequent Words?” Literary and Linguistic Computing 26, no. 3 (2011): 315–21.
  44. Schmidt, Benjamin M. “Words Alone: Dismantling Topic Models in the Humanities.” Journal of Digital Humanities 2, no. 1 (2012): 49–65.
  45. Simmons, Rachel A., Peter C. Gordon, and Dianne L. Chambless. “Pronouns in Marital Interaction: What Do ‘You’ and ‘I’ Say about Marital Health?” Psychological Science 16, no. 12 (2005): 932–36.
  46. Smith, Peter W. H., and W. Aldridge. “Improving Authorship Attribution: Optimizing Burrows’ Delta Method.” Journal of Quantitative Linguistics 18, no. 1 (2011): 63–88.
  47. So, Richard Jean, Hoyt Long, and Yuancheng Zhu. “Race, Writing, and Computation: Racial Difference and the US Novel, 1880–2000.” Journal of Cultural Analytics 3, no. 2 (2018).
  48. Stamatatos, Efstathios. “A Survey of Modern Authorship Attribution Methods.” Journal of the American Society for Information Science and Technology 60, no. 3 (2009): 538–56.
  49. Tay, Pek San, Cheng Peng Sik, and Wai Meng Chen. “Rethinking the Concept of an ‘Author’ in the Face of Digital Technology Advances: A Perspective from the Copyright Law of a Commonwealth Country.” Digital Scholarship in the Humanities 33, no. 1 (2018): 160–72.
  50. Underwood, Ted. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press, 2019.
  51. Vermeule, Blakey. Why Do We Care about Literary Characters? Baltimore: Johns Hopkins University Press, 2010.
  52. Woloch, Alex. The One vs. the Many: Minor Characters and the Space of the Protagonist in the Novel. Princeton, N.J.: Princeton University Press, 2009.
  53. Wu, J., and W. Ma. “A Deep Learning Framework for Coreference Resolution Based on Convolutional Neural Network.” IEEE 11th International Conference on Semantic Computing (ICSC), 61–64. San Diego: IEEE Computer Society Press, 2017.
