Manufacturing Visual Continuity
Generative Methods in the Digital Humanities
Fabian Offert and Peter Bell
The great promise of the computational humanities is a humanities scholarship on par with computer science, a deep exploration of state-of-the-art computational methods and their application to humanities questions. In this chapter, we argue that this technical focus requires a critical complement that becomes particularly obvious at the intersection of computer vision and machine learning in digital art history and digital visual studies—sometimes also referred to as the visual digital humanities (Münster and Terras). We suggest understanding “critical complement,” in the original sense of critical theory (Horkheimer, “Traditionelle und Kritische Theorie”), as the necessity to fix a historical disjunction between theory and practice: while the computational humanities have caught up to the technical state of the art in computer science practically, the methodological reflection provided by the concepts of computer science has not been adopted in a productive manner. Hence, we propose to review the fundamental computational paradigms that currently define technical experimentation in digital art history and digital visual studies. In that way, our approach takes seriously, and specifies with regard to digital art history, Taylor Arnold and Lauren Tilton’s call for collaboration with statistics in Debates in the Digital Humanities 2019: “the digital humanities should welcome statistics to the table, and this begins by better acknowledging the critical role of statistics in the field” (Arnold and Tilton, “New Data?”). Echoing Hannah Ringler in chapter 1 of this volume, we argue that not only do we need to consider “tools as interpretation and interpretation of tools as processes that need hermeneutics,” but we should also reconsider both the theoretical and practical contributions of all disciplines involved.
Specifically, we propose that the statistical distinction between generative and discriminative approaches can not only inform the methodological discourse in digital art history and digital visual studies but also provide a starting point for the exploration of previously disregarded generative machine learning techniques. While computational literary studies and related subdisciplines of the digital humanities have already implicitly embraced generative methods, the visual digital humanities lack equivalent tools. Here, we propose to investigate generative adversarial networks (GANs) as a machine learning architecture of interest. Further, we suggest that the manufactured continuity that GANs provide through advanced techniques like latent space projection can guide our interpretation of an image corpus.
Studying Images with Machine Learning
Depending on the disciplinary perspective, the (digital) study of (digital) images, with the help of machine learning, is either a rather recent development or can be traced back to the very beginnings of computing: (connectionist) machine learning, after all, started as computer vision (e.g., Papert, “The Summer Vision Project”). At the same time, it is only recently that high-level machine learning–based tools and toolkits for digital art history, like the Distant Viewing Toolkit (Arnold and Tilton), PixPlot (Duhaime), imagegraph.cc (Impett), or imgs.ai (Offert and Bell), have started to emerge, due to the increasing availability of computational resources, software libraries, technical competence, and code in the digital humanities. Within the relatively new field of machine learning–based digital art history and digital visual studies, two major and interconnected technical directions can be identified: the use of machine learning for image retrieval and the use of machine learning for image classification. There, the focus lies on the identification of objects (Crowley and Zisserman, “The State of the Art”), poses (Bell and Impett, “The Choreography of the Annunciation through a Computational Eye”), styles, and artists (Elgammal et al., “The Shape of Art History in the Eyes of the Machine”). A similar development can be seen in the study of moving images in digital film studies and the digital humanities (DH) for A/V.1
As Lev Manovich observes in Cultural Analytics, fields like digital art history and digital visual studies examine cultural data “at scale”: entire oeuvres, periods, or collections become objects of investigation. Image retrieval and image classification techniques serve to answer two different sets of questions that result from this increased scale. Image retrieval deals with questions of search: How can singular images be found within large-scale datasets without additional identifying information (i.e., metadata)? Machine learning–based solutions to retrieval problems usually involve image embeddings in combination with clustering and dimensionality reduction algorithms. Image classification, on the other hand, implies the automated labeling of images—for instance, with regard to traditional art-historical categories like provenance. Machine learning–based solutions to image classification problems often use state-of-the-art deep convolutional neural networks and high-quality training and test corpora. Importantly, both image retrieval and classification could be described as “forward” problems: an image serves as the input to a machine learning pipeline that outputs some high-level descriptor. The conceptual opposite here is a generative process, the production of images from high-level descriptors. However, while generative methods have seen a surge of popularity in artistic (Offert, “The Past, Present, and Future of AI Art”) and scientific (Offert, “Latent Deep Space”) contexts, they seem to be more or less absent from computational humanities work.
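To make the “forward” character of retrieval and clustering concrete, the following minimal sketch runs the pipeline described above (pretrained embeddings, nearest-neighbor search, dimensionality reduction, clustering) in Python. The backbone choice, the corpus directory, and the cluster count are illustrative assumptions, not a description of any particular toolkit.

```python
import glob

import torch
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from torchvision import models, transforms

# A pretrained ResNet-50 with its classification head removed maps each image
# to a 2048-dimensional embedding ("forward" direction: image -> descriptor).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

paths = sorted(glob.glob("corpus/*.jpg"))  # hypothetical corpus directory
with torch.no_grad():
    embeddings = torch.stack([
        backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in paths
    ]).numpy()

# Retrieval: find the five nearest neighbors of the first image in embedding space.
neighbors = NearestNeighbors(n_neighbors=5).fit(embeddings)
_, indices = neighbors.kneighbors(embeddings[:1])

# Overview: reduce to two dimensions for plotting and group into clusters.
coords = PCA(n_components=2).fit_transform(embeddings)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)
```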
It is important to keep in mind, however, that these distinctions are purely application-based. On a theoretical level, the rules of statistics, even if they are applied to web-scale corpora of complex, high-dimensional data, stand as unifying principles behind all machine learning approaches. While this is obvious to computer science practitioners, it is often disregarded when machine learning is discussed in the humanities context, mirroring an often-diagnosed epistemological split between computer scientists and engineers and digital humanities scholars (Mercuriali, “Digital Art History and the Computational Imagination”). We argue that, somewhat counterintuitively, statistical notions offer an alternative critical perspective on machine learning in the humanities context. We suggest that the foundational statistical distinction between discriminative and generative approaches (Ng and Jordan) can be used to guide the further development of the computational humanities.
Discriminative versus Generative Approaches
The digital humanities have often been broadly criticized for the mere use of quantitative methods, as eloquently summarized in Ted Underwood’s blog post, “It Looks like You’re Writing an Argument against Data in Literary Study . . .” While some of these critiques from the early days of the field still resonate, and general critiques of quantitative methods occasionally reappear with force, as in Claire Bishop’s critique of digital art history, generally, a consensus has grown that a blanket rejection has no grounding in the reality of digital humanities work. Instead, the focus of critique has shifted to the epistemological implications of exploratory data analysis.
One of the most powerful critiques that follows this trajectory is Nan Z. Da’s article, “The Computational Case against Computational Literary Studies.” Da writes that “all the things that appear in [computational literary studies]—network analysis, digital mapping, linear and nonlinear regressions, topic modeling, topology, entropy—are just fancier ways of talking about word frequency changes” (607). Based on the formal distinction between discriminative and generative methods, however, we can see that these methods are not all of the same kind.
To distinguish between two kinds of objects—say, apples and oranges—based on a dataset of labeled images of both, we can imagine two possible strategies of classification. We can design a model that learns the most salient difference between apples and oranges from the dataset. A good candidate for a distinctive visual feature would be color: apples are usually red; oranges are usually orange. The model then uses these most salient features to classify new, unseen samples. This is the discriminative approach. The generative approach, by contrast, learns the complete distribution of visual features for both apples and oranges. Apples come in different shades of red, yellow, and green; oranges come in different shades of orange. The model then classifies new, unseen samples by comparing their visual feature distribution to the learned feature distributions for apples and for oranges. The discriminative approach attempts to model a decision boundary between classes (it literally learns “where to draw the line” between apples and oranges), while the generative approach attempts to model the actual distribution of each class. The generative approach essentially asks what the most likely source of the observed signal is, while the discriminative approach simply looks for a way to distinguish one signal from another, without taking the source into account.
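The same contrast can be stated in a few lines of code. The following minimal sketch (Python, scikit-learn) trains a discriminative classifier (logistic regression, which models p(y|x) directly) and a generative one (Gaussian naive Bayes, which models a feature distribution p(x|y) per class) on toy “fruit” data; the feature values and class setup are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # discriminative: models p(y|x)
from sklearn.naive_bayes import GaussianNB            # generative: models p(x|y)p(y)

rng = np.random.default_rng(0)

# Toy features [redness, roundness]: apples and oranges each cluster around
# their own region of feature space, with their own spread. Invented values.
apples = rng.normal(loc=[0.8, 0.6], scale=0.1, size=(100, 2))
oranges = rng.normal(loc=[0.4, 0.9], scale=0.1, size=(100, 2))
X = np.vstack([apples, oranges])
y = np.array([0] * 100 + [1] * 100)  # 0 = apple, 1 = orange

# The discriminative model learns a decision boundary: where to draw the line.
discriminative = LogisticRegression().fit(X, y)

# The generative model learns one feature distribution per class and asks which
# class most plausibly produced a new sample.
generative = GaussianNB().fit(X, y)

sample = np.array([[0.6, 0.75]])
print("discriminative:", discriminative.predict(sample))
print("generative:", generative.predict(sample), generative.predict_proba(sample))
```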
In fact, linear regression and topic modeling fall on opposite ends of the spectrum between discriminative and generative approaches, and the prevalence of latent Dirichlet allocation (LDA), one common variant of topic modeling, in computational literary studies and in the digital humanities in general points to an even stronger claim: the digital humanities “intuitively” choose generative over discriminative approaches because they are better aligned with humanities data.
Why do the digital humanities gravitate toward generative approaches? Because generative approaches mitigate, at least in part, the alienation and the general inadequacy of quantitative methods vis-à-vis cultural artifacts. Quantitative methods, obviously, can never fully represent cultural artifacts. More precisely, both the sampling of cultural artifacts into data and the modeling of that data are reductive. In the domain of modeling, however, generative approaches stay as close to the material as possible, while discriminative approaches essentially “ignore” the material for the sake of classification. In other words, generative approaches, while not being able to mitigate the problems introduced by sampling, can mitigate the problems introduced by modeling within the realm of what can be modeled. Generative approaches will always allow for multiple interpretations to coexist, while discriminative approaches provide one, and only one, interpretation in the form of a classification decision.
Regardless, many of the problems discussed in Da’s article stand with or without generative machine learning. Shoddy hypothesis building or the lack thereof, intentional or unintentional over- or misinterpretation of the empirical evidence quantitative methods can offer, or too-broad applications of narrow technical concepts are problematic irrespective of the kind of model involved. Hence, a focus on generative approaches does not “solve” or even explicitly address these issues. Generative approaches do not magically produce a “self-reflexive account of what the model has sought to measure and the limitations of its ability to produce such a measurement,” as Richard Jean So writes (“All Models Are Wrong”). On the contrary, generative methods in fact tend to implicitly encourage the problematic “exploratory” approach.2 This became a central argument in the discussion following Da’s article (Underwood, “Critical Response II”; Da, “Critical Response III”).
Moreover, famously, the existence of “raw data” is an illusion (Gitelman, Raw Data Is an Oxymoron; Latour, Science in Action; Dobson, Critical Digital Humanities), and the existence of “neutral algorithms” even more so (Benjamin, Race after Technology; Buolamwini and Gebru, “Gender Shades”). Thus, if we propose that generative methods “stay as close to the material as possible,” we do not imply the absence of subjective guidance through the design or selection of algorithms and datasets. Indeed, it is not only dataset bias that shapes machine learning models, but inductive biases induced by pragmatic architectural decisions often further entangle subjective and machinic perspectives. In computer vision in particular, specific models will “see” things in their own idiosyncratic ways (Offert and Bell, “Perceptual Bias”; Geirhos et al., “Shortcut Learning in Deep Neural Networks”).
What a critical distinction between generative and discriminative approaches offers, regardless, is a prospective path through current and future experimental work in computer science: the digital humanities, we argue, need to critically consider this distinction when evaluating new, experimental tools and methods—all while keeping in mind that the best any machine learning model can be is a “useful-wrong” model in the sense in which George Box uses the term (“Robustness in the Strategy of Scientific Model Building”)—that is, a model that stays “reasonably” close to the material.
Generative Adversarial Networks
As Leonardo Impett has pointed out in “Open Problems in Computer Vision,” the computer vision problems that art history is concerned with are almost exclusively search-related—that is, they are classification problems. What if digital art history were to focus on generative methods instead? Would a closer relation to the material also establish itself in the domain of images?
Generative methods are already implicitly employed in the neural network–based clustering of images, which has become increasingly popular in digital art history (Wevers and Smits, “The Visual Digital Turn”). When embeddings are used, a learned subsystem of a classifier is repurposed exactly for its generative properties, which is also why recent research (Graving and Couzin, “VAE-SNE”) suggests building explicitly generative systems for the specific purpose of clustering. More recently, contrastive learning techniques like CLIP (Radford et al., “Learning Transferable Visual Models from Natural Language Supervision”) have produced embeddings of almost universal applicability, as training corpora are continuously extended and now include large parts of the internet.3 We argue that beyond these implicitly useful applications, explicitly generative methods can become for the visual domain what LDA became for the text domain: a universal instrument for the guided, unsupervised exploration of large-scale corpora.
Generative adversarial networks, first introduced by Ian Goodfellow and colleagues (“Generative Adversarial Nets”), leverage game theory to model the probability distribution of a corpus by means of a minimax game between two deep convolutional neural networks. Effectively, generative adversarial networks define a noise distribution pz that is mapped to data space via G(z; θg), where G is a “generator,” an “inverted” convolutional neural network with parameters θg that “expands” an input variable into an image, rather than “compressing” an image into a classification probability. G is trained in conjunction with a “discriminator,” a second deep convolutional neural network D(x; θd) that outputs a single scalar: D(x) represents the probability that x came from the data rather than from G. Note that the whole system, not just the “generator,” realizes the generative approach, as the whole system is needed to model p(x|y = 0). Also note that the system effectively learns a compression: a data space of much higher dimensionality than z is compressed so that it can be (approximately) reproduced from the lower-dimensional latent space. This means that details will always be lost, a fact we should keep in mind.
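A minimal sketch of this adversarial setup, in PyTorch, might look as follows. The network sizes, learning rates, and fully connected layers are placeholder assumptions; actual systems such as StyleGAN2 are convolutional, far larger, and rely on many additional stabilization tricks.

```python
import torch
import torch.nn as nn

z_dim, img_dim = 64, 28 * 28  # latent dimensionality < data dimensionality

# G "expands" a latent vector z into an image ...
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
# ... while D "compresses" an image into a single probability of being real.
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def training_step(real: torch.Tensor):
    """One round of the minimax game on a batch of real images, flattened to (batch, img_dim)."""
    batch = real.size(0)
    z = torch.randn(batch, z_dim)
    fake = G(z)

    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: push D(G(z)) toward 1, i.e., fool the discriminator.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```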
The original article by Goodfellow and colleagues demonstrates the potential of generative adversarial networks to synthesize images by generating new handwritten digits from the MNIST dataset. The MNIST dataset, however, has a resolution of 28 × 28 pixels (i.e., several orders of magnitude below standard photo resolutions). Scaling up the approach proved difficult, and while a lot of effort was made to go beyond marginal resolutions, progress was slow (for machine learning) until very recently, when StyleGAN (Karras, Laine, and Aila, “A Style-Based Generator Architecture”), a generative adversarial network that implemented several significant optimization tricks to mitigate some of the limitations of generative adversarial networks, was introduced. Current-generation models like StyleGAN2 (Karras et al., “Analyzing and Improving the Image Quality of StyleGAN”), which present another improvement over the original StyleGAN, are able to produce extremely realistic samples from large image corpora. Finally, in 2021, projects like OpenAI’s DALL-E (Ramesh et al., “Zero-Shot Text-to-Image Generation”) have shown the promise of architectures like deep variational autoencoders to potentially surpass GANs in the generation of realistic images.
Regardless of the concrete architecture, however, generative samples, to the humanist, feel uncanny. GANs obviously learn something (maybe everything?) about a corpus, and GAN samples “tick all the boxes” at first glance. At the same time, GANs seem almost useless. What knowledge is there to gain from a model that essentially learns to re-create approximations of what exists, and nothing about what exists? This questionable utility is amplified by necessary imperfections, on the one hand—after all, GANs learn a compression of an image corpus—and by a fundamentally artificial, nonhuman quality, on the other. At least since the ancient Greeks, when Aristippus, on seeing geometric drawings in the sand, famously exclaimed, “Let us be of good cheer, for I see the traces of man,” image-making has been understood as an exclusively human faculty—homo pictor as the synthesis of homo sapiens and homo faber (Jonas, “Homo Pictor”).
Interestingly, for a long time and despite impressive early results, the utility of generative adversarial networks was not entirely clear in the computer science community either. And while today there are obvious applications in digital image processing (inpainting, superresolution, image-to-image translation, style transfer) and manipulation (deep fakes), the epistemological qualities of GANs—that is, their role in scientific (and, we argue, humanist) processes of discovery—are still not fully explored. And, in fact, such an exploration would need to start not from questions of optimization or architecture search but from the image artifacts themselves; we would need a way to properly study, analyze, and interpret images produced by generative methods. In the following, we sketch what we see as the inherent potential and limitations of such an approach.
Latent Spaces as Continuous Image Spaces
We commonly assume that works of art relate to specific sociohistorical contexts. Their aesthetic autonomy is an effect of neither fully taking in, nor completely rejecting, this sociohistorical context. According to Theodor W. Adorno (Ästhetik), works of art act as a “social antithesis to society” (19), and “art’s double character as both autonomous and fait social is incessantly reproduced on the level of its autonomy” (16). Different approaches to interpretation propose different ways of navigating this dialectic relationship. What, then, is the sociohistorical context of a latent space image, of an image that has never existed in the real world?
Importantly, any latent space image is entirely determined by the corpus of images used to train the GAN that produced the latent space. In a sense, the sociohistorical context of a latent space image is thus the combined sociohistorical context of the training corpus. But while this understanding can be potentially productive for a very limited number of cases—those in which the training corpus is extremely homogeneous—it will usually not take us very far, particularly in the case of iconographic corpora that intentionally span multiple sociohistorical contexts.
More fundamentally, GAN latent spaces are imaginary spaces. They are reconstructions of the defining features of a corpus, and exploring such spaces is not the same thing as exploring the corpus itself. This imaginary quality, however, which is deeply problematic in any other application of generative methods (Offert, “Latent Deep Space”), is precisely what can be of use in the digital humanities context. Here, GAN samples are not mistaken for valid information generated from nothing, as in so many recent examples in the sciences, but can be understood as an additional means to ask questions about the information we do have about the corpus at hand. GAN latent spaces are, for better or worse, “filled to the brim” with images. This means that for each given sample, there exist millions of other samples that look almost identical to it, except for tiny details. It also means that between any two samples we can find a theoretically infinite number of “intermediate” samples—hybrid images that combine aspects of both of the samples between which they are positioned.
Digital humanities corpora, on the other hand, visual or otherwise, always exist as discrete collections of samples (and often as fragmented collections of samples as well, because many historical artifacts are lost). We argue that GAN images can reintroduce a certain manufactured continuity to such a discrete corpus, which allows us to study the discreteness of the corpus itself. By reintroducing continuity to a corpus of discrete images, we are forced to precisely quantify the semantic thresholds that support its discreteness. Discrete concepts are transformed into continuous variables. What, exactly, defines a certain iconography? How far in any direction (in the literal sense of latent space hyper-directions) can we veer off before an image that is clearly recognizable as belonging to a certain iconographic tradition stops being recognizable as such? A synthetic grammar of art emerges that is not historical like Alois Riegl’s Historical Grammar of the Visual Arts, but rather diachronic and multimodal. In a sense, if generative approaches automatically stay close to the material, using GANs would mean staying even closer to the material than is actually possible, by amending it.
Latent Space Projection
What this approach would require, however, is finding and operationalizing meaningful relations between real and latent space images. If we want to use latent space images as intermediate images, we need to reconnect them to the space of real images, or vice versa. And if we understand the sociohistorical context of an image as the set of those circumstances that determine, to a certain degree, both its existence and its interpretation, then the space of real images, the training corpus, becomes that sociohistorical context for these GAN images.
We propose to operationalize this “reconnection” as latent space projection. Latent space projection is an optimization technique that uses gradient ascent, similar to the way it is used in feature visualization, to optimize an input with respect to certain criteria. In the case of GANs, a latent code is optimized with respect to the perceptual similarity (Zhang et al., “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”) between the generated GAN image and the image that is being projected into the latent space. With latent space projection, thus, we can identify the latent double of a real image: the point in latent space, and its associated image, that comes closest to, but will never exactly match, a real image (Figure 5.1).
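A minimal sketch of such a projection loop, under stated assumptions, might look as follows. The generator call G(z), the latent dimensionality, and the step count are hypothetical placeholders; the text frames the optimization as gradient ascent on similarity, which is implemented here, equivalently, as gradient descent on perceptual distance. NVIDIA's reference StyleGAN2 projector additionally optimizes in the intermediate W space and regularizes noise inputs, details omitted here.

```python
import torch
import lpips  # perceptual similarity metric (Zhang et al., 2018); pip install lpips

perceptual = lpips.LPIPS(net="vgg")  # expects 3-channel images scaled to [-1, 1]

def project(G, target, z_dim=512, steps=5000, lr=0.01):
    """Find the latent 'double' of `target`, a 1x3xHxW tensor in [-1, 1]."""
    z = torch.randn(1, z_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        generated = G(z)                             # hypothetical generator call
        loss = perceptual(generated, target).mean()  # minimize perceptual distance
        loss.backward()
        optimizer.step()
    return z.detach(), G(z).detach()
```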
If we evaluate this idea by close-reading concrete interpolations of projected images, we immediately see how it can indeed produce insights about the corpus at large. We demonstrate this by exploring a latent space generated by a StyleGAN2-ADA network (Karras et al., “Training Generative Adversarial Networks with Limited Data”) trained for 2,000 “kimg” (thousands of real images shown to the discriminator) on approximately 1,200 derivatives and reproductions of four antique statues scraped from multiple art-historical image archives: Laocoon Group, Boy with Thorn, Farnese Hercules, and Apollo Belvedere. Starting from two projected images and their respective latent space points (using the projection algorithm supplied with the StyleGAN2-ada PyTorch reference implementation), we generate interpolations by sampling intermediate points at regular intervals with a custom-built tool.4 Figure 5.2 shows such an interpolation, between a projection of the Boy with Thorn and a projection of the Laocoon Group. Semantically relevant in this example is what could be described as the repurposing of body parts. The boy’s hat, in the first image, becomes Laocoon’s head in the last image. Likewise, the boy’s head in the first image becomes Laocoon’s upper body in the last image. In Figure 5.3, which interpolates between the latent double of a Laocoon Group drawing and the latent space equivalent of an Apollo Belvedere black-and-white photograph, we see a similar repurposing of body parts. On the syntactic level, we see that, in Figure 5.2, the gold color of the statue’s head becomes the yellowed paper in the Laocoon drawing. (The gold color can be seen in the ebook and Manifold editions of this chapter.) Likewise, in Figure 5.3, we see an increased “shading” in the drawing evolve into a three-dimensional representation in the latent space photograph.
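The interpolation step itself is simple. The following sketch samples intermediate points at regular intervals on the straight line between two projected latent codes; linear interpolation and the generator interface are assumptions, and the tool referenced in note 4 may differ in detail.

```python
import torch

def interpolate(G, z_a, z_b, steps=8):
    """Generate `steps` images at regular intervals between two latent codes."""
    images = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, steps):
            z_t = (1.0 - t) * z_a + t * z_b  # intermediate point on the line from a to b
            images.append(G(z_t))
    return images
```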
Figure 5.1. Central section of Rembrandt’s Self Portrait (1660) and a latent space double after 5000 projection iterations (dataset: https://github.com/NVlabs/metfaces-dataset; model: https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl.) The self-portrait by Rembrandt (5.1a), as well as historically adjacent portraits, are likely part of the Met Faces model, but because of the compression learned by the GAN, the latent space double can only ever be an approximation. Importantly, in the GAN latent space, an approximation (5.1b) still produces a legible image, as its solution space is a “dense” solution space that always provides a result (Offert, “Latent Deep Space”). Public domain/CC-BY-SA.
Figure 5.2. Interpolation between the latent space double of a Boy with Thorn derivative and a Laocoon Group derivative. Public domain/CC-BY-SA.
The GAN, as apparent from these examples, has picked up on two significant aspects of the corpus. First, it has internalized the distribution of color in the corpus, which has a significant bias toward grayscale (due to the oversized presence of drawings and etchings) and yellow/gold (due to its presence on many derivatives of the Boy with Thorn and the paper color of many drawings and etchings). We can validate this by plotting the real images in the corpus using a semantic clustering tool developed by the authors.5 Initially, clustering was done according to pretrained CLIP (contrastive language-image pretraining) embeddings, then according to raw color data (Figures 5.4 and 5.5, available in the ebook and Manifold editions of this chapter). Second, it has learned the importance of a central human figure, which explains the repurposing of body parts. Both of these conclusions, obviously, can be derived from the corpus itself or from more “traditional” clustering methods. The GAN latent space, however, gives us the unique opportunity of seeing these important traits of the corpus “in action”: as actual image properties, with clear visual implications. A morphology of techniques and materials emerges. The delicate and unstable Laocoon drawing is literally set into stone. Visualizing its artificial transition, then, allows us to grasp, and visualize for ourselves, the material relations between objects in the corpus.
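The two clustering passes mentioned above can be sketched as follows: once over CLIP image embeddings (semantic), once over raw color histograms. The model identifier, corpus path, and choice of four clusters (one per statue type) are assumptions; the tool referenced in note 5 may be implemented differently.

```python
import glob

import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = sorted(glob.glob("corpus/*.jpg"))  # hypothetical corpus directory
images = [Image.open(p).convert("RGB") for p in paths]

# Pass 1: semantic clustering on CLIP image embeddings.
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    clip_features = model.get_image_features(**inputs).numpy()
semantic_labels = KMeans(n_clusters=4, n_init=10).fit_predict(clip_features)

# Pass 2: clustering on raw color data (per-channel histograms).
color_features = np.stack([
    np.concatenate([
        np.histogram(np.asarray(img)[..., channel], bins=16, range=(0, 255))[0]
        for channel in range(3)
    ])
    for img in images
])
color_labels = KMeans(n_clusters=4, n_init=10).fit_predict(color_features)
```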
Figure 5.3. Laocoon Group derivative and Apollo Belvedere derivative interpolation. Public domain/CC-BY-SA.
It is important to note that, much like the technical method we are studying here, the conclusions drawn are speculative in nature: future research and further collaboration with computer scientists will have to show whether GANs and similar methods can indeed be viable tools of digital art history and digital visual studies, given the unavoidable heterogeneity of cultural image corpora, which cannot be neatly aligned, visually or historically, like the celebrity faces we see so often in GAN papers. Despite these obvious limitations, however, generative methods could provide us with more hands-on approaches to exploring image corpora, allowing us to arrange and rearrange an image corpus beyond the permutations allowed by its original elements. Mirroring Bruno Latour’s call to examine science “in action” to better understand the historical formation of scientific discoveries, the manufacturing of visual continuity would force visual relations to emerge that are usually hidden in the unbridgeable gaps between historically separate images. The speculative nature of generative methods would thus, paradoxically, provide a glimpse into a more empirical future of digital art history and digital visual studies.
Postscript: Visual Continuity in the Age of Multimodal Models
The last major revision of this chapter was submitted at the end of 2021. It is our conviction that its main argument—generative models can teach us something about the discreteness of both image corpora and image objects—still stands three years later. And yet, everything has changed. The generative models of today are not the generative models of this text: GANs, for the most part, have been succeeded by so-called multimodal models that enable the user to generate images from descriptive prompts. Multimodal models move away from the game-theoretic structure of GANs and take inspiration from natural language processing by adapting the so-called transformer architecture (Vaswani et al., “Attention Is All You Need”) to the visual domain.
The transition from one to the other, historically, began in late 2021 and early 2022, when Katherine Crowson published a series of Colab notebooks exploring ways to use CLIP (Radford et al., “Learning Transferable Visual Models”) to guide GAN-based image generation.6 CLIP learns from pairs of images and descriptive captions. In practice, this means that CLIP’s internal representational logic is informed by both visual and linguistic relations: for each “word,”7 related visual properties exist in the CLIP latent space, and vice versa. This approach was almost immediately taken up and refined by OpenAI, eventually leading to the publication of the DALL-E 2 model in April 2022 (Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”).
Compared to GANs, multimodal models—like DALL-E 2 or Stable Diffusion (Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models”)—afford an entirely different approach to latent space navigation. In GANs, navigation is necessarily geometric. There exists no interface that would facilitate the discovery of images other than latent space itself. To move from one image to the next, one has to literally traverse latent space on one or multiple axes. In multimodal models, on the other hand, navigation is guided by a discrete symbolic system—natural language.
This also implies that the strategic exploration of latent space becomes more and less intuitive at the same time. More intuitive because specific visual ideas are easier to “get to.” As Hannes Bajohr has argued, prompts could be understood as a kind of “operative ekphrasis.” A whole terminology of “visual words,” including references to existing visual properties (“in the style of”), enables the user of a multimodal model to navigate latent space without knowing anything about its topology. But it is also less intuitive because, once a visual idea is realized, its immediate neighbors are difficult to address. Unlike in the geometric navigation afforded by GANs, where we can easily move “a tiny bit to the right,” the minimum latent space “step size” in multimodal models is that of discrete tokens. What we describe as a potential benefit of GAN latent space exploration above—to explore the in-between spaces of discrete corpora made continuous—thus seems out of reach for multimodal models.
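For comparison, prompt-based navigation can be sketched in a few lines with an off-the-shelf diffusion pipeline; the model identifier and the prompt are illustrative assumptions.

```python
from diffusers import StableDiffusionPipeline

# Hypothetical checkpoint identifier; any Stable Diffusion checkpoint would do.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# Navigation happens through discrete tokens: "in the style of" plays the role
# of a visual word. There is no built-in way to ask for the image "a tiny bit
# to the right" of this one, as one could in a GAN latent space.
image = pipe("a marble statue of Laocoon, in the style of an etching").images[0]
image.save("laocoon_etching.png")
```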
At the same time, some new experimental methods promise to restore the kind of fine-grained control inherent in geometric approaches to latent space navigation. An architecture called ControlNet (Zhang and Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models”), for instance, reintroduces “classic” computer vision techniques as guiding principles for multimodal generative models. Techniques like edge detection, pose detection, and image segmentation allow the user to reach those parts of latent space that lie between tokens, where the generated image is just different enough to be considered “a different image” but not different enough to warrant a change in any high-level description. Even in the age of multimodal models, in other words, visual continuity continues to be a concept of interest—and thus should continue to be a field of experimentation in the computational humanities as well.
Notes
1. See, for instance, the two special issues of DHQ: Digital Humanities Quarterly on “Film and Video Analysis in the Digital Humanities” (ed. Manuel Burghardt et al.) and “AudioVisual Data in DH” (ed. Taylor Arnold et al.).
2. One could argue that this is an effect of the reduced interpretability of generative methods.
3. Approaches like CLIP and, to a lesser degree (at least for the moment), vision transformers also indicate an increasing entanglement of language and vision in machine learning that promises to have a significant impact on the use of machine learning in the digital humanities.
4. Available at https://github.com/zentralwerkstatt/lit-ada.
5. Available at https://github.com/zentralwerkstatt/just-the-clusters.
6. For a more extensive overview of the transition from GANs to multimodal models, see Offert, “KI-Kunst als Skulptur.”
7. In practice, the transformer architecture operates on “tokens,” which can be words but also “subwords” (i.e., suffixes, prefixes, or other partials).
Bibliography
- Adorno, Theodor W. Ästhetik. Edited by Eberhard Ortland. Vol. IV–3. Nachgelassene Schriften. Frankfurt am Main: Suhrkamp, 2009.
- Arnold, Taylor, and Lauren Tilton. “Distant Viewing: Analyzing Large Visual Corpora.” Digital Scholarship in the Humanities 34, supp. 1 (2019): i3–i16.
- Arnold, Taylor, and Lauren Tilton. “New Data? The Role of Statistics in DH.” Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein. Minneapolis: University of Minnesota Press, 2019.
- Bajohr, Hannes. “Operative Ekphrasis: Multimodal AI and the Text-Image Distinction.” Lecture given at University of California, Berkeley, January 23, 2023.
- Bell, Peter, and Leonardo Impett. “The Choreography of the Annunciation through a Computational Eye.” Revue Histoire de l’art. Les humanités numériques: de nouveaux récits en histoire de l’art? 87 (2021): 61–76.
- Benjamin, Ruha. Race after Technology: Abolitionist Tools for the New Jim Code. New York: John Wiley, 2019.
- Bishop, Claire. “Against Digital Art History.” International Journal for Digital Art History 3 (2018).
- Box, George E. P. “Robustness in the Strategy of Scientific Model Building.” In Robustness in Statistics, edited by Robert L. Launer and Graham N. Wilkinson, 201–36. New York: Academic Press, 1979.
- Buolamwini, Joy, and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of Machine Learning Research 81 (2018): 77–91.
- Crowley, E. J., and Andrew Zisserman. “The State of the Art: Object Retrieval in Paintings Using Discriminative Regions,” 1–12. British Machine Vision Association, 2014.
- Da, Nan Z. “The Computational Case against Computational Literary Studies.” Critical Inquiry 45, no. 3 (2019): 601–39.
- Da, Nan Z. “Critical Response III. On EDA, Complexity, and Redundancy: A Response to Underwood and Weatherby.” Critical Inquiry 46, no. 4 (2020): 913–24.
- Dobson, James E. Critical Digital Humanities: The Search for a Methodology. Champaign: University of Illinois Press, 2019.
- Elgammal, Ahmed, Bingchen Liu, Diana Kim, Mohamed Elhoseiny, and Marian Mazzone. “The Shape of Art History in the Eyes of the Machine.” Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1 (2018).
- Geirhos, Robert, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. “Shortcut Learning in Deep Neural Networks.” ArXiv:2004.07780 [Cs, q-Bio] (May 2020). https://arxiv.org/abs/2004.07780.
- Gitelman, Lisa. Raw Data Is an Oxymoron. Cambridge, Mass.: MIT Press, 2013.
- Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, 2672–80. 2014.
- Graving, Jacob M., and Iain D. Couzin. “VAE-SNE: A Deep Generative Model for Simultaneous Dimensionality Reduction and Clustering.” BioRxiv Preprint (2020).
- Horkheimer, Max. “Traditionelle und Kritische Theorie.” Zeitschrift für Sozialforschung 6, no. 2 (1937): 245–94.
- Impett, Leonardo. “Open Problems in Computer Vision.” Friedrich Alexander University Erlangen-Nuremberg. YouTube Video, 24:54, March 2020. https://www.youtube.com/watch?v=zsQKFxqqTto.
- Jonas, Hans. “Homo Pictor. Von der Freiheit des Bildens.” In Organismus und Freiheit. Ansätze zu einer philosophischen Biologie, 226–57. Göttingen: Vandenhoeck & Ruprecht, 1973.
- Karras, Tero, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. “Training Generative Adversarial Networks with Limited Data.” ArXiv:2006.06676 (2020).
- Karras, Tero, Samuli Laine, and Timo Aila. “A Style-Based Generator Architecture for Generative Adversarial Networks.” ArXiv:1812.04948 (2018).
- Karras, Tero, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. “Analyzing and Improving the Image Quality of StyleGAN.” ArXiv:1912.04958 (2019).
- Latour, Bruno. “Circulating Reference: Sampling the Soil in the Amazon Forest.” In Pandora’s Hope: Essays on the Reality of Science Studies, 24–79. Cambridge, Mass.: Harvard University Press, 1999.
- Latour, Bruno. Science in Action: How to Follow Scientists and Engineers through Society. Cambridge, Mass.: Harvard University Press, 1987.
- Manovich, Lev. Cultural Analytics. Cambridge, Mass.: MIT Press, 2020.
- Mercuriali, Giacomo. “Digital Art History and the Computational Imagination.” International Journal for Digital Art History 3 (2018): 141.
- Münster, Sander, and Melissa Terras. “The Visual Side of Digital Humanities: A Survey on Topics, Researchers, and Epistemic Cultures.” Digital Scholarship in the Humanities 35, no. 2 (2020): 366–89.
- Ng, Andrew, and Michael Jordan. “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes.” Advances in Neural Information Processing Systems 14 (2001).
- Offert, Fabian. “KI-Kunst als Skulptur.” In KI-Realitäten. Modelle, Praktiken und Topologien des Maschinellen Lernens, edited by Richard Groß and Rita Jordan. Transcript, 2023.
- Offert, Fabian. “Latent Deep Space: GANs in the Sciences.” Media + Environment (2021).
- Offert, Fabian. “The Past, Present, and Future of AI Art.” The Gradient (June 2019). https://thegradient.pub/the-past-present-and-future-of-ai-art/.
- Offert, Fabian, and Peter Bell. “Perceptual Bias and Technical Metapictures: Critical Machine Vision as a Humanities Challenge.” AI & Society (2020). https://link.springer.com/article/10.1007/s00146-020-01058-z.
- Papert, Seymour. “The Summer Vision Project.” MIT, 1966. https://dspace.mit.edu/handle/1721.1/6125.
- Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. “Learning Transferable Visual Models from Natural Language Supervision.” ArXiv:2103.00020 (2021).
- Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. “Hierarchical Text-Conditional Image Generation with CLIP Latents.” ArXiv:2204.06125 (2022).
- Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. “Zero-Shot Text-to-Image Generation.” ArXiv:2102.12092 (2021).
- Riegl, Alois. Historical Grammar of the Visual Arts. New York: Zone Books, 2004.
- Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. “High-Resolution Image Synthesis with Latent Diffusion Models.” In Conference on Computer Vision and Pattern Recognition (CVPR), 10674–85. 2022.
- So, Richard Jean. “All Models Are Wrong.” PMLA 132, no. 3 (2017): 668–73.
- Underwood, Ted. “Critical Response II: The Theoretical Divide Driving Debates about Computation.” Critical Inquiry 46, no. 4 (2020): 900–912.
- Underwood, Ted. “It Looks like You’re Writing an Argument against Data in Literary Study . . .” The Stone and the Shell (blog), September 21, 2017. https://tedunderwood.com/2017/09/21/it-looks-like-youre-writing-an-argument-against-data-in-literary-study/.
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (2017).
- Wevers, Melvin, and Thomas Smits. “The Visual Digital Turn: Using Neural Networks to Study Historical Images.” Digital Scholarship in the Humanities 35, no. 1 (2020): 194–207.
- Zhang, Lvmin, and Maneesh Agrawala. “Adding Conditional Control to Text-to-Image Diffusion Models.” ArXiv:2302.05543 (2023).
- Zhang, Richard, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–95. 2018.