Digital Preservation’s Place in the Future of the Digital Humanities

The following are the rough notes for a talk I gave at the University of Pittsburgh’s iSchool. I’ll likely come back later to iron out any kinks in them, but I figured I would get them up sooner rather than later, so here they are. Thanks to Alison Langmead for the invitation. You can review all the slides here.

Ensuring long term access to digital information sounds like a technical problem, like something computer scientists could solve. If we could only set up the right system we could “just solve it.” Far from it.

Digital Preservation is not primarily a technical problem

I’ve become increasingly convinced that digital preservation is in fact a core problem and issue at the heart of the future of the digital humanities.

In this talk, I will suggest how some issues and themes from the history of technology, new media studies, and archival theory gesture toward the critical role that humanities scholars and practitioners should play in framing and shaping the collection, organization, description, and modes of access to the historically contingent digital material records of contemporary society. That’s a mouthful. In short, I think there is a critical need for a conversation between work in the digital humanities and work building the collections of sources they are going to draw from.

This is a broad topic, and I am trying to pull a lot of different strands from different fields together here. So this is going to be less a comprehensive argument and more of a survey, glancing off a range of projects and ideas that point toward the important interconnections that already exist between the digital humanities and digital preservation.

What is a Digital Historian Doing with Digital Preservation

When I tell people I am a historian and I work on digital preservation I get a lot of confused looks. What on earth is a digital historian and what does it have to do with digital preservation? I’m not entirely sure what being a digital historian entails, but as far as Google Image Search is concerned, I’m part of the definition. (It’s my picture there in the green.)

What Google Image Search thinks a digital historian looks like. I’m on the grass.

But back to the point: when I mention that I do digital history and work on digital preservation, I’m often asked questions like “Isn’t that IT? Isn’t that technical? Is that like computer science? Or library science or something?” Initially I was a bit timid in responding to these queries. I was still finding my way through a highly technical field myself. I’d assert that understanding the born digital records of our society is in fact very important to historians. But I’ve been becoming bolder in this regard.

Trying not to Define the Digital Humanities

Yes, digital preservation is a technical field, one that requires technical skills. However, it also takes extensive technical skill in, say, German to be a good art historian studying modern German art. An understanding of digital artifacts should be a central part of the emergent digital humanities.

What Google Image Search’s Hive Mind thinks the Digital Humanities is/are.

This brings us to the second part of the title. What does digital preservation have to do with the emergent field of digital humanities? The digital humanities are different things to different people and I don’t want to spend too much time trying to define it/them. Again, in Google Image Search’s hive mind the digital humanities have something to do with word clouds, projects, debates and logos.

Working Definitions of the Digital Humanities

In any event, I see three primary areas of activity in DH.

  1. Computational Analytic Methods: Here I’m thinking about computational approaches to studying primary sources (think here of Google’s n-gram viewer, of corpus analysis, of various and sundry ways of using computers to count things and conduct distant reading),
  2. Experimentations in the Format of Scholarship: Here I’m thinking about work on the future of digital scholarly communication and publication (new kinds of journals, about digital scholarship, projects like Ed Ayers’s Valley of the Shadow, various kinds of online exhibitions and presentations of primary sources using platforms like Omeka),
  3. Interpreting the digital record: interpreting born digital primary sources. This last area is essential to the future of the first two.

If the digital humanities is ever to study the 21st century, that study is going to be based on born digital primary sources. We need forms of digital hermeneutics, the reflexive process of interpretation at the heart of humanities scholarship, that fit with digital texts and artifacts.

Selection and Definition: Points of Contact Between Humanists and Preservers

Importantly, there are two primary issues on which humanists have a lot to offer in shaping the digital historical record: selection and definition.

  1. Selection: What is collected and preserved
  2. Definition: What features of digital objects are significant to preserve


We can’t count on benign neglect as a process of waiting to figure out what might matter in the future. Most consumer grade digital media fail much, much sooner than analog media. Further, when digital media fail the failure is often complete, as opposed to being partially recoverable. To that end, there is a need for many to follow in the footsteps of projects like the Center for History and New Media’s September 11th Digital Archive, where a group of historians intervened and launched a site to crowdsource the collection of everything from text messages and emails to other digital traces of the attacks for future historians to make sense of. Learning lessons from areas like oral history collection, it is essential for historians to wade in and actively work to ensure that the digital ephemera of society will be available to historians of the future.

The point about selection is important, but it’s largely contiguous with current practices. Decisions about selection for collections are always fraught and contingent on the values and perspective of the collecting institution. Far more problematic is the fact that the very essence of what a digital object is remains contentious and dependent on the kinds of questions one is interested in.

What is Pitfall? It depends on what your research questions are.

For instance, what is Pitfall? Is it the binary source code? Is it the assembly code written on the wafer inside the cartridge? Is it the cartridge and the packaging? Is it what the game looks like on the screen? Any screen? Or is it what the game looked like on a cathode ray tube screen? What about an arcade cabinet that plays the game? The answer is that these are all Pitfall. However, for different people (individual scholars, patrons, users, etc.) what Pitfall is differs. If humanists want to have the right kind of thing around to work from, they need to be involved in pinning down what features of different types of objects matter in what circumstances.

This point is expansive, so I’ll briefly gloss it before going into depth on each of these topics. In keeping with much of the discourse of computing in contemporary society, there is a push toward technological solutionism that seeks to “solve” a problem like digital preservation. I suggest that there isn’t a problem, so much as there are myriad local problems contingent on what different communities value. With that said, this is not a situation of “anything goes.” Digital media are material and based on inscription, a set of insights from new media studies that offers a new basis for developing an approach to source analysis and criticism, one with a long-standing history in fields like textual scholarship.


One of the biggest problems in digital preservation is the persistent belief that the problem at hand is technical, or that digital preservation is a problem that can be solved. I’m borrowing the term solutionism from Evgeny Morozov, who himself borrowed it from architecture. Design theorist Michael Dobbins explains, “Solutionism presumes rather than investigates the problem it is trying to solve, reaching for the answer before the questions have been fully asked.” Stated otherwise, digital preservation, ensuring long term access to digital information, is not so much a straightforward problem of keeping digital stuff around, but a complex and multifaceted problem about what matters about all this digital stuff in different current and future contexts.

The technological solutionism of computing in contemporary society can easily seduce and delude us into thinking that there could be some kind of “preserve button,” or that we could right click on the folder of American Culture on the metaphorical desktop of the world and click “Preserve as…” In fact, as noted in the case of Pitfall, defining what it is that one wants to keep around is itself a vexing issue. In digital preservation this problem is often smuggled into the notion of “significant properties.”


Chimerical Significance

The problem that is all too often swept away in technical discussions of preservation is what is to be preserved. That is, in established practices for digital preservation, like web archiving, attempting to preserve rendered content is the assumed solution. Just grab the HTML and files displayed when an HTTP request is made and then play them back in a tool like the Wayback Machine. With that noted, it’s critical to realize that making sense of and interpreting, performing if you will, that content is itself a complex dance involving differing ideas about authenticity.

In the case of a web page, is it its source code, or what it looks like rendered? Is it what it looks like rendered on the particular version of the particular browser it was composed to be viewed on? Is it what it looks like when it runs on a computer with a particular vintage of internal memory clock that produces part of how visual elements flicker? If you are only interested in the textual record of the site, then the text is all you need. But if you are a conservator of net art and this happens to be an important work, you may need to spend considerable time doing ticky tacky work to ensure that the work retains its fidelity to its creator’s intent.

To make this a bit more concrete, we can turn to a small corner of a now extinct neighborhood in Geocities. For those unfamiliar, Geocities was an early online community which Yahoo! turned off in 2009. Due largely to the work of ArchiveTeam, a self described group of rogue archivists, much of Geocities was collected and distributed. Looking at a small sliver of that archive can underscore some of the issues at the heart of the problem of preserving and accessing this kind of material.

Geocities page viewed through the Internet Archive’s Wayback Machine
Same Geocities site as presented in One Terabyte of Kilobyte Age.

Here are two images of archived copies of a spot in the Capitol Hill neighborhood of Geocities. The first is what it looks like rendered in my browser at work. The second is what it looks like as presented in One Terabyte of Kilobyte Age, created by Olia Lialina and Dragan Espenschied. One Terabyte of Kilobyte Age is in effect a designed reenactment of Geocities grounded in an articulated approach to accessibility and authenticity, which plays out in an ongoing stream of posts to a Tumblr account. Back to the two images: note that the header image is missing in the first one, as displayed in my modern browser. The image is still there, but my browser isn’t doing a good job of creating a high fidelity presentation of what the site should look like.

The point is, that you can’t just “preserve it” because the essence of what matters about “it” is something that is contextually dependent on the way of being and seeing in the world that you have decided to privilege. In the case of something like Geocities, it turns out that there are a bunch of different decisions one can make about fidelity and authenticity and different collections are taking different approaches.

Dragan’s take on the trade offs inherent in different approaches to authenticity and accessibility for preserving webpages.

Dragan’s vision for the presentation is anchored in this continuum of authenticity and accessibility across the entire stack of technologies at play in the presentation of a web page. That is, One Terabyte of Kilobyte Age is a kind of critical edition (a mainstay as a scholarly product) of Geocities. Unlike many other web archiving projects, Dragan is very upfront about what it is that he has decided to privilege and focus on in this special collection or critical edition of Geocities. The resource he has created here is both an interpretation and a point of access into some of the most significant properties of Geocities that might otherwise be lost.

In short, deciding what it is that one wants to keep is vexing and problematic. With that said, it is critical to note that we do actually have something to hang on to here. There is in fact a there there when it comes to digital objects. Further, the work of humanities scholars to understand the fundamental forensic and textual traces of digital objects points the way forward to a hermeneutics, an interpretive approach to understanding and studying digital primary sources. The most essential work in this area is Matthew Kirschenbaum’s Mechanisms: New Media and the Forensic Imagination.

Materiality & Inscription

We all know that digital media is binary, that somewhere there are screens of ones and zeros doing something like in the Matrix.


The binary essence of digital media, the ones and zeros of it all, is in fact text. Inscribed at the limits of augmented human perception, the sequences of bits on a hard drive are still very much material. Inscribed in the sectors of a disk are files in formats intended to be read and interpreted by different pieces of software, software which is itself inscribed on different pieces of storage media. The point here is that the longstanding traditions of studying texts, of interpreting them, have a home at the basic root level of digital objects, which are both sequences of textual information and material culture visible in magnetic flux transitions on disk or the pits on optical media.


The structures of this media share an affinity with a strand of archival theory too.

Media and Data Structures as Fonds

Whatever your feelings about the imperative of the archivist to respect des fonds, the imposition to maintain original order and to pay attention to the provenance of materials, it remains a cornerstone of the identity and professional practice of archives. Attempting to maintain the original order in which materials were managed before being accessioned, and making decisions when processing an archive with respect to the whole, both suggest a kind of archeological or paleontological understanding of documents, records and objects. An object’s meaning is always to be understood in the context of the objects near it and the structure it is organized in.

In the analog world, it’s often difficult to infer what that order is. For instance, the Herbert A. Philbrick papers came to the Library of Congress in a mixture of boxes and trash cans.


Contrast that with the order of a floppy disk from playwright Jonathan Larson’s papers. Irrespective of his own strategies for organizing his data, and his .trashes, the computer saves and stores information like the time he last opened the files. (For more on this example, see the work of Doug Reside, Digital Curator for the Performing Arts at the New York Public Library.)


The logic of digital media, of data structures, is one of order. Even if a user tries to eschew that order, the machine insists on creating, storing and retaining all manner of technical metadata and time stamps.
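To make that concrete, here is a minimal Python sketch of the kind of technical metadata a file system keeps whether or not a user asks for it. The function name `describe_file` and the throwaway file `draft.txt` are hypothetical, chosen only for illustration. The sketch relies on the standard library's `os.stat`; which timestamps get recorded, and how reliably, varies by operating system and file system, so treat this as a sketch rather than a forensic tool.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return some of the technical metadata the file system holds for a file."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        # Timestamps are stored as seconds since the epoch; convert to UTC.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
    }


# Demonstration on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "draft.txt")
    with open(path, "w") as handle:
        handle.write("Act One")
    record = describe_file(path)

print(record)
```

Nobody typed those timestamps in. The machine recorded them as a side effect of saving the file, which is the sense in which the order is imposed by the medium rather than by the author.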

The order of bits on a disk, the structure of files in a file system, and the organization and structure of data available from an API are each fonds-like. Data and records accrue according to the process and logic of digital media. Just as the structure and organization of records and knowledge in the analog world say as much about the materials as what is inside them, so too in the digital. The layers of sediment in which something is found enable you to understand its relationship to other things. Context is itself a text to be read.

With this noted, other humanities scholars have clarified that all too often we privilege one mode of reading that underlying data structure. Our knee jerk reaction is that what is significant about a digital object is what it looks like or does on the screen.

Screen Essentialism

Digital objects are encoded information. They are bits encoded on some sort of medium. We use various kinds of software to interact with and understand those bits. In the simplest terms software reads those bits and renders them. However, the default application for opening a file isn’t the only way to go about it. You can get a sense of how different software reads different objects by changing their file extensions and opening them with the wrong application.

For example, if you just change the file extension of an .mp3 to .txt and then open the file up in your text editor of choice, you can see what happens when your computer attempts to read an audio file as a text.

While this is a big mess, notice that you can read some text in there. Notice where it says “ID3” at the top, and where you can see some text about the object and information about the collection. What you are reading is embedded metadata, a bit of text that is written into the file. The text editor can make sense of those particular arrangements of information as text.
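What the text editor is doing here can be sketched in a few lines of Python: scan the bytes and keep only the stretches that happen to be printable characters. The function name `printable_runs` is hypothetical, and the sample bytes are a made-up stand-in that merely resembles the start of a tagged audio file (an “ID3” marker, a short tag name, a title, then unreadable data). It is not a real MP3 and the sketch does not parse the ID3 format.

```python
import re


def printable_runs(data, min_length=3):
    """Return the runs of printable ASCII characters found in binary data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up stand-in for the opening bytes of a tagged audio file.
sample = (
    b"ID3\x03\x00\x00\x00\x00\x0f\x01"
    b"TIT2\x00\x00\x00\x10\x00\x00\x00"
    b"Field Recording"
    + bytes([0xFF, 0xFB, 0x90, 0x00]) * 8  # unreadable, audio-like bytes
)

print(printable_runs(sample))
```

The readable fragments come out and the rest is discarded, which is a different reading of the same bytes than an audio player would give.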


Here are an .mp3 and a .wav file of the same original recording, each renamed to a .raw file and opened in Photoshop. Look at the difference between the .mp3 on the left and the .wav on the right. What I like about this comparison is that you can see the massive difference between the size of the files visualized in how they are read as images. Notice how much smaller the black and white squares are. It’s also neat to see a visual representation of the different structure of these two kinds of files. You get a feel for the patterns in their data.
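A rough equivalent of this experiment can be sketched without Photoshop. The Python below is an assumption-laden stand-in, not what I did for the slide: it treats every byte as one gray pixel and wraps the bytes into rows, writing the result in the simple binary PGM image format. The function name `bytes_to_pgm` and the default width of 64 pixels are arbitrary choices for illustration.

```python
def bytes_to_pgm(data, width=64):
    """Pack raw bytes into a binary PGM image, one byte per gray pixel."""
    height = len(data) // width  # any trailing partial row is dropped
    if height == 0:
        raise ValueError("not enough data to fill one row")
    pixels = data[: width * height]
    # PGM header: magic number, width and height, maximum gray value.
    header = b"P5\n%d %d\n255\n" % (width, height)
    return header + pixels


# Demonstration on synthetic bytes; swap in open(path, "rb").read() for a real file.
image = bytes_to_pgm(bytes(range(256)) * 2, width=64)
print(len(image), image[:12])
```

Writing `image` to a file ending in .pgm lets many image viewers open it. A larger file simply produces a taller picture at the same width, which is one way to see the size difference between formats.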

These different readings or performances of a file aren’t particularly revelatory, except to underscore that the very act of opening a file, of seeing its contents, is a process of interpreting a text. The sequence of 1’s and 0’s is enacted in front of us by software. Formats and software are themselves essential actants in this performance, which other humanities scholars have done great work to help us understand.

Format and Medium in Platform Study

In a detailed study of the Atari 2600, Nick Montfort and Ian Bogost suggest that the study of software inevitably involves the study of layers of software on top of software intertwined with particular pieces of hardware. For example, the tiny amounts of RAM in the 2600 resulted in a complicated problem for programmers to display graphics. They extensively discuss the game Pitfall, so we can return again to its example.

Illustration from Montfort and Bogost’s Racing the Beam

This illustration shows what the game screen looks like from inside the system. Note that what we see on the screen, the area with the fellow swinging there, is really just a small portion of how the game thinks of its screen. The three large areas (vertical blank, horizontal blank, and overscan) are actually where the computations necessary for keeping score and working through the game are done. In this case, being able to understand how a game like Pitfall was innovative is intimately connected to being able to actually understand the relationship between the game’s functionality and the underlying constraints of the Atari platform. For those interested in preservation, it further complicates the idea of collecting and preserving such an artifact, as a more nuanced understanding of the platform continues to reveal important, seemingly hidden, characteristics of its nature.

Going forward, Bogost and Montfort’s notion of “platform studies” should become increasingly important to those working to preserve digital artifacts.

From their perspective, the layers in these platforms provide particular affordances and constraints but are generally taken for granted by users as a part of the platform. In this case, a platform could be anything from a piece of hardware, like the 2600, to a programming language like C++, Java, or Python, a format like MP3 or .gif, a set of protocols like HTTP and the DNS, or something like Adobe Flash that provides a language and runtime environment for works.

I’ll quote Montfort and Bogost’s explanation of platforms here at length as it is particularly pertinent.

By choosing a platform, new media creators simplify development and delivery in many ways. Their work is supported and constrained by what this platform can do. Sometimes the influence is obvious: A monochrome platform can’t display color, a video game console without a keyboard can’t accept typed input. But there are more subtle ways that platforms interact with creative production, due to the idioms of programming that a language supports or due to transistor-level decisions made in video and audio hardware. In addition to allowing certain developments and precluding others, platforms also encourage and discourage different sorts of expressive new media work. In drawing raster graphics, the difference between setting up one scan line at a time, having video RAM with support for tiles and sprites, or having a native 3D model can end up being much more important than resolution or color depth.

The point is this: the nested nature of platforms, and their ties in and out of software, hardware, and culture, are the essential problem of digital preservation and a key question for anyone interested in long term access to our digital records to grapple with. Our world increasingly runs on software and hardware platforms. From operating streetlights and financial markets, to producing music and film, to conducting research and scholarship in the sciences and the humanities, software platforms shape and structure our lives. Software platforms are simultaneously a baseline infrastructure and a mode of creative expression. They are both the key to accessing and making sense of digital objects and increasingly important historical artifacts in their own right. When historians write the social, political, economic and cultural history of the 21st century they will need to consult the platforms of our times. As underscored already, even defining the boundaries of such works is itself a fraught and interpretive project. For this reason alone I firmly believe that digital preservation is a primary challenge which should pique the interest of digital humanists.

To recap, in work on the materiality of digital objects, in conceptions like screen essentialism, humanists are already providing critical information for those interested in collecting and preserving the digital record.

Examples like Dragan’s work with Geocities illustrate how there is considerable value in closer collaboration here, where scholars actually dig in and create special collections or critical editions of digital records to clarify the perspective taken in their collection.

Aside from this, I think there is one other key reason that digital primary sources should cry out for the attention of digital humanities.

The Born Digital Record is Already Computable 

When I opened my talk, I noted that to many, the digital humanities is synonymous with computational approaches to studying texts. Coming at this from the other side, from the consideration of digital primary sources for digital preservation, we end up with far, far, far more computable data than the digitized corpora of historical texts that many of those interested in doing computational research in the humanities are working from.

Where historical works must be digitized, born digital media is by definition already computable. That is, when we gather together aggregations of data, be they web archives, aggregates of selfies from Instagram, or corpora of files from software packages, they are already computable.

In a talk about working with web archives, Historian Ian Milligan stated the problem concisely.

If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital is necessary. Every day most people generate born-digital information that if held in a traditional archive would form a sea of boxes, folders, and unstructured data. We need to be ready.

In short, the future of the computational humanities is itself going to be turning to the increasingly heterogeneous digital fonds, data sets, data dumps, corpora of software and images and logs of transactional data.


The Praxis of Digital Preservation

Dialog with each of these areas of work in the humanities is essential to the future of digital preservation.

What we need is a generation of conservators, archivists, and historians with extensive technical chops who realize just how contingent and complex deciding what bits to keep and how to go about keeping them is.

Digital objects, artifacts, texts, and data are something more than “content.” They are the material anchors, the primary sources, through which we can interpret, critique, and understand our society.

I firmly believe that ours should be a golden age for born-digital special collections, archives, troves and critical editions. The future of digital preservation is less about defining a hegemonic set of best practices than it is about scholars, curators, conservators and archivists working together to define what it is that they value about some kind of digital content and then going out to collect it and make it available for use to their constituencies. It is about setting definitions that are often at odds with each other but that are coherent toward their own ends.