Where to Start? On Research Questions in The Digital Humanities

How should digital humanities scholars develop research questions? Spurred on by this recent conversation on Twitter, I figured I would lay out a few different ways to go about answering this question about questions. The gist of the dialog is that Jason Heppler suggested that one should “Fit the tool to the question, not the other way around” when working with various kinds of new digital humanities tools. I take tools here to mean any computational instrument employed to understand the world; for example, GIS, topic modeling, creating simulations using cellular automata or agent-based models, analyzing frequencies of audio files, or visualizing trends in images. I get where Jason was going, but at least as it was formulated I don’t think it is the right advice.

The conversation prompted me to try and clarify a bit of how I see the relationship between research questions, primary sources, and tools and methods.

Start with the Question, the Archive or the Tool?

Some historians start with their question; others start with a familiarity with a period that suggests that exploring a particular archive or collection of primary sources could yield answers. Here are two examples I can recall from colleagues who I worked with doing research in the history of science.

One colleague was aware of the shift that had occurred between classical and modern physics in one astronomer’s work, documented in a recent essay. So he went to look at the papers of another astronomer, which had not yet been particularly well explored, to see if similar or different responses to the notion of a distinction between classical and modern physics had emerged in that astronomer’s work. In short, it was largely about carrying the results of one exploration over to the information available in another individual’s archive.

In either case, it’s a bit of a dance between formation of questions and the ways that those questions open up or shift and change as one gets into the complicated, rich and vast space of the possibilities of primary sources.

The Function of Research Questions in History/the Humanities

Back up a bit. What is the purpose of research questions in the humanities? I would posit that their purpose is to clarify what is in and out of scope in a project and to define where a project should start and end. Research questions also provide a constant point of reference to check back on when working on a project. You write down your questions as you go, and you can always pull them out again and check to see if, in fact, you are actually working to answer them or if you have drifted off to some other problem. Research questions are useful structures for organizing your work and inquiry, and they are valuable tools for signaling to others what to expect from a piece of scholarship. Research questions are functionally an attempt to establish the set of criteria by which a piece of scholarship should be evaluated.

The Problem of Research Proposals and Fancy Writing

One of the big problems in talking about research questions is that one often describes research questions and methods in research proposals (for grants or dissertations etc.), and those proposals are often really a form of what Joe Maxwell calls “fancy writing.” That is, those kinds of research proposals are more about the performance of demonstrating how smart you are and why you should be given permission to do work than they are about actually trying to get research done. If you haven’t read it, I can’t recommend Joe’s Qualitative Research Design: An Interactive Approach strongly enough. In focusing on the actual purpose of research design and not the performance of proposal writing he cuts through a bunch of the fancy stuff to get to the way that research questions actually develop and evolve. He calls it an interactive approach, but I think iterative would be just as descriptive.

In Maxwell’s approach, there are five components of research design as it is actually practiced.

  1. Your goals (the reason you are doing the research),
  2. Your conceptual framework (the literature you are working in, your field, your experience that you draw from),
  3. Your research questions (a set of clear statements of exactly what you are studying),
  4. Your methods (broadly conceived as the way you are going to answer the question, so for historians both the archives/sources you will work from and their perspectives are relevant as well as the way you will sample/explore them, and the actual techniques you will use to analyze and interpret them)
  5. The validity concerns and threats (literally, answers to the question “how might you be wrong” where you work through inherent limitations and biases in your methods, sources, perspective, etc.)

The diagram below illustrates how the five components of design interact.

Illustration of how research questions should be iteratively defined and developed in relation to goals, conceptual framework, methods, and validity threats. From Maxwell 2014

The main point of the diagram is that your research questions should be iteratively revised and refined throughout the work, based on the four other things that you are working on.

So… research questions aren’t something you state and then follow through on; they are best thought of as statements about your inquiry that are iteratively refined through the process of defining what you are working on.

Generally, the way that research questions are stated in quantitative research is bogus, or at least, bogus in terms of the way that people who do more qualitative research think of research questions. That is, you do a lot of work and scholarship before you can ever formulate a hypothesis that you can test. In that case, you end up with a research question at the end of an exploration not at the front of it.

Tools, Archives, & Research Questions are Inherently Theory Laden

Getting back to the issue of questions, tools, and sources: being good humanists, it is worth leaning back to grok that all method is theory laden. That is, every attempt to answer a question comes with inherent theoretical assumptions about the problem and limitations in what that method can provide in terms of answers. This is true of method broadly conceived. Every method for collecting sources and evidence, identifying a problem, interpreting sources, and composing and reporting on results comes with some inherent biases, and the original intent with which records and sources were collected creates silences of its own.

That is, all tools, all archives, and all research questions are in and of themselves instrumental. We use them in an attempt to understand the world. They all serve as lens-like tools, reflecting and refracting information back to us. I’ve always liked the way that Umberto Eco explains this in Kant and the Platypus as a core concept in hermeneutics: we make interpretations, but the underlying reality of existence exerts the force to resist some of those interpretations by simply saying “No,” making it clear that an interpretation can be refuted. What we get is a hermeneutics of data that emerges through the use of tools.

So where to start? Start wherever, as long as where you start is anchored in your goals. The hermeneutics of screwing around is itself invaluable. Messing with the tools and datasets at hand may well surface interesting patterns that no one would have found if they were working with the sources in another fashion. Pick an archive and find the questions. Or, just start with your questions and work it that way. Whatever you do, realize that it’s an exploratory process.

What matters most in where you start is your actual goals in doing the research. That is, why is it that you are actually doing your work? What is it that you hope your work will potentially do? Don’t confuse your goals with what you are interested in; recognize that your goals are about the purpose of your work. If you want to do work that ultimately helps to understand and give voice to the voiceless, then you likely don’t want to start messing around with the text of inaugural presidential speeches. If you want to figure out new kinds of things that can be done with topic modeling, then you would presumably want to start with some sources that are in a form, or close to a form, that you can topic model.

Thanks to Thomas Padilla and Zach Coble who reviewed and provided input on a draft of this post.


Posted in Uncategorized | 4 Comments

Digital Archivists: Doing or Leading the Digital?

I’ve been enjoying Jackie Dooley’s recent series of posts looking at the skills and duties that are showing up in job postings for digital archivists. I’m excited to see archives listing these. Staffing up illustrates how electronic records have become a significant issue in the minds of the deciders.

Like many who share this particular job title, I have some complicated feelings about the idea of “The Digital Archivist.” While my official job title is Digital Archivist, I’ve generally added a caveat. When I encounter someone else with that title, I often go on to explain that I’m more of a meta-digital archivist. That is, most of what I do is about policy, strategy, and standards; establishing and documenting practices, and collaborating to document and codify emerging practices. However, I’m becoming increasingly convinced that most of what I do is actually largely what digital archivist jobs should be doing.

I think the confusion about what a digital archivist should do is mostly summed up as follows:

Digital archivists should not be the people who do the digital stuff. Everybody (including the digital archivists) needs to pick up the skills necessary to work with digital records. Instead, digital archivists should be the people who are hired to lead the digital stuff.

I will elaborate on what I mean by this a bit more. I think my main issue with the idea of the digital archivist role is that I want to answer yes to two questions that some folks might imagine to be directly opposed to each other.

Should all archivists be able to work with digital materials? Yes. In this sense, all archivists must become digital archivists. It’s just a part of ongoing professional development. Digital records are not a niche area of material. Digital records are increasingly just a part of the materials archivists need to be able to process. I think some of Rebecca Goldman’s tweets on this subject illustrate the point. Other fields haven’t hired digital waitstaff, digital nurses, digital journalists, or digital lawyers to deal with the challenges of professional development around technology in their fields.

Then, does it make sense to have digital archivists as digital specialists? Yes. While everybody needs to have a basic capability, it does make sense to be cultivating leaders and specialists. In this sense, I think digital archivist jobs are best thought of as having someone who devotes their time to continually 1) figuring out and refining digital processes, workflows, and tools, and 2) teaching the rest of the staff the techniques and processes they are developing. This means that, ideally, digital archivists straddle a leadership and practice role.

Ongoing Leadership in Digital Work: Ideally, we all become educators in this future because the only thing likely to stay constant is change. We aren’t going to just establish the new “digital” practices and be done with it. The nature of digital technologies is continually and dramatically shifting. That is, the shift from storing information on devices to thin-client cloud setups is frankly as big as the shift from paper to hard drives. The first sixty years of digital technologies have illustrated that there is every reason to believe that the technological media and the nature of records will continue to evolve frequently, and we are going to need responsive practices that continually evolve with them.

An example from a different field: I think we can look to the idea of the “School Based Technology Specialist” (SBTS) role as a way to think about this. Instead of hiring someone to be the “computer person” for each of the schools in the Fairfax County school district, the district created the SBTS role. The idea is that, across the schools, teachers need to be making better use of computing technology. So it’s not about hiring someone to be the computer person but about hiring someone who is functionally an administrator to build capacity for teachers to incorporate digital technology into their practice.

In this vein, SBTS are described as trainers, liaisons, managers, troubleshooters, consultants, and collaborators. I think the parallels to the digital archivist role are rather clear. Now, schools and archives are still rather different, so it doesn’t necessarily map over straight away. But still, I think the parallels are meaningful. The digital archivist role can be thought of as a leadership role for establishing practice. I think organizations would do best to think about how digital archivists can be empowered and given the authority to lead work on digital materials.

Curious for others’ thoughts on this.

Mecha-Archivists: Envisioning the Role of Software in the Future of Archives

The Cybermen exemplify our worst fears about the future of technology: people literally turned into machines, replaced and ruled by machines. I think this is the face of a fear of a technological future of archives.

I had the privilege of participating in The Radcliffe Workshop on Technology and Archival Processing a few weeks back. I was thrilled to be on a great panel with some early career historians and Maureen Callahan.

Maureen posted her talk, The Value of Archival Description Considered, online. I encourage you to read it. It’s super good. I was thrilled to find that we are on nearly the exact same wavelength about the future of the finding aid.

There was a nice write up about the event in the Harvard Gazette. I won’t deny that I may be “a millennial who displayed affection for the word “awesome” during the panel.” However, there are some clarifications I should make.  I did not talk about obeying “cyborg overlords”, or a “mechanized shirt of armor.” In sharing some of the points of my talk I thought it would be good to focus in particular on parts of these clarifications. I think getting the language right about the future of our relationships with software is important, so here goes.

Maureen Welcomed the Robot Overlords, but with good reason!

Maureen had a few great lines in her talk (again, if you haven’t read it go do so now). One of those lines was her take on a Simpsons quote, “I for one welcome our robot overlords.” She went on to explain, in an even better line, “I don’t think that archivists are just secretaries for dead people, and I welcome as much automation as we can get for this kind of direct representation of what the records tell us about themselves.” I love this quote. When I was sitting there listening to her I was nodding so much. This is exactly the sentiment I wanted to get at.

The future of digital tools for archives is not replacing the work. It is automating the parts of the work that are not the intellectual labor. Along with that, the future of these tools is largely about taking advantage of the affordances in the nature, structure and order of digital media which give us considerable power to scale up our actions and interventions in the record.

I took the key theme from her pitch to be something like this: let the algorithms and digital tools do the repetitive and less intellectual labor of the archivist, and get the archivist more involved in the intellectual labor of the archives. Specifically, that means better contextualizing, explaining, and describing the provenance of collections and making the decisions that require the kind of sophisticated judgment that people have and exercise. Without knowing where she was going, I touched on several similar themes in my talk: ideas and visions of the labor relationship between the archivist of the future and the algorithms, scripts, and tools that work for her and do her bidding.

The welcoming of Robot overlords

We get to wear the robots!

This Lego mecha exo-suit is the vision I think we want for the future of digital tools in archives. Here, this mechanized power armor gives the archivist superpowers. Forget lifting a 30 lb box; in this suit you could move whole collections with ease. But that’s beside the point. This kind of power tool lets you do a lot of the laborious parts of the work and get back more quickly to the intellectual labors.

So we don’t want the dark vision of the robot master. We certainly don’t want the machines to turn us into the Borg or Cybermen, who lose their souls as they are taken over by the emotionless machine.

My vision for the future of the archivist using digital tools is less Borg and more Exo-suit.

The idea of mecha or exo-suits illustrates a vision of technology that extends the capabilities of its user. That is, the kinds of tools I think we need going forward are exactly the sort of thing that Maureen was talking about: things that let us automate a range of processes and actions.

We need tools that let us quickly work across massive amounts of items and objects by extending and amplifying the seasoned judgment, ethics, wisdom, and expertise of the archivist-in-the-machine.

Fondz as a Tool Thought Experiment for Automation

I was recently working with some archivists who had a project where they had nearly 400 floppy disks containing drafts of letters, books, essays, etc. In short, digital copies of all the kinds of things you find in a collection of someone’s personal papers. I hope to write about that project in more detail in the future, but for now I just wanted to talk a little about a tool that got cooked up in the process. So, what can you do with some 19,000 documents like this? Now, you can learn a ton about a set of digital files by extracting and identifying them in automated processes. That is, what kinds of files they are, their file names, size, etc. It’s really useful data! However, in most cases, this is not at all the data that a researcher or other user who might work with the collection would want. Inevitably, users want to know where information related to x, y, or z is in a collection. That is, users care about topics and subjects, and the kinds of tools most of us have at hand don’t really do much with that.

Here you can see some of the very basic kinds of information that are relatively easy to get at with existing tools: numbers of files, their sizes, and their formats. This image shows that the files processed and presented by Fondz in a particular test set come from 379 bags (in this case each bag contains a logical disk image). Collectively this includes 18,414 files in 49 formats.
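To make that kind of basic inventory concrete, here is a minimal Python sketch. It is not how Fondz does it; the function name `summarize_files` and the sample files are made up for illustration, and it treats the file extension as a crude stand-in for real format identification.

```python
import os
import tempfile
from collections import Counter

def summarize_files(root):
    """Walk a directory tree and tally file count, total size, and extensions."""
    formats = Counter()
    total_files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            total_files += 1
            total_bytes += os.path.getsize(path)
            # Assumption: the lowercased extension stands in for the format.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            formats[ext] += 1
    return {"files": total_files, "bytes": total_bytes, "formats": formats}

# Tiny demonstration on a throwaway directory with hypothetical files.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "disk001"))
    samples = {
        os.path.join("disk001", "letter.wpd"): b"Dear Sir",
        os.path.join("disk001", "cv.DOC"): b"CV",
        "notes": b"abc",
    }
    for rel, data in samples.items():
        with open(os.path.join(root, rel), "wb") as fh:
            fh.write(data)
    summary = summarize_files(root)

print(summary["files"], "files,", summary["bytes"], "bytes,",
      len(summary["formats"]), "formats")
```

Extensions are unreliable on old floppies, so a real workflow would likely lean on a dedicated format identification tool. The point is only that this layer of description is cheap to automate.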

To this end, I asked my colleague Ed Summers a while back if it would be possible to strip out all the text from these documents, topic model it, and then use the topic models as an interface to the documents. In response, he cooked up a tool called Fondz.

For those unfamiliar, the MAchine Learning for LanguagE Toolkit (MALLET) describes topic modeling as follows. “Topic models provide a simple way to analyze large volumes of unlabeled text. A “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.” In this case a tool like MALLET can quickly look across a large collection of texts and identify topical clusters of terms that appear near each other.
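For readers who want to see the idea in miniature, here is a toy sketch of one common topic model, latent Dirichlet allocation fit by collapsed Gibbs sampling, in plain Python. It is not MALLET's code and not what Fondz runs; the function name `lda_gibbs`, the parameter values, and the four tiny "documents" are all assumptions made for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how much this document and this word favor it.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical mini-collection: two "novel/film" texts and two "CV" texts.
docs = [
    "novel chapter draft character film".split(),
    "film script adaptation novel scene".split(),
    "cv biography award teaching publications".split(),
    "biography cv publications award degree".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(topic_word):
    print(k, [w for w, n in words.most_common(4) if n > 0])
```

The output is just clusters of words with counts. Deciding that a cluster means "CVs" or "the novel" is still a human call, and on a collection this small the clusters may not separate cleanly at all.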

How Edsu describes Fondz on GitHub.

I really like how Ed describes Fondz, so I’ll share it here.

fondz is a command line tool for auto-generating an “archival description” for a set of born digital content found in a bag or series of bags. The name fondz was borrowed from a humorous take on the archival principle of provenance or respect des fonds. fondz works best if you point it at a collection of content that has some thematic unity, such as a collection associated with an individual, family or organization.

Example of the Fondz topic driven interface to documents in an archival collection

Above, you can see an example of Fondz in use. This is a list of the topics that MALLET identified. In each case you see the number of documents associated with the topic on the left, and in the blue box you see the terms that MALLET has identified as being associated with that topic. The first one, with 776 documents, ends up being a cluster of versions of biographical notes and CVs; the third one, with 309 documents, is materials related to a novel and a film adaptation of that novel. MALLET doesn’t know what those topics are. It just sees clusters of terms. Based on my knowledge of the collection, I’m able to identify and name those clusters.

The result of all this is a topical point of entry to explore 19,000 digital files from hundreds of floppies. It would work just as well for OCR’ed text from recent typed and printed documents. I can’t show it to you in action because I don’t have a test collection that I can broadly share. (Note: anyone who has a similar collection they can broadly share, contact me about it.) But take my word for it. You click on one of those topics and you see a list of all the files that are associated with it, and if you click on the name of one of those files you end up seeing an HTML representation of all the text inside that file. Alongside this, a future idea would be to integrate tools that do things like Named Entity Recognition (NER) to identify strings of text that look like names of people and places. Indeed, there are already attempts to use NER for disambiguation in cultural heritage collections. What is particularly important here is not that we build tools that do this “right” but that we find and use tools that make things that are “good enough,” in that they are useful in helping people explore and find things in collections. This isn’t about robots just doing all the work. It’s about extending and amplifying our ability to make materials available to users in ways that help them “get to the stuff.” Aside from that, there is a need to provide users with information on what actions were performed on the collection to make it available. To that end, it’s exciting to realize that we can simply document what tools were used so that anyone can explore the potential biases of those tools in how they create interfaces to collection data.

So what does this all have to do with cyborgs and mecha? What is in some ways most interesting to me about topic modeling is that the topics themselves are actually somewhat arbitrary and meaningless. A topic in MALLET isn’t so much a topic in regular parlance as it is just a cluster of words that tend to appear together. It takes someone who knows the texts to make sense of those topics, to fiddle with the dials till they get topics that seem to hang together right (in MALLET you pick how many topics you want it to look for). So Fondz will be far more useful when it integrates processes for archivists to exercise their expertise and their judgment and intervene: when they can name the topics and describe them, when they can accept or reject some of the topics, and when they can rerun them.
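As a thought experiment, the kind of intervention described here could be sketched in a few lines of Python. Everything below is hypothetical: the function `build_topic_index`, the file names, and the weights are invented, and assigning each file to its single strongest topic is a simplifying assumption, not a description of how Fondz works.

```python
from collections import defaultdict

def build_topic_index(doc_weights, labels=None, rejected=()):
    """Group files under their strongest topic, applying archivist decisions."""
    labels = labels or {}
    index = defaultdict(list)
    for filename, weights in doc_weights.items():
        # Assumption: each file is listed only under its highest-weighted topic.
        best = max(range(len(weights)), key=lambda k: weights[k])
        if best not in rejected:
            index[best].append(filename)
    # Largest topics first, as in a browsing interface.
    ordered = sorted(index.items(), key=lambda item: -len(item[1]))
    return [
        {
            "topic": k,
            "label": labels.get(k, f"Topic {k}"),  # archivist-supplied name, if any
            "count": len(files),
            "files": sorted(files),
        }
        for k, files in ordered
    ]

# Hypothetical per-file topic weights, e.g. from a topic modeling run.
doc_weights = {
    "cv_1991.txt": [0.9, 0.05, 0.05],
    "bio_note.txt": [0.7, 0.2, 0.1],
    "chapter3.txt": [0.1, 0.8, 0.1],
    "misc.txt": [0.2, 0.2, 0.6],
}
entries = build_topic_index(
    doc_weights,
    labels={0: "Biographical notes and CVs", 1: "Novel drafts"},
    rejected={2},  # the archivist judged topic 2 to be noise
)
for entry in entries:
    print(entry["count"], entry["label"], entry["files"])
```

The labels and the rejected set are where the archivist's judgment enters; the machine only proposes the clusters.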

Since the goal here is to make useful descriptions, there is a potential for topic modeling to be used instrumentally to surface connections that an archivist can judge useful or not, saving and describing the useful ones. Given that good processing is done with a shovel, not with a tweezers, it is exciting to think about how tools like Fondz could integrate a range of techniques for computational analysis of the content of files to act as steam shovels: instruments that put the archivist in the driver’s seat to explore and work through relationships in collection materials and expose those to users.

There are a bunch of other clever things that Ed is doing with Fondz that warrant further discussion, but for the purpose of this post that does it. As far as take-away messages go, I’d suggest the following. The future of digital tools for digital archives is not about tools that “just work.” It’s not about replacing the work of archivists with automated processes; it’s about amplifying and extending the capabilities of an archivist to do clever things with somewhat blunt instruments (like topic modeling, NER, etc.) that make it easier for us to make materials accessible. Given that the nature of digital objects is a multiplicity of orders and arrangements, if we can generate a range of relatively quick and dirty points of entry to materials, we can invest more time and energy in making sure that when someone gets down to the item they have breadcrumbs and information that situate and contextualize the item in its collection and its custodial history. We need archival-mecha, tools that give archivists superpowers by amplifying their judgment, wisdom, knowledge, ethics, and expertise in working with digital materials. We need to make sure we are getting the computers to do what computers do best in supporting the praxis of archival practice.


Einstein as Science Santa: Monumental Meanings & Wil Wheaton

Recently, Wil Wheaton posted a picture and quote on Twitter and his blog (That time I met Albert Einstein) making use of the Albert Einstein Memorial at the National Academy of Sciences. It’s great: he is sitting on Einstein’s lap, making requests to Einstein as a kind of physics Santa. I really love how the post, and all the likes and favorites it has gotten, reinforces a set of points I made about the memorial in my essay Tripadvisor rates Einstein: using the social web to unpack the public meanings of a cultural heritage site.

I love the way this photo fits with much of the informal and playful ways that other photos of the monument work. Here is a bit of some of what I wrote in the piece on some photos of the monument on Flickr. The images are from Flickr, and the quotes from Yelp reviews of the memorial.

Most monuments in the area establish a kind of formality between visitors and the monument. Many are constructed to physically remove the subject from the reach of visitors. Others, like the nearby Lincoln Memorial, establish this formality through written rules about respectful behaviour, and a request for hushed voices. Nearly all of the reviews (17 of 21) focus on elements of the informality of the monument as a key component of what makes it enjoyable. The reviewers tell us to “climb all over ‘Al’” or as another suggests “sit on his lap, or kiss his cheek”. On Flickr, photographers have captured this in images of visitors picking and rubbing his nose, kissing him, or in a few cases arguing with him. While there are no posted notices which suggest that it is ok to climb him, if you stop by the monument on any summer day you will witness a queue of visitors waiting to climb up on him and have their picture taken.

An example of how groups of tourists use the memorial to stage group photos

The pictures are themselves an important element in this experience. The image above provides an example of one of the most popular kinds of images of the memorial posted on Flickr. As one reviewer notes, “everyone needs at least one picture of themselves sitting on “Al’s” lap”. As you can see from the photograph, the scale and size of the monument makes it work as a space for staging photos. The monument is so photogenic that one reviewer suggests that it “just begs you to go sit on Uncle Al’s lap and get our picture taken”. For these reviewers a central part of the experience is the informality that the monument provides. It invites them to climb him, and leave with photographic evidence of them sitting on the world’s most instantly recognisable scientist. While everyone has photos of themselves standing in front of the Lincoln Memorial these reviewers believe “Your tour of the Mall is not complete” without having your picture taken on Einstein’s lap.

Photo: Schmidt, C., 2008. Arguing with Einstein, Available at: http://www.flickr.com/photos/chrisbrenschmidt/2190660089/

It is worth taking a moment to reflect on how some of the previous quotes refer to Einstein. The informality of these experiences is further communicated through a persistent use of his first name, or in some cases the diminutive form of his name, Al. This is itself a frequent component of these reviews. In using his first name, or calling him ‘Al’ the reviewers are communicating and playing with the informality of the memorial. The pervasiveness of this informality may be best evidenced in the recollections of a college student from a nearby university who ‘spent a lot of time just hanging out with ‘Al’’. The informality of the space and the fact that it is climbable leads many reviewers to discuss how it is a perfect place to bring kids. Many of the photos of the monument on Flickr show young children climbing all over him.

This level of informality is not something that all the reviewers think is necessarily a good thing. One reviewer suggests “most of the neat stuff was totally ignored by all the kids using the statue as a playground”. This reviewer goes on to suggest that the other elements in the composition of the statue, the quotations, and the map of the stars at his feet go unnoticed. From his perspective, visitors were “just jumping around”. He felt that “no one learned or read about the man memorialised”. This reviewer further suggests that it is ‘disrespectful’ to climb all over the monument, particularly when there is no clear indication that touching or climbing the memorial is officially sanctioned by the sculptor or the National Academy of Sciences. There is definitely credence to the questions the reviewer raises. To what extent are these visitors leaving with an understanding of the intentions behind the memorial? Certainly some visitors’ suggestions that “You can climb on the damn thing and stick pennies up his nose” take on a disrespectful tone. However, that is itself an interesting point of tension in the idea of Einstein. The more recently constructed Franklin Delano Roosevelt Memorial, which is built on a scale that would allow one to climb on him, does not invite the same kind of interaction. Popular notions of Einstein as an informal figure have translated into how people interact with the memorial. The relaxed experience Berks found in sculpting the memorial from life is very directly translated into visitors’ comments about the informality and relaxing nature of the experience of the monument.

This is just a way of saying that those of us interested in public memory and the role of memorials really need to be watching the ways that people make use and sense of them on social media. At this point, our experiences of these spaces are increasingly going to be seen through the lens of the tweets, reviews, and photos that others have taken, shared, and commented on.

Digital Preservation’s Place in the Future of the Digital Humanities

The following are the rough notes for a talk I gave at the University of Pittsburgh’s iSchool. I’ll likely come back later to iron out any kinks in them, but figured I would get them up sooner rather than later so here they are. Thanks to Alison Langmead for the invitation. You can review all the slides here.

Ensuring long term access to digital information sounds like a technical problem; like it could be a problem for computer scientists to solve. If we could only set up the right system we could “just solve it”. Far from it.

Digital Preservation is not primarily a technical problem

I’ve become increasingly convinced that digital preservation is in fact a core problem and issue at the heart of the future of the digital humanities.

In this talk, I will suggest how some issues and themes from the history of technology, new media studies, and archival theory, gesture toward the critical role that humanities scholars and practitioners should play in framing and shaping the collection, organization, description, and modes of access to the historically contingent digital material records of contemporary society. That’s a mouthful. In short, I think there is a critical need for a dialog and conversation between work in the digital humanities and work building the collections of sources they are going to draw from.

This is a broad topic, and I am trying to pull a lot of different strands from different fields together here. So this is going to be less a comprehensive argument and more of a survey, glancing off a range of projects and ideas that point toward the important interconnections that already exist between the digital humanities and digital preservation.

What is a Digital Historian Doing with Digital Preservation

When I tell people I am a historian and I work on digital preservation I get a lot of confused looks. What on earth is a digital historian and what does it have to do with digital preservation? I’m not entirely sure what being a digital historian entails, but as far as google image search is concerned, I’m part of the definition. (It’s my picture there in the green).

What google image search thinks digital historian looks like. I’m on the grass.

But back to the point, when I mention that I do digital history and I work on digital preservation I’m often asked questions like “Isn’t that IT? Isn’t that technical? Is that like computer science? Or, library science or something?” Initially I was a bit timid in responding to these queries. I was still finding my way through a highly technical field myself. I’d assert that understanding the born digital records of our society is in fact very important to historians. But I’ve been becoming bolder in this regard.

Trying not to Define the Digital Humanities

Yes, digital preservation is a technical field, one that requires technical skills. However, it also requires extensive technical skills in, say, German to be a good art historian studying modern German art. An understanding of digital artifacts should be a central part of the emergent digital humanities.

What Google Image Search’s Hive Mind thinks the Digital Humanities is/are.

This brings us to the second part of the title. What does digital preservation have to do with the emergent field of digital humanities? The digital humanities are different things to different people and I don’t want to spend too much time trying to define it/them. Again, in Google image search’s hive mind the digital humanities have something to do with word clouds, projects, debates and logos.

Working Definitions of the Digital Humanities

In any event, I see three primary areas of activity in DH.

  1. Computational Analytic Methods: Here I’m thinking about computational approaches to studying primary sources (think here of Google’s n-gram viewer, of corpus analysis, of various and sundry ways of using computers to count things and conduct distant reading).
  2. Experimentations in the Format of Scholarship: Here I’m thinking about work on the future of digital scholarly communication and publication (new kinds of journals, about digital scholarship, projects like Ed Ayers’s Valley of the Shadow, various kinds of online exhibitions and presentations of primary sources using platforms like Omeka).
  3. Interpreting the Digital Record: Here I’m thinking about interpreting born digital primary sources. This last area is essential to the future of the first two.

If the digital humanities is ever to study the 21st century, that study is going to be based on born digital primary sources. We need forms of digital hermeneutics, the reflexive process of interpretation at the heart of humanities scholarship, that fit with digital texts and artifacts.

Selection and Definition: Points of Contact Between Humanists and Preservers

Importantly, there are two primary issues on which humanists have a lot to offer in shaping the digital historical record: selection and definition.

  1. Selection: What is collected and preserved
  2. Definition: What features of digital objects are significant to preserve


We can’t count on benign neglect as a process of waiting to figure out what might matter in the future. The lifespan of most consumer grade digital media is much, much shorter than the lifespan of analog media. Further, when digital media fail the failure is often complete, as opposed to being partially recoverable. To that end, there is a need for many to follow in the footsteps of projects like the Center for History and New Media’s September 11th Digital Archive, where a group of historians intervened and launched a site to crowdsource the collection of everything from text messages and emails to other digital traces of the attacks for future historians to make sense of. Learning lessons from areas like oral history collection, it is essential for historians to wade in and actively work to ensure that the digital ephemera of society will be available to historians of the future.

The point about selection is important, but it’s largely contiguous with current practices. Decisions about selection for collections are always fraught and contingent on the values and perspective of the collecting institution. Far more problematic is the fact that the very essence of what a digital object is is itself contentious and dependent on the kinds of questions one is interested in.

What is Pitfall? It depends on what your research questions are.

For instance, what is Pitfall? Is it the binary source code, is it the assembly code written on the wafer inside the cartridge, is it the cartridge and the packaging, is it what the game looks like on the screen? Any screen? Or is it what the game looked like on a cathode ray tube screen? What about an arcade cabinet that plays the game? The answer is that these are all Pitfall. However, for different people (individual scholars, patrons, users, etc.) what Pitfall is is different. If humanists want to have the right kind of thing around to work from they need to be involved in pinning down what features of different types of objects matter for what circumstances.

This point is expansive, so I’ll briefly gloss it before going into depth on each of these topics. In keeping with much of the discourse of computing in contemporary society, there is a push toward technological solutionism that seeks to “solve” a problem like digital preservation. I suggest that there isn’t a problem, so much as there are myriad local problems contingent on what different communities value. With that said, this is not a situation of “anything goes.” Digital media are material, and based on inscription. This set of insights from new media studies offers a new basis for us to develop an approach to source analysis and criticism of the kind that has a long standing history in fields like textual scholarship.


One of the biggest problems in digital preservation is that there is a persistent belief by many that the problem at hand is technical, or that digital preservation is a problem that can be solved. This is a kind of solutionism. I’m borrowing this term from Evgeny Morozov, who himself borrowed it from architecture. Design theorist Michael Dobbins explains, “Solutionism presumes rather than investigates the problem it is trying to solve, reaching for the answer before the questions have been fully asked.” Stated otherwise, digital preservation, ensuring long term access to digital information, is not so much a straightforward problem of keeping digital stuff around, but a complex and multifaceted problem about what matters about all this digital stuff in different current and future contexts.

The technological solutionism of computing in contemporary society can easily seduce and delude us into thinking that there could be some kind of “preserve button”. Or that we could right click on the folder of American Culture on the metaphorical desktop of the world and click “Preserve as…” In fact, as noted in the case of Pitfall!, defining what it is that one wants to keep around is itself a vexing issue. In digital preservation this problem is often smuggled into the notion of “significant properties.”


Chimerical Significance

The problem that is all too often swept away in technical discussions of preservation is what is to be preserved. That is, in established practices for digital preservation, like web archiving, attempting to preserve rendered content is the assumed solution. Just grab the HTML and files displayed when an HTTP request is made and then play them back in a tool like the Wayback Machine. With that noted, it’s critical to realize that making sense of and interpreting, performing if you will, that content is itself a complex dance involving differing ideas about authenticity.
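To get a feel for why even "just grab the HTML and files" involves choices, here is a minimal Python sketch, using only the standard library, that lists the other files a single page asks a browser to load. The sample markup, the `ResourceLister` class, and its short table of tags are all made up for illustration; real web archiving tools do considerably more than this.

```python
from html.parser import HTMLParser


class ResourceLister(HTMLParser):
    """Collects URLs of files that a page asks the browser to load."""

    # Hypothetical and deliberately incomplete map of tags to URL attributes.
    URL_ATTRS = {"img": "src", "script": "src", "link": "href", "embed": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(value)


def list_resources(html_text):
    """Return the referenced file URLs in the order they appear."""
    parser = ResourceLister()
    parser.feed(html_text)
    return parser.resources


# Made-up sample page in the spirit of an early personal home page.
SAMPLE_PAGE = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body background="stars.gif">
<img src="header.gif"><embed src="theme.mid">
</body></html>
"""

if __name__ == "__main__":
    for url in list_resources(SAMPLE_PAGE):
        print(url)
```

Note that this sketch never picks up the tiled background image on the `body` tag, because that attribute is not in its table. Even at this level, what counts as "the page" depends on what the person writing the tool decided to look for.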

In the case of a web page, is it its source code, or what it looks like rendered? Is it what it looks like rendered on the particular version of the particular browser it was composed to be viewed on? Is it what it looks like when it runs on a computer with a particular vintage of internal memory clock that produces part of how visual elements flicker? If you are only interested in the textual record of the site, then the text is all you need. But if you are a conservator of net art and this happens to be an important work, you may need to spend considerable time doing ticky tacky work to ensure that the work retains its fidelity to its creator’s intent.

To make this a bit more concrete, we can turn to a small corner of a now extinct neighborhood in Geocities. For those unfamiliar, Geocities was an early online community which Yahoo! turned off in 2009. Due largely to the work of ArchiveTeam, a self described group of rogue archivists, much of Geocities was collected and distributed. Looking at a small sliver of that archive can underscore some of the issues at the heart of the problem of preserving and accessing this kind of material.

Geocities page viewed through the Internet Archive’s Wayback Machine

Same Geocities site as presented in One Terabyte of Kilobyte Age.

Here are two images of archived copies of a spot in the Capitol Hill neighborhood of Geocities. The first one is what it looks like rendered on my browser at work. The second one is what it looks like as presented in One Terabyte of Kilobyte Age. Created by Olia Lialina & Dragan Espenschied, One Terabyte of Kilobyte Age is in effect a designed reenactment of Geocities grounded in an articulated approach to accessibility and authenticity which plays out in an ongoing stream of posts to a Tumblr account. Back to the two images: Note that the header image is missing in the first one, as displayed in my modern browser. The image is still there, but my browser isn’t doing a good job at creating a high fidelity presentation of what the site should look like.

The point is that you can’t just “preserve it” because the essence of what matters about “it” is something that is contextually dependent on the way of being and seeing in the world that you have decided to privilege. In the case of something like Geocities, it turns out that there are a bunch of different decisions one can make about fidelity and authenticity and different collections are taking different approaches.

Dragan’s take on the trade offs inherent in different approaches to authenticity and accessibility for preserving webpages.

Dragan’s vision for the presentation is anchored in this continuum of authenticity and accessibility across the entire stack of technologies at play in the presentation of a web page. That is, One Terabyte of Kilobyte Age is a kind of critical edition (a mainstay as a scholarly product) of Geocities. Unlike many other web archiving projects, Dragan is very upfront about what it is that he has decided to privilege and focus on in this special collection or critical edition of Geocities. The resource he has created here is both an interpretation and a point of access into some of the most significant properties of Geocities that might otherwise be lost.

In short, deciding what it is that one wants to keep is vexing and problematic. With that said, it is critical to note that we do actually have something to hang on to here. There is in fact a there there when it comes to digital objects. Further, the work of humanities scholars to understand the fundamental forensic and textual traces of digital objects points the way forward to a hermeneutics, an interpretive approach to understanding and studying digital primary sources. The most essential work in this area is Matthew Kirschenbaum’s Mechanisms: New Media and the Forensic Imagination.

Materiality & Inscription

We all know that digital media is binary, that somewhere there are screens of ones and zeros doing something like in the Matrix.


The binary essence of digital media, the ones and the zeros of it all, are in fact texts. Inscribed at the limits of augmented human perception, the sequences of bits on a hard drive are still very much material. Inscribed in the sectors of a disk are files in formats intended to be read and interpreted by different pieces of software, software which is itself inscribed on different pieces of storage media. The point here is that the longstanding traditions of studying texts, of interpreting them, have a home at the basic root level of digital objects, which are both sequences of textual information and material culture visible in magnetic flux transitions on disk or the pits on optical media.


The structures of this media share an affinity with a strand of archival theory too.

Media and Data Structures as Fonds

Whatever your feelings about the imperative of the archivist to respect des fonds, the imposition to maintain original order and to pay attention to provenance of materials, it remains a cornerstone of the identity and professional practice of archives. Attempting to maintain the original order in which materials were managed before being accessioned and making decisions when processing an archive with respect to the whole both suggest a kind of archeological or paleontological understanding of documents, records and objects. An object’s meaning is always to be understood in context of the objects near it and the structure it is organized in.

In the analog world, it’s often difficult to infer what that order is. For instance, the Herbert A Philbrick papers came to the Library of Congress in a mixture of boxes and trash cans.


Contrast that with the order of a floppy disk from playwright Jonathan Larson’s papers. Irrespective of his own strategies for organizing his data, and his .trashes, the computer saves and stores information like the time he last opened the files. (For more on this example, see the work of Doug Reside, Digital Curator for the Performing Arts at the New York Public Library.)


The logic of digital media, of data structures, is one of order. Even if a user tries to eschew that order, the machine insists on creating, storing and retaining all manner of technical metadata and time stamps.
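As a small illustration of the technical metadata a machine keeps whether or not anyone asks it to, here is a minimal Python sketch that reads the size and timestamps a file system records for a file. The `describe_file` helper and the throwaway file it inspects are my own inventions for the example, and what gets recorded (access times in particular) varies by operating system and how a disk is mounted.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return the size and timestamps the file system keeps for a file."""
    info = os.stat(path)

    def as_utc(seconds):
        return datetime.fromtimestamp(seconds, tz=timezone.utc)

    return {
        "size_bytes": info.st_size,
        "last_modified": as_utc(info.st_mtime),
        # Some systems update access times lazily or not at all.
        "last_accessed": as_utc(info.st_atime),
    }


if __name__ == "__main__":
    # Write a throwaway file so the example runs anywhere.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("a draft of a lyric")
        path = handle.name
    for field, value in describe_file(path).items():
        print(field, value)
    os.remove(path)
```

None of this was typed in by the person who wrote the file. It accrues as a side effect of saving and opening, which is part of what makes it useful as evidence.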

The order of bits on a disk, the structure of files in a file system, the organization and structure of data available from an API are each fonds-like. Data and records accrue according to the process and logic of digital media. Just as the structure and organization of records and knowledge in the analog world says as much about the materials as what is inside them, so the same is true in the digital. The layers of sediment in which something is found enable you to understand its relationship to other things. Context is itself a text to be read.

With this noted, other humanities scholars have clarified that all too often we privilege one mode of reading that underlying data structure. Our knee jerk reaction is that what is significant about a digital object is what it looks like or does on the screen.

Screen Essentialism

Digital objects are encoded information. They are bits encoded on some sort of medium. We use various kinds of software to interact with and understand those bits. In the simplest terms software reads those bits and renders them. However, the default application for opening a file isn’t the only way to go about it. You can get a sense of how different software reads different objects by changing their file extensions and opening them with the wrong application.

For example, if you just change the file extension of an .mp3 to .txt and then open the file up in your text editor of choice, you can see what happens when your computer attempts to read an audio file as a text.

While this is a big mess, notice that you can read some text in there. Notice where it says “ID3” at the top, and where you can see some text about the object and information about the collection. What you are reading is embedded metadata, a bit of text that is written into the file. The text editor can make sense of those particular arrangements of information as text.
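The same reading can be done on purpose rather than by accident. Here is a minimal Python sketch that does two things: it parses the ten byte ID3v2 header that sits at the start of many MP3 files, and it pulls out the runs of printable characters a text editor would let you read. The function names and the hand-built `FAKE_MP3` bytes are assumptions of mine for the sake of the example, not a real recording, and the sketch makes no attempt to decode the individual metadata frames.

```python
def read_id3_header(data):
    """Parse the 10-byte ID3v2 header at the start of an MP3, if present."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    # The tag size is stored as four bytes that each carry 7 bits ("synchsafe").
    size = 0
    for byte in data[6:10]:
        size = (size << 7) | (byte & 0x7F)
    return {"version": f"2.{major}.{revision}", "flags": flags, "tag_size": size}


def printable_runs(data, min_length=3):
    """Yield runs of printable ASCII, roughly what a text editor lets you read."""
    run = bytearray()
    for byte in data:
        if 32 <= byte < 127:
            run.append(byte)
        else:
            if len(run) >= min_length:
                yield run.decode("ascii")
            run.clear()
    if len(run) >= min_length:
        yield run.decode("ascii")


# Hand-built stand-in for the first bytes of an MP3: a header, one title
# frame, and a few bytes that look like the start of audio data.
FAKE_MP3 = (
    b"ID3\x03\x00\x00\x00\x00\x00\x15"
    + b"TIT2\x00\x00\x00\x0b\x00\x00\x00Field Tape"
    + b"\xff\xfb\x90\x00"
)

if __name__ == "__main__":
    print(read_id3_header(FAKE_MP3))
    print(list(printable_runs(FAKE_MP3)))
```

The text editor and this script are both just software making a decision about which bytes to treat as letters.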


Here are an .mp3 and a .wav file of the same original recording changed to a .raw file and opened in Photoshop. Look at the difference between the .mp3 on the left and the .wav on the right. What I like about this comparison is that you can see the massive difference between the size of the files visualized in how they are read as images. Notice how much smaller the black and white squares are. It’s also neat to see a visual representation of the different structure of these two kinds of files. You get a feel for the patterns in their data.
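If you don’t have Photoshop handy, something in the same spirit can be sketched in a few lines of Python. This is not what Photoshop does internally; it is a minimal stand-in, under my own assumptions, that treats each byte of a file as one grayscale pixel and writes the result out as a PGM image, a very simple format many image viewers can open. The `bytes_to_pgm` name, the row width of 64, and the output filename are all arbitrary choices for the example.

```python
import math


def bytes_to_pgm(data, width=64):
    """Lay raw bytes out as rows of grayscale pixels in binary PGM (P5) format."""
    if width < 1:
        raise ValueError("width must be at least 1")
    height = max(1, math.ceil(len(data) / width))
    # Pad the last row with black so every row has the same width.
    pixels = data.ljust(width * height, b"\x00")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + pixels


if __name__ == "__main__":
    # Stand-in data: a repeating ramp of values instead of a real audio file.
    sample = bytes(range(256)) * 16
    with open("bytes_as_image.pgm", "wb") as out:
        out.write(bytes_to_pgm(sample, width=64))
```

To try it on a real file, read it with `open(path, "rb").read()` and pass the result in. Changing the width changes the picture, which is one more reminder that the image is a reading of the bytes and not something sitting inside them.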

These different readings or performances of a file aren’t particularly revelatory, except to underscore that the very act of opening a file, of seeing its contents, is a process of interpreting a text. The sequence of 1’s and 0’s is enacted in front of us by software. Formats and software are themselves essential actants in this performance, which other humanities scholars have done great work to help us understand.

Format and Medium in Platform Study

In a detailed study of the Atari 2600, Nick Montfort and Ian Bogost suggest that the study of software inevitably involves the study of layers of software on top of software intertwined with particular pieces of hardware. For example, the tiny amounts of RAM in the 2600 resulted in a complicated problem for programmers to display graphics. They extensively discuss the game Pitfall, so we can return again to its example.

Illustration from Montfort and Bogost’s Racing the Beam

This illustration shows what the game screen looks like from inside the system. Note that what we see on the screen, the area with the fellow swinging there, is really just a small portion of how the game thinks of its screen. The three large areas (vertical blank, horizontal blank, and overscan) are actually where the computations necessary for keeping score and working through the game are done. In this case, being able to understand how a game like Pitfall was innovative is intimately connected to being able to actually understand the relationship between the game’s functionality and the underlying constraints of the Atari platform. For those interested in preservation, it further complicates the idea of collecting and preserving such an artifact, as a more nuanced understanding of the platform continues to reveal important, seemingly hidden, characteristics of its nature.

Going forward, Bogost and Montfort’s notion of “platform studies” should become increasingly important to those working to preserve digital artifacts.

From their perspective, the layers in these platforms provide particular affordances and constraints but are generally taken for granted by users as a part of the platform. In this case, a platform could be anything from a piece of hardware, like the 2600, a programming language like C++, Java, or Python, or a format, like MP3 or .gif, or a set of protocols, like HTTP and the DNS, or something like Adobe Flash that provides a language and runtime environment for works.

I’ll quote Montfort and Bogost’s explanation of platforms here at length as it is particularly pertinent.

By choosing a platform, new media creators simplify development and delivery in many ways. Their work is supported and constrained by what this platform can do. Sometimes the influence is obvious: A monochrome platform can’t display color, a video game console without a keyboard can’t accept typed input. But there are more subtle ways that platforms interact with creative production, due to the idioms of programming that a language supports or due to transistor-level decisions made in video and audio hardware. In addition to allowing certain developments and precluding others, platforms also encourage and discourage different sorts of expressive new media work. In drawing raster graphics, the difference between setting up one scan line at a time, having video RAM with support for tiles and sprites, or having a native 3D model can end up being much more important than resolution or color depth.

The point is as follows: the nested nature of platforms, their ties in and out of software and hardware and culture, are the essential problem of digital preservation and a key question for anyone interested in long term access to our digital records to grapple with. Our world increasingly runs on software and hardware platforms. From operating streetlights and financial markets, to producing music and film, to conducting research and scholarship in the sciences and the humanities, software platforms shape and structure our lives. Software platforms are simultaneously a baseline infrastructure and a mode of creative expression. They are both the key to accessing and making sense of digital objects and increasingly important historical artifacts in their own right. When historians write the social, political, economic and cultural history of the 21st century they will need to consult the platforms of our times. As underscored already, even defining the boundaries of such works is itself a fraught and interpretive project. For this reason alone I firmly believe that digital preservation is a primary challenge which should pique the interest of digital humanists.

To recap, in work on the materiality of digital objects, in conceptions like screen essentialism, humanists are already providing critical information for those interested in collecting and preserving the digital record.

Examples like Dragan’s work with Geocities illustrate how there is considerable value in closer collaboration here, where scholars actually dig in and create special collections or critical editions of digital records to clarify the perspective taken in their collection.

Aside from this, I think there is one other key reason that digital primary sources should cry out for the attention of digital humanities.

The Born Digital Record is Already Computable 

When I opened my talk, I noted that to many, the digital humanities is synonymous with computational approaches to studying texts. Importantly, coming around from the other side of this, from the consideration of digital primary sources for digital preservation, we end up with far, far, far more computable data than the digitized corpora of historical texts that many of those interested in doing computational research in the humanities are working from.

Where historical works must be digitized, born digital media is by definition already computable. That is, when we gather together aggregations of data, be they web archives, aggregates of selfies from Instagram, or corpora of files from software packages, they are already computable.

In a talk about working with web archives, Historian Ian Milligan stated the problem concisely.

If history is to continue as the leading discipline in understanding the social and cultural past, decisive movement towards the digital is necessary. Every day most people generate born-digital information that if held in a traditional archive would form a sea of boxes, folders, and unstructured data. We need to be ready.

In short, the future of the computational humanities is itself going to be turning to the increasingly heterogeneous digital fonds, data sets, data dumps, corpora of software and images and logs of transactional data.


The Praxis of Digital Preservation

Dialog with each of these areas of work in the humanities is essential to the future of digital preservation.

What we need is a generation of conservators, archivists, and historians with extensive technical chops who realize just how contingent and complex deciding what bits to keep and how to go about keeping them is.

Digital objects, artifacts, texts, and data are something more than “content”; they are the material anchors, the primary sources, through which we can interpret, critique, and understand our society.

I firmly believe that ours should be a golden age for born-digital special collections, archives, troves and critical editions. The future of digital preservation is less about defining a hegemonic set of best practices than it is about scholars, curators, conservators and archivists working together to define what it is that they value about some kind of digital content and to then go out and collect it and make it available for use to their constituencies. It is about setting definitions that are often at odds with each other but that are coherent toward their own ends.


A Draft Style Guide for Digital Collection Hypertexts


The cover of A Signal from Mars: March and Two Step shows the rather civilized Martians relaying a piece of music to earthlings with the use of a spotlight. As featured in Messages to and From Outerspace. A Signal from Mars, 1901. Music Division, The Library of Congress.

I spent about 60% of my work hours last year selecting a thematic collection of 330 cultural heritage objects and interpreting and explicating facets of those objects in a set of 18 linked essays. I had a style guide for questions of grammar, and the HTML structure of the layouts was rather straightforward. However, I realized rather quickly that if I was going to do this consistently I should put together my own set of guidelines for the actual structure, function and style I would use for approaching this writing project. Nothing about this is formal or official or anything like that. These are just my own personal notes, thoughts and reflections that informed how I approached framing the work.

What follows is the short list of guidelines/rules for composing online exhibition-ish narrative pages for the web which I developed for my own use. Given some recent great discussion of what the ideal for history on the web should be, I figured I would share the rules I set for myself as they might be of use to others working in this form. Ultimately, in the collection objectives section I decided to call it a “hypertext,” which ideally expresses

The Chimera of the Digital Collection Hypertext

An online only interpretive presentation of representations of cultural heritage objects is something of a chimeric creature. It’s the sort of online collection/interpretive material that all kinds of folks develop when they use platforms like Omeka—ticky-tacky interpretive analytical writing and explication alongside a massive pile of related historical primary sources for users to go out and explore on their own.

  • Part Exhibition: Its purpose is similar to that of a physical museum exhibit, except that the restraints and benefits of physical space are absent. For example, an online exhibition can sprawl out forever, but you lose out on the quality of “being there” in the presence of the artifacts.
  • Part Illustrated Publication: As text and images on a web page, they are also like those “illustrated history” books, where one works through a linear narrative but can stop off to read detailed information about an image. In this case, the similarity falls off in that hypertext provides a much more networked and connective potential structure for an online text. Furthermore, while people do skim books, web reading is fundamentally different.
  • Part Expansive Collection of Sources: Where you only have the space to show an image on part of a page in a book, and there is a limit to what you can display in the physical space of an exhibit, on the web you can provide links out to every page in a draft or the whole audio recording.
  • All Hypertext: Ultimately, I think the most precise term for what these things are is hypertext. A term that sadly fell out of vogue with cyberspace a while back, but a term I think is worth going back to as HTTP is itself the defining logic and form of the web.

A Ready-to-hand Draft Style Guide

I had some web writing information to work with, but I ended up working up my own style guide-ish set of rules to work from for putting together these pieces. What follows is my rundown of rules (most of which I didn’t break much). As such, the intention of this set of guidelines was to try and take the ideas of exhibition and print publications that make extensive use of deep captions and figure out how they fit into the way the web writing works and people engage with the web. I feel like these served me well, and figured others might be interested in them. I’d similarly be interested in comments/discussion of these.

  1. Every narrative page stands on its own: The web is not a physical space and you have no control over what page someone will see first. The result of this fact is that a well conceived online exhibition narrative page needs to stand on its own. That means it needs to have a compelling title that includes key terms in the page, and that the text of a page cannot assume that a reader has read any other text in the exhibition. Every page is effectively the first page/front door for some set of potential users. It’s critical that the page stand on its own and invite users for further exploration at every turn.
  2. Every caption should explicate/interpret the image/object presented: Images, audio and moving image content need to be captioned in such a way that the captions explicate and interpret the items. It is not enough to simply say what something is; the caption should scaffold a visitor into seeing what is important about the artifact in this context. Ideally, the way the object is presented/cropped/edited suggests part of this, that is, it helps to actually show and not just tell. Part of the purpose of presenting these objects is to demonstrate reading and interpreting them. As such, they should not be extraneous. For example, if one wants to include a portrait of an individual one should not simply say it is a portrait of them. It’s necessary to suggest points in the work to read, like the way they are drawn or items they are holding and how those communicate something about how that individual is being represented in this case.
  3. Object captions should always stand on their own: The captions for objects presented should also stand on their own. Web readers skim and make use of images as a form of visual headings. As such the captions for those images should make enough sense on their own that visitors can use them as a different index to the content of the page.
  4. A new heading should break up text after every few paragraphs: Again, web writing is different from print writing in that web readers are far more likely to skim content. Good and frequent use of headings makes it easy to skim text and further hook readers to dig into the narrative content. Think more Associated Press style and less Chicago Manual of Style.
  5. An image from an item should always be visible as one scrolls through the page: The goal is showcasing the objects, so there should always be items from the collection visible on the screen at any given moment. This focuses attention on the items while also making the page easier to explore and read. Note: This is a particularly vexing thing to deal with in responsive design for mobile devices. I’d be curious for ideas about how this point should change in a mobile situation.
  6. Each page should be in the long blog post sweet spot–700-2000 words: This length makes them substantive enough to tell an interesting story and make a few important points but keeps them from being so long that they are difficult to explore briefly. If a piece is getting significantly longer than this it could likely be broken into smaller individual pieces, which would have the benefit of creating another page that serves as its own point of entry into the exhibition.
  7. Hyperlink text for connections and emphasis: Every two paragraphs should have at least one hyperlink connecting to an important concept in another section of the exhibit. The links underscore what matters in a given paragraph and make it easy for visitors to chart their own path through the exhibition. This is the primary power of hypertext as a medium. Think of how rich a Wikipedia entry is with links. The goal of this, and many of these guidelines, is to create a fertile network of connections that lets someone get lost in the content much like people do with Wikipedia. Ideally, item pages will record essays that link to them too, making each item itself into a potential point of entry to the presentation.
  8. Links should consistently connect out across subsections: Each page in the exhibit should ideally include at least one hyperlink to a page in a completely different section. Silos are bad, and history is not a straightforward progression of events. If you think different thematic sections of an exhibition are coherent enough to hang together there should be connections between individual pieces as you go.
  9. Show parts of items, link out to whole items: Unlike a physical exhibition you are not limited by the size of a frame, showing one page in a book, or putting a video on loop and hoping that people will stick around for it to come back again. Good exhibition narrative pages direct a visitor’s attention to features of items that are particularly interesting in a given context, but ideally that user is just a click away from looking at the whole of a work, or seeing things next to a given letter in a particular folder. There will be cases where this is impossible as either a strain on resources to digitize, or for rights reasons. With that noted, the ideal is to put up as whole a copy of any primary sources that can be integrated in their own right and not to simply crop photos to frame to illustrate the narrative.

 What do you think?

Are there things you would add, refine, or take off the list? Do you have any suggestions for other kinds of guidance that is worth integrating with this sort of thing? What thoughts do you have about how this sort of thing would change given different potential audiences? In short, I’m curious to hear what you think of all of this.


Redefining the “Life of the Mind” & the Infrastructure of Knowledge in the Digital Humanities Center

If you haven’t read Bethany Nowviskie’s recent post responding to the question “Does every research library need a digital humanities center?”, go do so. It’s really good. DH+Lib put out a call for further discussion/response to the issues Bethany raised, so I thought I would post a few quick comments here. This is something more than a tweet, but not necessarily as fully formed as some of my other blog posts.

Research Libraries as Infrastructure for Humanities Scholarship
To me, what is really exciting about the digital humanities is that a lot of the work in the field is actually about redefining what the products and process of scholarship should be. It’s not just about doing things and writing books and articles about them, it’s also about figuring out how everything from blogs, to web applications, to mobile apps, data sets, and a range of tools can themselves be scholarly products.

It’s a bit of a caricature and a gloss over a lot of the hybrid roles that libraries have played in scholarship, but I think the following is a functional definition of how many think about research libraries’ relationship to humanities scholars.

  1. Scholars use libraries as an access point to “the literature” (books and journal articles).
  2. Scholars then publish their work, adding to the literature.
  3. Then libraries collect that new work and the cycle repeats.

Again, there are a lot of awesome other things that research libraries do, but I’d suggest that this is the primary mode through which they are thought of. As an instrument for access to knowledge. In this bifurcation, the scholars live the life of the mind and make scholarship and the research library is the infrastructure that enables them to do so.

Redefining Products and Process of the Life of the Mind

The digital humanities centers I’m most excited about are an amazing kind of scholarly middle ground; places where scholars from different research traditions work alongside librarians, archivists, software engineers, system administrators, usability and human computer interaction experts and project managers to invent a new kind of knowledge infrastructure.

What is critical here is that the products of scholarship, the book and the article, are being called into question. The DH center as humanities skunk-works has significant implications for the idea of who serves whom and of what scholarship itself is, and it holds the potential for a significant reinvention of the roles of a range of information professionals in the work, labor, and life of the mind in research and scholarship.

Digital Humanities Centers Without Scholars

To illustrate just how independent this kind of activity can be from service to scholars, I’d suggest that one of the most successful centers of DH activity isn’t built to serve scholars as much as it’s built to serve the public. New York Public Library’s Lab, NYPL Labs, is a powerful example of what the possibilities are for the digital humanities in research libraries, in part because it’s not a service-to-researchers model at all. I imagine many wouldn’t classify NYPL Labs as a DH center at all, likely because it doesn’t have this kind of relationship with scholars. I’d argue that the fact that they consistently win grants from the Office of Digital Humanities is the best evidence that they are a DH center. If you look across their work you see the work of engaged and thoughtful creative professionals working on reinventing the infrastructure of knowledge and scholarship. That impulse in the digital humanities has considerable value to contribute to the core mission of research libraries.


Read my dissertation if you like: Designing Online Communities

I defended my dissertation today. If you’re at all interested you can read the draft I defended here. The event brings to an end about 23 years of continuous education. (I’ve been working full time for the last seven of those, but nonetheless, going to school for the last 23 years.) While it was accepted as is, I am still going to be doing some format tweaking and copyediting as it goes through its process to get its final signatures. Ultimately that final version will go into GMU’s digital repository. With that said, several folks were interested in reading the draft as it is now, so I figured I would share it here.


The Ideology, Rhetoric and Logic of Online Community Over Time

The diagram below is, by and large, the crux of the argument I ended up developing in the dissertation. For the most part, ideas of online community shift toward a communitarian set of language focused on electronic democracy in the early Web. That utopian vision is further and further undercut as it turns into a discourse of permission and control. The features of early online discussion systems harden into platforms like phpBB and vBulletin and ultimately pave the way for elaborate reputation systems in social networks. It’s a lot more complicated than that, so read the dissertation if that sounds interesting.

Crux of my dissertation


Title: Designing Online Communities: How Designers, Developers, Community Managers, and Software Structure Discourse and Knowledge Production on the Web

Abstract: Discussion on the Web is mediated through layers of software and protocols. As scholars increasingly study communication and learning on the web it is essential to consider how site administrators, programmers, and designers create interfaces and enable functionality. The managers, administrators, and designers of online communities can turn to more than 20 years of technical books for guidance on how to design online communities toward particular objectives. Through analysis of this “how-to” literature, this dissertation explores the discourse of design and configuration that partially structures online communities and later social networks. Tracking the history of notions of community in these books suggests the emergence of a logic of permission and control. Online community defies many conventional notions of community. Participants are increasingly treated as “users”, or even as commodities themselves to be used. Through consideration of the particular tactics of these administrators, this study suggests how researchers should approach the study and analysis of the records of online communities.




Curating Science, Software and Strides in Digital Stewardship: A Personal 2013 Year in Review

It’s that time of year. Time to take stock and provide an accounting. Looking back, all the themes I noted from 2012 carried through in 2013. That kind of continuity is itself exciting; it makes me think I’ve got a career/body of work emerging from what at times can feel like a flurry of activity and projects.

What follows is a quick rundown of things I’ve been working on. This includes work from the office, from school, and from those moments stolen away on the commuter train to work on a range of independent projects. In looking back I think I’ve spent a good bit of time focusing on the future of primary sources and scholarship in history, on infrastructure and strategy for digital stewardship, and on interpreting and presenting the history of science on the web.

Showing Bill Nye Carl Sagan's Papers, a personal highlight of the year.


Future History

Orchestrating the Preserving.exe Software Preservation Summit: I’m very proud of the software preservation summit I played a role in this year. It was great to be able to take an idea from its inception about a year and a half ago through to its completion. There was great lead up to the meeting on the Signal blog, including this interview with Henry Lowood on video game preservation at scale. Discussions and presentations at the summit were well received; I know everybody left with a lot of excitement about some of the collections being developed and the role that emulation and virtualization are likely to play in the future of access for these collections. I’m thrilled with how well the Preserving.exe report for the meeting came out.

Meditations on Digital Objects as Primary Sources: Continuing some of my work from last year, I wrote a bit about the future of significance and equivalence, about the recursive nature of items and collections, about traces, significance and preservation, and about connections between archival theory, stratigraphy and disk images, and I learned a ton doing this interview about historicizing digital preservation with perspectives from media studies and science and technology studies.

Three books that essays of mine appeared in this year: Writing History in the Digital Age, Playing with the Past, and Rhetoric, Composition, Play


Digital History and the Future of Historical Scholarship: I started this year remotely offering my perspectives as an early career digital historian at the annual meeting of the American Historical Association. I ended up throwing down a bit on the American Historical Association’s dissertation embargo statement and was asked to comment on the Organization of American Historians’ recent, similar statement. In short, I’m becoming increasingly interested in working on the modes through which historians access and work with primary sources and the kinds of scholarly communication products they create as a result.

Closing in on the Dissertation: Earlier this year I defended my dissertation proposal. If you are at all interested in the history of the design and rhetoric of online communities consider reading my proposal. I’m looking forward to carrying some of that thesis work forward into some of my job next year further exploring preserving online communities and the vernacular web. I’m thrilled to report that I have a full draft of my thesis in hand and that it has already gone through one round of review by my thesis committee. I’m looking at defending the thesis in the early spring. I won’t be embargoing it, so you can expect to be able to download it in full from GMU’s open access dissertation repository and here on my website as soon as it’s done.

Some scratches from my notebook where I was figuring out some themes for my dissertation conclusions.


Exhibition in and of the Digital Age: Alongside the Digital Preservation 2013 meeting, I had the chance to coordinate CURATEcamp Exhibition: Exhibition in and of the Digital Age. Together with my un-conference-chairs Michael Edson from the Smithsonian Institution and Sharon Leon from the Roy Rosenzweig Center for History and New Media I kept the plates spinning on a great and far ranging set of discussions on the future of exhibition. There were sessions on the future of online exhibits, on visualization as a mode of exhibition, on exhibition of born digital works, and a range of other issues. You can read notes from many of the sessions up on the CURATEcamp wiki. I’m still processing and digesting some of the ideas shaken loose from the camp, so expect more from me next year on some of the ideas and implications of those discussions. Some of this percolated up in thinking through a museum’s acquisition of an historic iPhone.

From Past Player to Past Editor: This year I took on the role of co-editor of Play the Past, alongside Shawn Graham. It’s been a lot of work, and I appreciate everything Ethan Watrall did to get the blog up and running and keep it running. When I started my primary goal was to get more activity through guest posts and getting new bloggers into the fold. I’m thrilled to have Angela Cox and David Hussey join the blog and contribute a lot of amazing work alongside a range of great guest posters. In short, I think we have seen a lot of great and diverse work on the blog and I’m looking forward to seeing where it goes into the future.

Infrastructures and Strategy for Digital Stewardship

Crowds & Roles for Public in Digital Library, Archives and Museum Projects: The year started off with the publication of a lot of my ideas on public participation in cultural heritage in Digital Cultural Heritage and the Crowd, in Curator: The Museum Journal. I interviewed Arfon Smith of Galaxy Zoo and the Adler Planetarium about the role of citizen science projects in digital stewardship and cultural heritage. I also wrote a bit about the role that citizen science projects can play in informing science education. My conversation with Mary Flanagan about her Metadata Games crowdsourcing platform ended up being one of the top Signal posts for the year. This year at THATcamp prime, a group of us thought through how crowdsourcing might be applied to explore images from inside the wealth of digitized books out there, and then actually stood up an instance of Metadata Games to run against images we stripped out of some Project Gutenberg books. I tried to spark some conversation about how cultural heritage orgs could shift their workflows to better anticipate activity of the crowd but it didn’t really go anywhere. Yet.

Open Source and Digital Stewardship: I had a nice set of interviews on the role of open source in digital preservation and stewardship come out. I talked with Peter Murray on when OSS is the right choice for cultural heritage orgs. Tom Cramer and I discussed the approach that Hydra is taking. I talked with Don Mennerich from NYPL about his work on born digital manuscript materials and got some of Cal Lee’s perspective on the same issue in this interview on BitCurator.

Pushing Out the Levels of Digital Preservation: Earlier this year saw the publication of the first version of the NDSA levels of digital preservation and a paper on them. It’s the result of a great little sub group of folks from NDSA member organizations and I think we have a lot to be proud of in it. I’ve been thrilled to see all the ways this guidance is being used to inform practice at organizations all over the place (ex. at USGS, ARTstor, TRC Canada, MetaArchive, and Mississippi’s Archives).

Contributing to the National Agenda for Digital Stewardship: I’m thrilled to have a part in shaping the first National Agenda for Digital Stewardship. I think the document is a real triumph for the NDSA, it outlines a lot of issues that matter and it’s unique in getting more than a hundred some organizations to speak with one voice about national priorities. As the co-chair of the NDSA Infrastructure working group, I had a hand in shaping a good bit of the infrastructure section.

Special Curator for a History of Science Project

This year I’ve been thrilled to have the chance to spend the bulk of my work time on a history of science project. The work is mostly finished, but it’s not out yet so I can’t talk about it much right now. But I can talk about a few pieces of that work that are public. 

The most important thing in the universe by L.M. Glackens. Cover from Puck, v. 60, November 7, 1906.

You can get a taste of some of the work I’ve been engaged in on two of the LC blogs. I’m rather happy with this piece I wrote about visions of earth from space before we went there, which was picked up by Smithsonian magazine and by Popular Science. I also wrote about the history of imaginary space ships.

I also wrote a series of pieces on how science teachers can use some historical astronomy items as teaching tools. I’m really happy with how each of these turned out.

Not officially a part of my work, but Marjee and I pitched a script for a Ted-Ed video called Is there a center of the universe? which I think turned out to be amazingly cool. 


Display for the Carl Sagan Event: As part of my work I was thrilled to curate a presentation of items from the Carl Sagan papers alongside some rare astronomy books and comics and prints to illustrate how Sagan’s papers fit into both historical and fictional ideas about life on other worlds in the Library of Congress collections. A high point there for me was when I got to show Bill Nye through some of the Sagan papers.



Mass Digitization, Archives, and a Multiplicity of Orders & Arrangements

Quick, drop everything and read All Text Considered: A Perspective on Mass Digitizing and Archival Processing. It helped me think through some of what I was getting into in Implications for Digital Collections Given Historian’s Research Practices.

The abstract of the paper does a great job of explaining its objective, “coupling robust collection-level descriptions to mass digitization and optical character recognition to provide full-text search of unprocessed and backlogged modern collections, bypassing archival processing and the creation of finding aids.” The key point in the piece is that it’s becoming plausible to see digitization costs as being on par with the actual processing costs of a collection. You can read this as an even more extreme take on MPLP, where digitization would potentially replace a significant part of the processing process itself. That is exciting/intriguing for a number of reasons, one of which is as a prompt for thinking through a different kind of future for archival description and access.

The possibility of actual original order and a multiplicity of orders

Most of archival original order ends up being its own kind of new order. So if/when you do get around to doing some form of arrangement it’s strictly intellectual arrangement; you do so without actually moving anything. That is, if you did still want to do processing you could do it on the digital files and then provide any number of different identifiers that resolve to the digital files. In essence, the information about original order and any further arrangement would be demoted from the central organizing factor to a relevant and important piece of metadata alongside any other pieces of metadata. So you have the order things came in and the order the archivist worked out after processing. One would likely do some coarse level of weeding and deaccessioning in many cases before digitizing, but then once digitized a processing archivist would be able to further decide which of the scanned files should be kept and what the permissions for viewing the images are. From there, you just set different permissions, say onsite access, reading room only access, dark archive for x years, complete public access. You could then just work from a blacklist/whitelist approach at whatever level of granularity an archive decided to process a given collection to. Not to mention, with OCRable archival material the OCR itself could be used to set up some heuristics for what kinds of materials to show to what users in what circumstances.
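To make the permissions idea a bit more concrete, here is a minimal sketch of one way it could work, assuming a collection-wide default access level plus a list of overrides keyed by path at whatever granularity was processed. The `Access` levels, the `CollectionPolicy` class, and the box/folder paths are all hypothetical names for illustration, not anything from an existing system.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical openness levels, from most to least restrictive."""
    DARK = 0          # dark archive, nobody can view for now
    READING_ROOM = 1  # viewable in the reading room only
    ONSITE = 2        # viewable anywhere onsite
    PUBLIC = 3        # complete public access


@dataclass
class CollectionPolicy:
    """A default level plus overrides for a box, folder, or single image."""
    default: Access
    overrides: dict = field(default_factory=dict)

    def level_for(self, image_path: str) -> Access:
        # The longest matching path prefix wins, so an image-level
        # override beats a folder-level one.
        best = None
        for prefix in self.overrides:
            if image_path == prefix or image_path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        return self.default if best is None else self.overrides[best]

    def can_view(self, image_path: str, viewer_location: Access) -> bool:
        # viewer_location is READING_ROOM, ONSITE, or PUBLIC (the open web).
        level = self.level_for(image_path)
        return level != Access.DARK and level >= viewer_location


policy = CollectionPolicy(default=Access.PUBLIC)
policy.overrides["box1/folder3"] = Access.READING_ROOM   # restrict one folder
policy.overrides["box1/folder3/0007"] = Access.DARK      # hold back one scan
```

With this setup, `policy.can_view("box1/folder3/0001", Access.PUBLIC)` is false while the same call with `Access.READING_ROOM` is true. The point of the design is that the unprocessed bulk of a collection gets one default and the archivist only records exceptions.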

The container list for an archive enforces a single linear hierarchy on the contents of the archive. Each sheet of paper can only be in one folder, in one box, in one series.


Linked Open Description

If the archive just commits to minting a URL structure then this process opens an exciting new future for description. That is, if every image has a URL, and the folder and collection are named in the URL (Ex http://institution.org/division/collection/series/box/folder/image ) then you (or anyone else for that matter) can create a range of descriptions and relationships of those digitized objects. If something comes in substantial disorder, like the Herbert A. Philbrick Papers, many of which came in the trash cans pictured here, then you just make a directory for the trash can and number the images based on the order you pull them out of the can. When you do go ahead and arrange the scans, you can do so while retaining the order they were pulled out of the trash can as a parallel, persistent metadata element.
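As a rough sketch of what minting those URLs could look like, the following assumes the placeholder host from the example above and numbers images in the order they come out of a container. The `mint_url` and `mint_received_order` helpers and the path segments `mss`, `philbrick`, and `trashcan-1` are made up for illustration.

```python
from urllib.parse import quote

BASE = "http://institution.org"  # placeholder host from the example above


def mint_url(*segments):
    """Join path segments (division, collection, ..., image) into one URL."""
    return BASE + "/" + "/".join(quote(str(s), safe="") for s in segments)


def mint_received_order(division, collection, container, count):
    """Number images by the order they were pulled out of a container."""
    return [
        mint_url(division, collection, container, f"{n:04d}")
        for n in range(1, count + 1)
    ]


# Hypothetical path segments; three scans from one trash can.
received = mint_received_order("mss", "philbrick", "trashcan-1", 3)

# Any later arrangement is just another ordered list of the same URLs,
# so the received order survives alongside it as metadata.
arrangements = {
    "received": received,
    "archivist": [received[2], received[0], received[1]],
}
```

Nothing moves when the archivist rearranges things. The `archivist` list points at the same three URLs as the `received` list, just in a different sequence.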

The net result is that you are no longer limited by the fact that one atom is stuck in one spot. You just index the content in as many ways as you like. Much like the chaotic storage principles at the heart of how Amazon’s warehouses are organized, you use the logic, structure and order of the database to transform the order of physical materials into something akin to the random access nature of a hard drive. The result:

  1. You get the benefit of not being limited by the fact that a thing can only be in one place at a time.
  2. You are also not limited to one linear/narrative/sequential way to find things.
  3. Anyone inside or outside an organization can then set up in-house or third-party services to let stewards/curators add any level of description to any arbitrary set of images. That is, internal and external agents could provide distinct data to organize and structure collection content, which the institution could choose to harvest and display to the extent they were interested. Since you are actually minting URLs you could then start to watch inbound links to your items from things like citations and pull those links in as a kind of descriptive trackback.

If everything is digitized and each image is given an ID then any number of different modes of arrangement could be minted and maintained referencing the images. Making it function much more like this distributed network. The Network by @nancywhite, CC-BY

Paralyzing or Paralleling Workflows for Archives

I think this could also help to break up much of the serial nature of workflows for cultural heritage orgs. That is, if you digitize everything and give the images persistent URLs that mean things then you could have any number of processes like arrangement, description, OCR, and even processes for automated description like topic modeling run against your materials in a much more parallel fashion. If we started giving persistent URLs to these images at the beginning of our workflows instead of at the end we could reap the benefit of running any number of jobs and processes against them simultaneously. Furthermore, these could happen on a rolling basis; that is, you wouldn’t need to wait for any one process to finish before moving on to another. I wrote a bit about this idea in Paralyzing or Paralleling Workflows for THATcamp leadership and a lot of these ideas came up and were discussed at CURATEcamp Processing: Processing Data/Processing Collections.
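Here is a small sketch of the parallel idea, assuming each process is just a function that takes an image URL. The two job functions are stand-ins that return placeholder values rather than real OCR or topic modeling, and `run_jobs` is a hypothetical helper built on Python’s standard `concurrent.futures` module.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def placeholder_ocr(url):
    """Stand-in for a real OCR job; returns dummy text."""
    return f"text of {url}"


def placeholder_topics(url):
    """Stand-in for an automated description job such as topic modeling."""
    return ["placeholder-topic"]


def run_jobs(urls, jobs, max_workers=4):
    """Run every job against every URL, collecting results as each finishes."""
    results = {url: {} for url in urls}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fn, url): (url, name)
            for url in urls
            for name, fn in jobs.items()
        }
        # as_completed yields on a rolling basis: no job waits on another.
        for future in as_completed(futures):
            url, name = futures[future]
            results[url][name] = future.result()
    return results


results = run_jobs(
    ["http://institution.org/division/collection/series/box/folder/0001",
     "http://institution.org/division/collection/series/box/folder/0002"],
    {"ocr": placeholder_ocr, "topics": placeholder_topics},
)
```

Because every result is keyed by the image’s URL, it does not matter which job finishes first or whether arrangement has happened yet. Each process just attaches what it learned to the same identifier.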

All Kinds of Cans of Worms Opened

All Text Considered: A Perspective on Mass Digitizing and Archival Processing opens all kinds of different cans of worms. For some kinds of materials, the prospect of digitization and OCR could make material accessible in shorter order. With that said, it throws open the doors to figuring out what exactly intellectual control means in those circumstances, what kind of further processing and arrangement one would want to do, and how to go about integrating automated techniques for summarizing and describing content that an archivist might use to complement and extend their efforts to make an archive’s structure legible to their users.

I’d love to hear your reactions to some of my provocations here and any other thoughts and reflections the essay prompts in discussion in the comments.

Thanks to Jefferson Bailey, Thomas Padilla, and Ed Summers for comments on a draft of this post. They each had some great ideas and input. I hope they’ll bring some of their more extended comments into the comments here.
