Month: May 2014

Mecha-Archivists: Envisioning the Role of Software in the Future of Archives

The Cybermen exemplify our worst fears about the future of technology: people literally turned into machines, replaced and ruled by machines. I think this is the face of a fear of a technological future of archives.

I had the privilege of participating in The Radcliffe Workshop on Technology and Archival Processing a few weeks back. I was thrilled to be on a great panel with some early career historians and Maureen Callahan.

Maureen posted her talk The Value of Archival Description Considered online. I encourage you to read it. It’s super good. I was thrilled to find that, I think, we are on nearly the exact same wavelength about the future of the finding aid.

There was a nice write up about the event in the Harvard Gazette. I won’t deny that I may be “a millennial who displayed affection for the word ‘awesome’ during the panel.” However, there are some clarifications I should make. I did not talk about obeying “cyborg overlords” or a “mechanized shirt of armor.” In sharing some of the points of my talk, I thought it would be good to focus in particular on parts of these clarifications. I think getting the language right about the future of our relationships with software is important, so here goes.

Maureen Welcomed the Robot Overlords, but with good reason!

Maureen had a few great lines in her talk (again, if you haven’t read it go do so now). One of those lines was her take on a Simpsons quote, “I for one welcome our robot overlords.” She went on to explain, in an even better line, “I don’t think that archivists are just secretaries for dead people, and I welcome as much automation as we can get for this kind of direct representation of what the records tell us about themselves.” I love this quote. When I was sitting there listening to her I was nodding so much. This is exactly the sentiment I wanted to get at.

The future of digital tools for archives is not replacing the work. It is automating the parts of the work that are not the intellectual labor. Along with that, the future of these tools is largely about taking advantage of the affordances in the nature, structure and order of digital media which give us considerable power to scale up our actions and interventions in the record.

I took the key theme from her pitch to be something like, let the algorithms and digital tools do the repetitive and less intellectual labor of the archivist, and get the archivist more involved in the intellectual labor of the archives. Specifically, in better contextualizing, explaining and describing the provenance of collections and making the decisions that require the kind of sophisticated judgment that people have and exercise. Without knowing where she was going, I touched on several similar themes in my talk. Ideas and visions of the labor relationship between the archivist of the future and the algorithms, scripts and tools that work for her and do her bidding.

The welcoming of Robot overlords

We get to wear the robots!

This Lego mecha exo-suit is the vision I think we want for the future of digital tools in archives. Here, this mechanized power armor gives the archivist super powers. Forget lifting a 30 lb box; in this suit you could move whole collections with ease. But that’s beside the point. This kind of power tool lets you do a lot of the laborious parts of the work and get back more quickly to the intellectual labors.

So we don’t want the dark vision of the robot master. We certainly don’t want the machines to turn us into the Borg or Cybermen, who lose their souls as they are taken over by the emotionless machine.

My vision for the future of the archivist using digital tools is less Borg and more Exo-suit.

The idea of mecha or exo-suits illustrates a vision of technology that extends the capabilities of its user. That is, the kinds of tools I think we need going forward are exactly the sort of thing that Maureen was talking about. Things that let us automate a range of processes and actions.

We need tools that let us quickly work across massive amounts of items and objects by extending and amplifying the seasoned judgment, ethics, wisdom, and expertise of the archivist-in-the-machine.

Fondz as a Tool Thought Experiment for Automation

I was recently working with some archivists who had a project where they had nearly 400 floppy disks containing drafts of letters, books, essays, etc. In short, digital copies of all the kinds of things you find in a collection of someone’s personal papers. I hope to write about that project in more detail in the future, but for now I just wanted to talk a little about a tool that got cooked up in the process. So, what can you do with some 19,000 documents like this? Now, you can learn a ton about a set of digital files by extracting and identifying them in automated processes. That is, what kinds of files they are, their file names, size, etc. It’s really useful data! However, in most cases, this is not at all the data that a researcher or other user who might work with the collection would want. Inevitably, users want to know where information related to x, y, or z is in a collection. That is, users care about topics and subjects, and the kinds of tools most of us have at hand don’t really do much with that.

Here you can see some of the very basic kinds of information that are relatively easy to get at with existing tools: numbers of files, their sizes and their formats. This image shows that the files processed and presented by Fondz in a particular test set come from 379 bags (in this case each bag contains a logical disk image). Collectively these include 18,414 files in 49 formats.
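To make that concrete, here is a minimal sketch of the kind of basic inventory described above: counting files, adding up their sizes, and tallying formats. It is not how Fondz does it. The function name inventory is made up for illustration, and it uses the file extension as a crude stand-in for real format identification, which a production tool would do by inspecting file contents.

```python
import os
from collections import Counter


def inventory(root):
    """Walk a directory tree and tally file count, total bytes, and formats.

    Hypothetical helper for illustration only. "Format" here is just the
    lowercased file extension, an assumption that real format
    identification tools do not rely on.
    """
    formats = Counter()
    total_files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            total_files += 1
            total_bytes += os.path.getsize(path)
            # Files with no extension are grouped under "(none)".
            ext = os.path.splitext(name)[1].lower() or "(none)"
            formats[ext] += 1
    return {"files": total_files, "bytes": total_bytes, "formats": formats}
```

Calling inventory on a directory of extracted disk images and printing result["formats"].most_common() would give a rough table of formats by count. Useful data, but, as noted above, not what most researchers are actually looking for.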

To this end, I asked my colleague Ed Summers a while back if it would be possible to strip out all the text from these documents, topic model it, and then use the topic models as an interface to the documents. In response, he cooked up a tool called Fondz.

For those unfamiliar, the MAchine Learning for LanguagE Toolkit (MALLET) describes topic modeling as follows. “Topic models provide a simple way to analyze large volumes of unlabeled text. A ‘topic’ consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.” In this case a tool like MALLET can quickly look across a large collection of texts and identify topical clusters of terms that appear near each other.
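For anyone curious what “finding clusters of words that occur together” can look like in code, here is a toy sketch of one common approach to topic modeling (a collapsed Gibbs sampler for an LDA-style model). It is emphatically not MALLET’s implementation or the code in Fondz. The function name toy_topic_model and its parameters (alpha, beta, iterations, seed) are my own choices for illustration, and it assumes the documents have already been split into lists of words.

```python
import random


def toy_topic_model(docs, num_topics, iterations=200, alpha=0.1, beta=0.01,
                    seed=0, top_n=5):
    """Toy collapsed Gibbs sampler for an LDA-style topic model.

    docs is a list of token lists. Returns (top_words, doc_topic), where
    top_words[k] lists the most frequent words assigned to topic k and
    doc_topic[d][k] counts the words in document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start by assigning every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each word's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A topic is likelier if it is common in this document
                # and this word is common in that topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    top_words = []
    for k in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda i: -topic_word[k][i])
        top_words.append([vocab[i] for i in ranked[:top_n]])
    return top_words, doc_topic
```

Notice that nothing in here knows what a word means. The “topics” that come out are just groups of words that the sampler found it convenient to put together, and num_topics is a dial that a person has to set.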

How Edsu describes Fondz on GitHub.

I really like how Ed describes Fondz, so I’ll share it here.

fondz is a command line tool for auto-generating an “archival description” for a set of born digital content found in a bag or series of bags. The name fondz was borrowed from a humorous take on the archival principle of provenance or respect des fonds. fondz works best if you point it at a collection of content that has some thematic unity, such as a collection associated with an individual, family or organization.

Example of the Fondz topic driven interface to documents in an archival collection

Above, you can see an example of Fondz in use. This is a list of the topics that MALLET identified. In each case you see the number of documents associated with the topic on the left, and in the blue box you see the terms which MALLET has identified as being associated with that topic. The first one, with 776 documents, ends up being a cluster of files that are versions of biographical notes and CVs; the third one, with 309 documents, is materials related to a novel and a film adaptation of that novel. MALLET doesn’t know what those topics are. It just sees clusters of terms. Based on my knowledge of the collection, I’m able to identify and name those clusters.
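The listing described above, with a document count next to each topic, can be sketched in a few lines. This is a guess at the general shape, not Fondz’s actual logic. It assumes you already have, for each file, a list of topic weights, and it assumes each file is filed under its single heaviest topic. The function names and the file names in the usage note are made up.

```python
from collections import defaultdict


def documents_by_topic(doc_topic_weights):
    """Group file names under their single heaviest topic.

    doc_topic_weights maps a file name to a list of per-topic weights.
    Filing each document under only one topic is an assumption made for
    this sketch; a real tool might list a document under several topics.
    """
    index = defaultdict(list)
    for filename, weights in doc_topic_weights.items():
        dominant = max(range(len(weights)), key=lambda k: weights[k])
        index[dominant].append(filename)
    return dict(index)


def topic_counts(index):
    """Return the number of documents filed under each topic."""
    return {topic: len(files) for topic, files in index.items()}
```

Given weights like {"cv_1994.txt": [0.9, 0.1], "novel_draft.txt": [0.2, 0.8]}, documents_by_topic files the first under topic 0 and the second under topic 1, and topic_counts gives the numbers you would show on the left of each row.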

The result of all this is a topical point of entry to explore 19,000 digital files from hundreds of floppies. It would work just as well for OCR’ed text from recent typed and printed text. I can’t show it to you in action because I don’t have a test collection that I can broadly share. (Note: if you have a similar collection that you can broadly share, please contact me about it.) But take my word for it. You click on one of those topics and you see a list of all the files that are associated with it, and if you click on the name of one of those files you end up seeing an HTML representation of all the text inside that file. Alongside this, a future idea would be to integrate tools that do things like Named Entity Recognition (NER) to identify strings of text that look like names of people and places. Indeed, there are already attempts to use NER for disambiguation in cultural heritage collections. What is particularly important here is not that we build tools that do this “right” but that we find and use tools that make things that are “good enough” in that they are useful in helping people explore and find things in collections. This isn’t about robots just doing all the work. It’s about extending and amplifying our ability to make materials available to users in ways that help them “get to the stuff.” Aside from that, there is a need to provide users with information on what actions were performed on the collection to make it available. To that end, it’s exciting to realize that we can simply document what tools were used so that anyone can explore the potential biases of those tools in how they create interfaces to collection data.
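To give a feel for what “good enough” might mean for the name-spotting idea mentioned above, here is a deliberately crude sketch. It is not real Named Entity Recognition, which relies on trained models. It simply assumes that a run of two or more capitalized words might be a name, and the function name candidate_names is made up for illustration.

```python
import re
from collections import Counter

# Two or more capitalized words in a row, e.g. "Maureen Callahan".
# This heuristic is an assumption for the sketch: it misses one-word
# names and will happily match things that are not names at all.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text):
    """Count runs of capitalized words that might be names."""
    counts = Counter()
    for match in CAPITALIZED_RUN.findall(text):
        # Collapse any line breaks or repeated spaces inside the match.
        counts[" ".join(match.split())] += 1
    return counts
```

Something this blunt will produce plenty of false hits, which is rather the point: the output is a list of leads for a person to review, not a finished index.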

So what does this all have to do with cyborgs and mecha? What is in some ways most interesting to me about topic modeling is that the topics themselves are actually somewhat arbitrary and meaningless. A topic in MALLET isn’t so much a topic in regular parlance as it is just a cluster of words that tend to appear together. It takes someone who knows the texts to make sense of those topics, to fiddle with the dials till they get topics that seem to hang together right (in MALLET you pick how many topics you want it to look for). So Fondz will be far more useful when it integrates processes for archivists to exercise their expertise and their judgment and intervene. When they can name the topics and describe them. When they can accept or reject some of the topics, when they can rerun them.
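As a thought experiment, the review step described above could be as simple as attaching an archivist’s decision to each machine-generated cluster. The sketch below is purely hypothetical and does not reflect anything Fondz actually does. The function names, field names, and status values are all made up for illustration.

```python
def new_topic(terms):
    """Wrap a machine-generated cluster of terms in a reviewable record."""
    return {"terms": list(terms), "label": "", "note": "", "status": "unreviewed"}


def accept(topic, label, note=""):
    """Record that the archivist recognized the cluster and named it."""
    topic.update(label=label, note=note, status="accepted")
    return topic


def reject(topic, note=""):
    """Record that the archivist judged the cluster not useful."""
    topic.update(note=note, status="rejected")
    return topic


def published_topics(topics):
    """Only topics a person has accepted and named are shown to users."""
    return [t for t in topics if t["status"] == "accepted"]
```

For example, an archivist might accept a cluster of terms like “university, education, publications” with the label “Biographical notes and CVs”, reject a cluster made up of leftover formatting codes, and then rerun the model with a different number of topics if too many clusters get rejected.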

Since the goal here is to make useful descriptions, there is a potential for topic modeling to be used instrumentally to surface connections for an archivist to find useful or not useful, and to save the useful ones and describe them. Given that good processing is done with a shovel, not with tweezers, it is exciting to think about how tools like Fondz could integrate a range of techniques for computational analysis of the content of files to act as steam shovels: instruments that put the archivist in the driver’s seat to explore and work through relationships in collection materials and expose those to users.

There are a bunch of other clever things that Ed is doing with Fondz that warrant further discussion, but for the purpose of this post that does it. As far as take-away messages go, I’d suggest the following. The future of digital tools for digital archives is not about tools that “just work.” It’s not about replacing the work of archivists with automated processes. It’s about amplifying and extending the capabilities of an archivist to do clever things with somewhat blunt instruments (like topic modeling, NER, etc.) that make it easier for us to make materials accessible. Given that the nature of digital objects is a multiplicity of orders and arrangements, if we can generate a range of relatively quick and dirty points of entry to materials we can invest more time and energy in making sure that when someone gets down to the item they have breadcrumbs and information that situates and contextualizes the item in its collection and its custodial history. We need archival-mecha, tools that give archivists superpowers by amplifying their judgment, wisdom, knowledge, ethics and expertise in working with digital materials. We need to make sure we are getting the computers to do what computers do best in supporting the praxis of archival practice.

May 27, 2014
Einstein as Science Santa: Monumental Meanings & Wil Wheaton

Recently, Wil Wheaton posted a picture and quote on Twitter and his blog (That time I met Albert Einstein) making use of the Albert Einstein Memorial at the National Academy of Sciences. It’s great: he is sitting on Einstein’s lap, making requests to Einstein as a kind of physics Santa. I really love how the post, and all the likes and favorites it has gotten, reinforce a set of points I made about the memorial in my essay Tripadvisor rates Einstein: using the social web to unpack the public meanings of a cultural heritage site.

I love the way this photo fits with much of the informal and playful ways that other photos of the monument work. Here is a bit of some of what I wrote in the piece on some photos of the monument on Flickr. The images are from Flickr, and the quotes from Yelp reviews of the memorial.

Most monuments in the area establish a kind of formality between visitors and the monument. Many are constructed to physically remove the subject from the reach of visitors. Others, like the nearby Lincoln Memorial, establish this formality through written rules about respectful behaviour, and a request for hushed voices. Nearly all of the reviews (17 of 21) focus on elements of the informality of the monument as a key component of what makes it enjoyable. The reviewers tell us to “climb all over ‘Al’” or as another suggests “sit on his lap, or kiss his cheek”. On Flickr, photographers have captured this in images of visitors picking and rubbing his nose, kissing him, or in a few cases arguing with him. While there are no posted notices which suggest that it is ok to climb him, if you stop by the monument on any summer day you will witness a queue of visitors waiting to climb up on him and have their picture taken.

An example of how groups of tourists use the memorial to stage group photos

The pictures are themselves an important element in this experience. The image above provides an example of one of the most popular kinds of images of the memorial posted on Flickr. As one reviewer notes, “everyone needs at least one picture of themselves sitting on ‘Al’s’ lap”. As you can see from the photograph, the scale and size of the monument make it work as a space for staging photos. The monument is so photogenic that one reviewer suggests that it “just begs you to go sit on Uncle Al’s lap and get our picture taken”. For these reviewers a central part of the experience is the informality that the monument provides. It invites them to climb him, and leave with photographic evidence of them sitting on the world’s most instantly recognisable scientist. While everyone has photos of themselves standing in front of the Lincoln Memorial, these reviewers believe “Your tour of the Mall is not complete” without having your picture taken on Einstein’s lap.

Photo: Schmidt, C., 2008. Arguing with Einstein, Available at: http://www.flickr.com/photos/chrisbrenschmidt/2190660089/

It is worth taking a moment to reflect on how some of the previous quotes refer to Einstein. The informality of these experiences is further communicated through a persistent use of his first name, or in some cases the diminutive form of his name, Al. This is itself a frequent component of these reviews. In using his first name, or calling him ‘Al’, the reviewers are communicating and playing with the informality of the memorial. The pervasiveness of this informality may be best evidenced in the recollections of a college student from a nearby university who ‘spent a lot of time just hanging out with ‘Al’’. The informality of the space and the fact that it is climbable leads many reviewers to discuss how it is a perfect place to bring kids. Many of the photos of the monument on Flickr show young children climbing all over him.

This level of informality is not something that all the reviewers think is necessarily a good thing. One reviewer suggests “most of the neat stuff was totally ignored by all the kids using the statue as a playground”. This reviewer goes on to suggest that the other elements in the composition of the statue, the quotations, and the map of the stars at his feet go unnoticed. From his perspective, visitors were “just jumping around”. He felt that “no one learned or read about the man memorialised”. This reviewer further suggests that it is ‘disrespectful’ to climb all over the monument, particularly when there is no clear indication that touching or climbing the memorial is officially sanctioned by the sculptor or the National Academy of Sciences. There is definitely credence to the questions the reviewer raises. To what extent are these visitors leaving with an understanding of the intentions behind the memorial? Certainly some visitors’ suggestions that “You can climb on the damn thing and stick pennies up his nose” take on a disrespectful tone. However, that is itself an interesting point of tension in the idea of Einstein. The more recently constructed Franklin Delano Roosevelt Memorial, which is built on a scale that would allow one to climb on him, does not invite the same kind of interaction. Popular notions of Einstein as an informal figure have translated into how people interact with the memorial. The relaxed experience Berks found in sculpting the memorial from life is very directly translated into visitors’ comments about the informality and relaxing nature of the experience of the monument.

This is just a way of saying that those of us interested in public memory and the role of memorials really need to be watching the ways that people make use and sense of them on social media. At this point, our experiences of these spaces are increasingly going to be seen through the lens of the tweets, reviews, and photos that others have taken, shared, and commented on.

May 8, 2014