When and how did “archive” become a verb?

Archives are places. They are institutions. But to archive is also an action. Web archiving is a process that produces web archives, and personal digital archiving is a set of practices for working to ensure long-term access to personal digital content.

When and how did archive become a verb? Webster’s dates the noun usage to 1603 and the verb usage to 1831, but I’m curious how obscure the verb usage was over time.

My sense/hunch has been that the verb form of archive is tied up in the history of computing. A tape archive is a higher-latency storage mechanism. There is a long-standing use of “archiving” as a concept that involves writing to tape. The term tape is itself part of the name of .tar files. So, when did archive become a verb and to what extent is archiving related to the development of computing?

This kind of question is exactly the sort of thing that Google n-gram is useful for. Over time I’ve generated a few different graphs of trends around the verb usage of archive in Google Books and posted them to Twitter. It seemed like it would be worth taking a few minutes to explore that data a bit more. What follows is really just some initial notes on some searches. I’m curious to get other interpretations of what we learn from these charts and examples of usage.

When the archived and began archiving

The graph below shows trends in usage of archive in the Google Books corpus from 1920 to 2000. Overall, it would appear that the relative frequency of the term archive in the corpus has grown a good bit over time.

If we take out the noun form of archive and extend this back to 1800, you can see that there are a tiny number of examples of the verb forms going all the way back to the beginning of the chart in 1800, but that things don’t really start to take off until the late 1960s.
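For anyone who wants to try this kind of comparison on their own texts, here is a minimal sketch of what “relative frequency” means in charts like these: roughly, the share of all words in a given year that are one of the verb forms. This is not how Google computes its charts. The function name, the toy corpus, and the choice to track only “archived” and “archiving” are all my assumptions for illustration.

```python
from collections import Counter

# Assumed set of verb forms to track; "archive" itself is left out
# because it is usually the noun.
VERB_FORMS = {"archived", "archiving"}


def relative_frequency(docs_by_year):
    """Return {year: share of tokens that are tracked verb forms}."""
    result = {}
    for year, texts in sorted(docs_by_year.items()):
        counts = Counter()
        for text in texts:
            # Very rough tokenizer: split on whitespace, trim punctuation.
            counts.update(
                word.strip(".,;:!?\"'()").lower() for word in text.split()
            )
        total = sum(counts.values())
        hits = sum(counts[form] for form in VERB_FORMS)
        result[year] = hits / total if total else 0.0
    return result


# Made-up sentences standing in for a real corpus.
toy_corpus = {
    1823: ["The deed was archived in the county office."],
    1968: ["Archiving climatological data requires tape.",
           "The data were archived."],
}
freq = relative_frequency(toy_corpus)
```

Dividing by the total number of words in each year matters because far more books survive from 1968 than from 1823. Raw counts alone would show growth for almost any word.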

The Emergence of “Archiving” 

One of the best parts of Google n-gram is that it is a book search tool as much as it is a visualization tool. That means that we can poke around and see the examples where these different usages emerge.

Below is an example of one of the first instances of the term “archiving” connected to the term “data” in the Google Books corpus. It’s from a 1968 appropriations hearing for a climatological data center. That places it right at the inflection point for the verb form usage of archive.

From that point on, the term archiving seems to appear primarily in relationship to computing. In Google Books results from the 1970s, all of the examples of the term archiving are references to data.

With that noted, there are two examples from 1969 that involve using the term archiving in relationship to folklore.

A longer past for archiving and archived

As noted at the beginning of this post, Webster’s suggests the verb form of archive came about in 1831. A range of examples of “archived” show up even earlier than 1831, for instance the example below from 1823 or this other example from 1816.

The snippet below is one of a series of documents from the turn of the 20th century in Texas that use the terms archiving and archived. They appear to be largely related to usage of the term in the “Constitutions of Texas.”

There are even a few other earlier examples of “archiving” that show up, like this 1913 report from the Nevada Historical Society, which lists a “need of better archiving” in a heading in its table of contents.

So when did archive become a verb? 

It would appear that archive has been a verb for more than two hundred years. With that noted, it does also largely appear to be the case that the verb usage didn’t really come into broader usage until the late 1960s when it was largely associated with data and computing.

I’m curious to see what other examples or perspectives others have though. I was a bit surprised to surface some of these earlier examples of uses of archiving and archived.

Archives as Discovery Zones

I loved Discovery Zone when I was a kid. If you’re unfamiliar, it was this amazingly massive pile of kid-sized tubes and ball pits. Like someone took that part of Chuck E. Cheese by the ball pit and multiplied it by a magnitude of awesome. Parents didn’t really fit up in that network of tubes, and there was a giant rope spiderweb suspended in the air. And the slides. So many slides.

"Sacha at Discovery Zone 1996" that is what it felt like

It was its own little world up there. You made the face that Sacha is making over there in that CC-licensed picture I found on Flickr. The first time I saw an ad for it I just had to go there. I had to climb up into that wondrous world.

At this point, I think there are substantive parallels between the sense of discovery that happened there, that sense of wonder and of exploration, and visiting an archive. Not only do I think that we discover things in archives, I think of archives as discovery zones.

What made Discovery Zone so cool is that they took all of this stuff that was fun to play on and then strategically and systematically organized and arranged it in such a way that it became something new. It became the Discovery Zone. It was an engineered place for exploratory play and imagination. The work that goes into an archive, deciding what to keep, how to arrange and organize it, how to tell folks what you have, is what makes the archive a discovery zone. Now as an adult I don’t really have a desire to climb around a series of tubes. With that said, archives now offer me a similar sense of wonder. They are a place to encounter strategically organized and engineered traces of the past. Each Hollinger box offers another chance to discover something we didn’t really know, or something no one had thought about in the way I might think about it, or just the opportunity to touch and connect directly with a physical trace of another world.

What does and doesn’t count as discovery

This set of memories of Discovery Zone is brought to you by a recent back and forth around the discovery of an Army surgeon’s first-person report from treating President Lincoln immediately after he was shot in Ford’s Theatre. Suzanne Fischer suggested “If You ‘Discover’ Something in an Archive, It’s Not a Discovery.” At the heart of the post is a wish that there might be “more articles headlined ‘Thorough, Accurate Cataloging Pays Off!’” I wholeheartedly agree with the latter statement. To this, Ed Summers responds “Saying that there is no discovery in libraries and archives, because all the discovery has been pre-coordinated by librarians and archivists is putting the case for the work we do too strongly.” I also agree with this; archives are places to explore and discover, and celebrating moments when we find things we didn’t know in archives helps broadcast the invitation to discover. Lastly, Helena Iles Papaioannou, the discoverer in this case, has declared “Actually, Yes, It *Is* a Discovery If You Find Something in an Archive That No One Knew Was There.” I’m not so much interested in the exact details in this case as I am in the broader question of what kinds of discovery happen in the archive.

Different kinds of historical discoveries
Very rarely, we can discover an object of historical importance in a place no one would have expected it to be. Hidden in the walls, out in someone’s attic, buried deep underground; these are all places where one can discover a thing. These kinds of discoveries are really exciting, but most discovery happens in the exploration of materials that have been carefully organized and arranged.

In contrast, there is a kind of discovery that happens when one closely reads documents that we already knew were there but that no one had spent the time to extensively analyze. The tried and true example in this case would be something like Martha Ballard’s diary. People have known about the thing for a long time and kept it around, but not until Laurel Thatcher Ulrich turned her attention to it, with a very different frame of reference, did we all discover how amazing a text it is for understanding her life. I would call this kind of discovery finding/building new knowledge. This kind of discovery in history generally happens when researchers use carefully organized, weeded, and arranged collections which have been processed and taken care of by archivists, librarians and curators.

Lastly, I would add that there is a sense of personal discovery. While there is a lot that is known, we all come to know it individually and it comes to mean something to us as individuals. I have spent a good bit of time in archives and I don’t think anything I’ve done has constituted discovery. Some of what I have found has been interesting enough to publish.

From Dusty Archive to Archive as Discovery Zone
So I love the sentiment behind Suzanne’s post: doing all we can to banish “dusty archive” from our vocabulary is a good idea. This is particularly true in times like this, when budgets are stretched and people are looking for things to cut. It is really important that we try to foreground the critical role that libraries, archives, and museums play in gathering, organizing, exhibiting, preserving, interpreting and providing access to cultural heritage. With that said, I think the message that we want to send is better delivered as “hey guys, come inside, we have all kinds of stuff that we have organized and consciously developed as coherent collections, come in and make discoveries.” Much like Discovery Zone, archives are these amazingly cool places where all kinds of rich historical information is at your fingertips.

This all leaves me with a continuing question. How do we strike the balance between recognizing, paying tribute to, and celebrating the work of those who collect, preserve, exhibit, and organize traces of the past while also making clear that there is so much that could be learned by those who come and explore?

Human Computation and Wisdom of Crowds in Cultural Heritage

Libraries, archives and museums have a long history of participation and engagement with members of the public. In my last post, I charted some problems with terminology, suggesting that the cultural heritage community can re-frame crowdsourcing as engaging with an audience of committed volunteers. In this post, I get a bit more specific about the two different activities that get lumped together when we talk about crowdsourcing. I’ve included a series of examples and a bit of history and context for good measure.

For the most part, when folks talk about crowdsourcing they are generally talking about two different kinds of activities, human computation and the wisdom of crowds.

Human Computation

Human Computation is grounded in the fact that human beings are able to process particular kinds of information and make judgments in ways that computers can’t. To this end, there are a range of projects that are described as crowdsourcing that are anchored in the idea of treating people as processors. The best way to explain the concept is through a few examples of the role human computation plays in crowdsourcing.

ReCaptcha is a great example of how the processing power of humans can be harnessed to improve cultural heritage collection data. Most readers will be familiar with the little ReCaptcha boxes we fill out when we need to prove that we are in fact a person and not an automated system attempting to log in to some site. Our ability to read the strange and messed up text in those little boxes proves that we are people, but in the case of ReCaptcha it also helps correct the OCR’ed text of digitized New York Times and Google Books transcripts. The same capability that allows people to be differentiated from machines is what allows us to help improve the full text search of the digitized New York Times and Google Books collections.

The principles of human computation are similarly on display in the Google Image Labeler. From 2006 to 2011 the Google Image Labeler game invited members of the public to describe and classify images. For example, in the image below a player is viewing an image of a red car. Somewhere else in the world another player is also viewing that image. Each player is invited to key in labels for the image, apart from a series of “off-limits” words which have already been associated with the image. Each label I enter which matches a label entered by the other player results in points in the game. The game has inspired an open source version specifically designed for use at cultural heritage organizations. The design of this interaction is such that, in most cases, it results in generating high quality descriptions of images.
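The matching rule at the heart of the game is simple enough to sketch in a few lines. What follows is my own illustration of the idea described above, not Google’s implementation: the class name, the normalization step, and the points value are all assumptions.

```python
from dataclasses import dataclass, field


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so 'Red  Car' matches 'red car'."""
    return " ".join(label.lower().split())


@dataclass
class LabelingRound:
    """One image shown to two players at once (hypothetical structure)."""
    image_id: str
    off_limits: set[str] = field(default_factory=set)
    points_per_match: int = 100  # assumed value, not the real game's scoring

    def score(self, labels_a, labels_b):
        """Return the labels both players agreed on and the points earned."""
        banned = {normalize(word) for word in self.off_limits}
        # Off-limits words are dropped before looking for agreement.
        a = {normalize(label) for label in labels_a} - banned
        b = {normalize(label) for label in labels_b} - banned
        agreed = a & b
        return agreed, len(agreed) * self.points_per_match


# "car" is already associated with the image, so it earns nothing.
round_ = LabelingRound(image_id="img-001", off_limits={"car"})
agreed, points = round_.score(
    ["Car", "red", "convertible", "road"],
    ["red", "sports car", "Convertible"],
)
```

Only labels that two strangers independently agree on are kept, which is where the quality comes from. The off-limits list pushes players past the obvious words toward more specific description.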

Both the Image Labeler and ReCaptcha are fundamentally about tapping into the capabilities of people to process information. Where I had earlier suggested that the kind of crowdsourcing I want us to be thinking about is not about labor, these kinds of human computation projects are often fundamentally about labor. This is most clearly visible in Amazon’s Mechanical Turk project.

The tagline for Mechanical Turk is that it “gives businesses and developers access to an on-demand, scalable workforce” where “workers select from thousands of tasks and work whenever it’s convenient.” The labor focus of this site should give pause to those in the cultural heritage sector, particularly those working for public institutions. There are very legitimate concerns about this kind of labor serving as a kind of “digital sweatshop.”

While there are legitimate concerns about the potentially exploitive properties of projects like Mechanical Turk, it is important to realize that many of the same human computation activities which one could run through Mechanical Turk are not really the same kind of labor when they are situated as projects of citizen science.

For example, Galaxy Zoo invites individuals to identify galaxies. The activity is basically the same as the Google Image Labeler game. Users are presented with an image of a galaxy and invited to classify it based on a simple set of taxonomic information. While the interaction is more or less the same, the change in context is essential.

Galaxy Zoo invites amateur astronomers to help classify images of galaxies. While the image identification task here is more or less the same as the image identification tasks previously discussed, at least in the early stages of the project, this site often gave amateur astronomers the first opportunity to see these stellar objects. These images were all captured by a robotic telescope, so the first Galaxy Zoo participants who looked at these images were actually the first humans ever to see them. Think about how powerful that is.

In this case, the amateurs who catalog these galaxies do so because they want to contribute to science. Beyond engaging in this classification activity, the Galaxy Zoo project also invites members to discuss the galaxies in a discussion forum. This discussion forum ends up representing a very different kind of crowdsourcing, one based not so much on the idea of human computation but instead on a notion which I refer to here as the wisdom of crowds.

The Wisdom of Crowds, or Why Wasn’t I Consulted

The Wisdom of Crowds comes from James Surowiecki’s grandiosely titled 2004 book, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. In the book, Surowiecki talks about a range of examples of how crowds of people can create important and valuable kinds of knowledge. Unlike human computation, the wisdom of crowds is not about highly structured activities. In Surowiecki’s argument, the wisdom of crowds is an emergent phenomenon resulting from how discussion and interaction platforms, like wikis, enable individuals to add and edit each other’s work.

The wisdom of crowds notion tends to come with a bit too much utopian baggage for my tastes. I find Paul Ford’s reformulation of this notion particularly compelling. Ford suggests that the heart of this matter is that the web, unlike other mediums, is particularly well suited to answer the question “Why wasn’t I consulted?” It is worth quoting him here at length:

“Why wasn’t I consulted,” which I abbreviate as WWIC, is the fundamental question of the web. It is the rule from which other rules are derived. Humans have a fundamental need to be consulted, engaged, to exercise their knowledge (and thus power), and no other medium that came before has been able to tap into that as effectively.

He goes on to explain a series of projects that succeed because of their ability to tap into this human desire to be consulted.

If you tap into the human need to be consulted you can get some interesting reactions. Here are a few: Wikipedia, StackOverflow, Hunch, Reddit, MetaFilter, YouTube, Twitter, StumbleUpon, About, Quora, Ebay, Yelp, Flickr, IMDB, Amazon.com, Craigslist, GitHub, SourceForge, every messageboard or site with comments, 4Chan, Encyclopedia Dramatica. Plus the entire Open Source movement.

Each of these cases taps into our desire to respond. Unlike other media, the comments section on news articles, or our ability to sign up for an account and start providing our thoughts and ideas on Twitter or in a Tumblr, is fundamentally about this desire to be consulted.

Duty Calls

The logic of Why Wasn’t I Consulted is evident in one of my favorite XKCD cartoons. In Duty Calls we find ourselves compelled to stay up late and correct the errors of others’ ways on the web. In Ford’s view, this kind of compulsion, this need to jump in and correct things, to be consulted, is something that we couldn’t act on with other kinds of media and it is ultimately one of the things that powers and drives many of the most successful online communities and projects.

Returning to the example from Galaxy Zoo, where the carefully designed human computation classification exercise provides one kind of input, the project’s very active web forums capitalize on the opportunity to consult. Importantly, some of the most valuable discoveries in the Galaxy Zoo project, including an entirely new kind of green-colored galaxy, were the result of users sharing and discussing some of the images from the classification exercise in the open discussion forums.

Comparing and Contrasting

To some extent, you can think about human computation and the wisdom of crowds as opposing poles of crowdsourcing activity. I have tried to sketch out some of what I see as the differences in the table below.

Dimension          | Human Computation       | Wisdom of Crowds
Tools              | Sophisticated           | Simple
Task Nature        | Highly structured       | Open ended
Time Commitment    | Quick & Discrete        | Long & Ongoing
Social Interaction | Minimal                 | Extensive Community Building
Rules              | Technically Implemented | Socially Negotiated

When reading over the table, think about the difference between something like the Google Image Labeler for human computation and Wikipedia for the wisdom of crowds. The former is a sophisticated little tool that prompts us to engage in a highly structured task for a very brief period of time. It comes with almost no time commitment, and there is practically no social interaction. The other player could just as well be a computer for our purposes and the rules of the game are strictly moderated by the technical system.

In contrast, something like Wikipedia makes use of, at least from the user experience side, a rather simple tool. Click edit, start editing. While the tool is very simple, the nature of our task is huge and open-ended: help write and edit an encyclopedia of everything. While you can do just a bit of Wikipedia editing, its open-ended nature invites much more long-term commitment. Here there is an extensive community building process that results in the social development and negotiation of rules and norms for what behavior is acceptable and what counts as inside and outside the scope of the project.

To conclude, I should reiterate that we can and should think about human computation and the wisdom of crowds not as an either/or decision for crowdsourcing but as two components that are worth designing for. As mentioned earlier, Galaxy Zoo does a really nice job of this. The image label game is quick, simple and discrete and generates fantastic scientific data. Beyond this, the open web forum where participants can build community through discussion of the things they find brings in the depth of experience possible in the wisdom of crowds. In this respect, Galaxy Zoo represents the best of both worlds. It invites anyone interested to play a short and quick game, and if they want to they can stick around and get much more deeply involved. They can discuss and consult and in the process actually discover entirely new kinds of galaxies. I think the future here is going to be about knowing what parts of a crowdsourcing project are about human computation and which parts are about the wisdom of crowds, and getting those two things to work together and reinforce each other.

In my next post I will bring in a bit of work in educational psychology that I think helps to better understand the psychological components of crowdsourcing. Specifically, I will focus in on how tools serve as scaffolding for action and on contemporary thinking about motivation.