Discovery and Justification are Different: Notes on Science-ing the Humanities

Computer Scientist: “You can’t do that with Topic Modeling.”

Humanist: “No, I can because I’m not a scientist. We have this thing called Hermeneutics.”

Computer Scientist: “…”

Humanist: “No really, we get to do what we want, we read texts against each other, and then there is this hermeneutic circle grounded in intersubjectivity.”

Computer Scientist: “Ok, but you still can’t make a claim using this as evidence.”

Humanist: “I think we are going to have to agree to disagree here, I think we have different ideas about how evidence works.”


While watching the tweets from the Digital Humanities Topic Modeling meeting a few weeks ago I started to feel the above dialog play out. I wasn’t there, and I am not trying to pigeonhole anyone here. I’ve seen this kind of back and forth happen in a range of different situations where humanities types start picking up and using algorithmic, computational, and statistical techniques. What of all this counts for what? What can you say based on the results of a given technique? One way to resolve this is to say that humanists and scientists should have different rules for what counts as evidence. I am increasingly feeling the need to reject this different rules approach.

I don’t think the issue here is different ways of knowing, incompatible paradigms, or anything big and lofty like that. I think the issue at the heart of this back and forth dialog is about two different contexts. This is about what you can do in the generative context of discovery vs. what you can do in the context of justifying a set of claims.

Anything goes in the generative world of discovery
If something helps you see something differently then it’s useful. If you stuff a bunch of text into Wordle and see a word really big that catches you by surprise you can go back to the texts with this different way of thinking and see why that would be the case. If you shove a bunch of text through MALLET and see some strange clumps clumping that make you think differently about the sources and go back to work with them, great. You have used the tool to spark a different way of seeing and thinking.

If you aren’t using the results of a digital tool as evidence then anything goes. More specifically, if you aren’t trying to attribute particular inferential value to a particular process that process is simply producing another artifact which you can then go about considering, exploring, probing and analyzing.  I take this to be one of the key values of the idea of “deformance.” The results of a particular computational or statistical tool don’t need to be treated as facts, but instead can be used as part of an ongoing exploration. With this said, the moment you turn from exploration and theorizing to justifying an interpretation the whole game changes.

Justification is About Argument and Evidence
If you want to use something as evidence then it is really important that you can back up the quality of that evidence in supporting the specific claims you want to make. In the case of topic modeling, you need to make judgment calls about how many topics to look for, and you make the call about which texts from which sources go into the mix to generate your topics. If you want to talk about these topics as evidence to support particular inferences then you had better be able to justify your reasons for those decisions, or be able to explain what you did with your data to warrant the interpretation you are forwarding. You are also going to need to explain how different decisions about different inputs could have produced different results. (I am mostly going off of the discussion in and around Ben Schmidt’s “When you have a MALLET, everything looks like a nail.”)

The net result here is that if you want to use the results of something like topic modeling as evidence you really need to have a good understanding of exactly what you can and can’t say based on how the tool produced your evidence. Importantly, there are a lot of different roads to go down when you start working with data as evidence, but in any event, you do need to be able to justify your decisions and defend against alternative explanations. Ultimately this is where validity of inferences lives. Validity is always about the quality of the inferences you draw and your ability to defend against alternative explanations.

It’s the Scientists that Realized they were Humanists
At the heart of this remain some issues around what it means to do the humanities or to do science. (Fred and I got into this a bit in our Hermeneutics of Data essay). I still hear this persistent fear of people using computational analysis in the humanities bringing about scientism, or positivism. The specter of Cliometrics haunts us. This is completely backwards.

Scientists, at least the sharp ones, have given up on their holy grail. They have given up on the null hypothesis. The sophisticated ones have realized that what they do is really just argument and evidence too. When it comes to justification time, you need to carefully build an argument grounded in evidence and defend it against alternate explanations. If you want a great recent example of this sort of argument and evidence grounded in statistics I would suggest Nate Silver’s Simple Case for Obama as the Favorite or, if you want a natural science example, read about this paper on arctic sea ice. Both are great examples of defending against different interpretations of evidence.

What you can get away with depends on what you are doing

When we separate out the context of discovery and exploration from the context of justification we end up clarifying the terms of our conversation. There is a huge difference between “here is an interesting way of thinking about this” and “This evidence supports this claim.” Both scientists and humanists make both of these kinds of assertions. In general, I think the fear of the humanities becoming scientific is largely based on an outmoded idea on the part of humanists as to what we have come to understand happens in science. At the end of the day, both are about generating new ideas and then exploring evidence to see to what extent we can justify our interpretations over a range of other potential interpretations.

Do Less More Often: An Approach to Digital Strategy for Cultural Heritage Orgs

Everybody is trying to do too much at once. Find the low hanging fruit and pick it. Get the boxes off the floor. Release early and release often. Put things out there and find out how you should be doing things. I think this idea cuts across all parts of digital cultural heritage work. Everything from collecting, processing, arranging, and preserving to making available and exhibiting can be re-framed in this mindset. This was the primary sentiment I put forward in my Keynote talk at the Connecticut Digital Initiatives Forum. At some point I might sit down and write this out, but I figured I would share it here.

Also, here are the slides in case you would prefer to see the presentation instead of sitting through my yammering.

I went up to talk about Viewshare, but was then also delighted/dismayed to be asked to give the Keynote. I think it went well, and I was apparently on TV across the great state of Connecticut.

Are Online Communities Places or Artifacts?

I’m sympathetic to two ways of thinking about online communities that are somewhat inconsistent with each other. The web is a stack of communication technologies (both software and hardware) and should be studied in the same way that one would study the pony express, telegraphy, or the book. Yet the web has communities: things that emerge through ongoing social interaction, where people spatialize the communication technology to “lurk,” “hang out,” and talk about the other kinds of people that do things differently over there. Online communities end up feeling like places, and when we interact in those places with people who are similar in some ways and different in others, we end up with cultures.

The Myth of Cyberspace and Possibility of Being There

I realize full well that the web isn’t a space. I’m with PJ Rey on the entire Myth of Cyberspace. It doesn’t have dimensions, it is a stack of technologies (hardware and software). More specifically it is a constellation of technologies assembled in different arrangements by different individuals. However, that stack/constellation clearly creates cultures. Now sure, books create cultures, telephones create cultures, and the postal service creates cultures. With that said, those republics of letters and literary cultures aren’t really the same kinds of culture that one studies in an ethnography. I mean, imagine pen-pal-nography, telegram-nography: they just sound wrong. You can talk about a republic of letters all you like, but the moment you start saying you are doing an ethnography of letters someone is going to tell you you’re doing it wrong. When you study letters you are studying documents. We study documents as a species of artifact. Yes, we learn about culture through that study (that would be the entire idea of material culture), but we don’t think of reading letters as “participant observation.”

With all this said, I still think the idea of “netnography” totally makes sense in a way that all those other -nographies don’t. Something about the medium of the web (I’d hazard its immediacy, two-way nature, the placey-ness of URLs as locations) ends up giving us the things that we need to think about it as a place and gives us the experiences that we need to really make cultures happen. That is, we are thrown into a thing that works like proximity to others in which we interact with them and develop some shared ways of being in the world while retaining a whole host of dissonant and contradictory feelings about things.

Putting the Field in Computer Mediated Field Work

If you are unfamiliar with the idea of netnography I would suggest Kozinets’s book, Netnography: Doing Ethnographic Research Online. In contrast to the idea of “virtual ethnography,” Kozinets is part of a group of researchers who get behind the idea of “netnography.” (Rightly, these folks acknowledge that there is nothing “virtual” about the web; it’s a real thing). The decision to shift to use netnography instead of ethnography comes from a sense that studying online communities is so substantively different from studying them in physical space that it needs a whole different term. That is, you can study how existing communities use the web alongside other modes of interaction, but there are also communities that exist solely as a result of particular web forums, listservs, and such.

In the last few weeks I’ve read and re-read Netnography, switching between modes of enthusiastic underlining (YES! That is it!). For example, when Kozinets talks about “alteration” recognizing that in online communities “the nature of the interaction is altered—both constrained and liberated—by the specific nature and rules of the technological medium in which it is carried.” (68) However, there are other moments in which I scrawl disapproving marginalia. For example, when I see terms like “online-fieldsite” (NO, the web is not a place and we shouldn’t pretend it is!). I think I can get behind “computer mediated fieldwork,” which he uses in other places, but I’m not sure I can go to “fieldsite.”

Can we talk of “Participant Observation” when we aren’t observing people?

I’ve gone back and forth in my head about Kozinets’s idea that we do “participant observation” when we study interactions in an online community. How can we talk of observing participants when we are actually observing artifacts? He suggests that our actions in online communities, our clicks, our keystrokes, are effectively utterances. Which is true, but at the same time, when we study those utterances it isn’t like when we experience someone talking to us; documents are being created and we are reading them. It is effectively the same as reading a letter. Still, I think those specific features of the web mediums end up making this a situation where we can get away with the “participant observation” metaphor. Yes, if a netnographer jumps into an online community and starts to engage in the ebb and flow of exchange they are doing something that may have more in common with direct participation than with the hermeneutic interpretation of documents.

Theorizing and Interpreting Kinds of Online Community Data

Kozinets discusses three types of data: archival data (data copied from “pre-existing computer-mediated communications of online community members”), elicited data (data co-created with “culture members through personal and communal interaction”), and fieldnote data (the researcher’s own notes, observations and self reflections). He suggests that his categories are similar to Wolcott’s notions of qualitative researchers “watching, asking and examining” and Miles and Huberman’s focus on studying “documents, interviews and observations” as kinds of data to interpret. These are potentially useful comparisons, and as we need to come up with ways to fit new things into old boxes to make sense of them I can get behind the impulses here.

What’s at issue here is how much the experience of participating in an online community is like participating in communities that occupy physical space. I think this is particularly tricky in that some of the features that make the web a rather unique medium are the things that give online communities their place-like qualities. To attend to the mediality of the web is to recognize it has this set of place-like or place-affording qualities.

“Archival data” Transcript, Recording, or Encoding

Kozinets struggles a bit to explain “archival data,” meaning not data that is being collected and organized by an archive, but archival in the much more nebulous sense that has come to mean old-stuff-that-is-still-around-for-some-reason. At one point, he suggests that the wide availability of this archival data in previous discussion on the boards or old email threads from listservs would be the equivalent of “every public conversation being recorded and made available as transcripts.” However, importantly, a listserv archive and old posts to discussion boards are not “recordings” of what transpired; they are what transpired. The creation of the “archive” is to some extent embedded in the act of communicating through these mediums. With that said, if you aren’t experiencing these exchanges as they happen then there are going to be issues that require you to reconstruct context and make sure that what you are looking at is authentically what was created at the time you want to make inferences about. That is, people edit their posts on discussion boards, users delete their accounts and the contextual information about who they were is often erased, and site administrators prune away or remove posts over time. Generally, what we colloquially call an archive with these kinds of online communities is really a pile of things that have some connection to the past but haven’t really been worked over or documented. In any event, it is critical to not take for granted that you are looking at accurate recordings of the past, but to think about the provenance and particular constellations of technologies and users that made it possible for you to look at recordings of previous interactions between members of an online community.

So what can we do with these records of discourse? Kozinets suggests that “Archival cultural data provide what amounts to a cultural baseline. Saved communal interactions provide the netnographer with a convenient bank of observational data that may stretch back for years.” (104) I’m not sure that this works. I don’t think we can talk about this archival data as “observational data.” It is not something you observed; it is a set of documentary evidence whose provenance and context you need to establish, and which you can then interpret in the way a historian interprets any textual records. When it isn’t currently happening you aren’t observing it. These utterances become documents as they slide out of the present and into the past.

So are Online Communities Places or Objects

I feel like the answer here has to be something like, they are objects (or specifically assemblages of hardware and software technologies and protocols) that produce place-like experiences. So, it makes sense to try and figure out what it is like to be “a redditor” or to study how redditors interact with each other and the kinds of communities that emerge there. With that said, reddit isn’t a fieldsite. Reddit is software, a database, and a set of bits on a series of servers accessible over HTTP.

All of those objects create and log communication in such a way that they take on place-like qualities. People lurk in some subreddits, they build relationships with the folks they come into contact with, they develop some shared and conflicting ideas about the world. In short, people create cultures through the affordances of the technologies. That cultural component, the way people use these things, gets rolled back into changing the structure and nature of the technologies that afford the place-like qualities.

A Note on Determinisms and Co-Construction

Importantly, this does not mean that they “co-construct” each other. Kozinets nods to this in the beginning of the book. The idea that the forces of technological determinism and social construction of technology have come together in a kumbaya moment where technology and culture each construct each other feels too wishy-washy. Objects and artifacts afford and resist, people interact and interpret (often drawing on their own cultural tool kits or their internal representations of generalized others) and the social or the cultural emerges through this network of actors and actants. That’s at least my best stab at this for now. So yes, it’s not an either/or, but I think it’s too much of a gloss to say it’s co-construction.

Open questions?

I’d love to hear how other folks parse out these distinctions. What kind of thing is an online community and where are the limits of talking about them as places, as cultures, as technologies and as documents? Do you agree with how I am parsing this out? Or do you think I’m way off base here?

Software as Scaffolding and Motivation and Meaning: The How and Why of Crowdsourcing

Libraries, archives and museums have a long history of participation and engagement with members of the public. I have previously suggested that it is best to think about crowdsourcing in cultural heritage as a form of public volunteerism, and that much discussion of crowdsourcing is more specifically about two distinct phenomena, the wisdom of crowds and human computation. In this post I want to get into a bit more of why and how it works. I think understanding both the motivational components and the role that tools serve as scaffolding for activity will let us be a bit more deliberate in how we put these kinds of projects together.

The How: To be a tool is to serve as scaffolding for activity

Helping someone succeed is often largely about getting them the right tools. Consider the image of scaffolding below. The scaffolding these workers are using puts them in a position to do their job. By standing on the scaffolding they are able to do their work without thinking about the tool at all. In the activity of the work the tool disappears and allows them to go about their tasks taking for granted that they are suspended six or seven feet in the air. This scaffolding function is a generic property of tools.

All tools can act as scaffolds to enable us to accomplish a particular task. At this point it is worth briefly considering an example of how this idea of scaffolding translates into a cognitive task. I will briefly describe a process that is part of a park ranger’s regular work: measuring the diameter of a tree. This example comes from Roy Pea’s “Practices of Distributed Intelligence and Designs for Education.”

If you want to measure a tree you take a standard tape measure and do the following:

  1. Measure the circumference of the tree
  2. Remember that the diameter is related to the circumference of an object according to the formula circumference/diameter = π (roughly 3.14)
  3. Set up the formula, replacing the variable circumference with your value
  4. Cross-multiply
  5. Isolate the diameter by dividing
  6. Reduce the fraction

Alternatively, you can just use a measuring tape that has the algorithm for diameter embedded inside it. In other words, you can just get a smarter tape measure. You can buy a tape-measure that was designed for this particular situation that can think for you (see the image below). Not only does this save you considerable time, but you end up with far more accurate measurements. There are far fewer moments for human error to enter into the equation.

The design of the tape measure has quite literally embedded the equations and cognitive actions required to measure the tree. As an aside, this kind of cognitive extension is a generic component of how humans use tools and their environments for thought.
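To make that embedded arithmetic concrete, here is a minimal sketch in Python of the calculation the ranger would otherwise do by hand, along with the trick a diameter tape relies on. The function names and the 94.2 cm example are my own illustrations, not anything from Pea’s essay, and I am assuming the tape works by printing each diameter label at the point where the matching circumference falls.

```python
import math


def diameter_from_circumference(circumference):
    """Do steps 2 through 6 of the manual procedure in one move:
    solve circumference / diameter = pi for the diameter."""
    if circumference <= 0:
        raise ValueError("circumference must be positive")
    return circumference / math.pi


def tick_position(diameter_label):
    """Distance along the tape where a given diameter label is printed
    (an assumption about how such a tape is laid out). The division by pi
    is done once, at design time, so the ranger only has to read a number."""
    return diameter_label * math.pi


# A trunk that measures 94.2 cm around is roughly 30 cm across.
print(round(diameter_from_circumference(94.2), 1))  # prints 30.0
```

The second function is the scaffolding idea in miniature: whoever lays out the tape does the math once so that nobody in the field has to do it again.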

This has a very direct translation into the design of online tools as well. For example, before joining the Library of Congress I worked on the Zotero project, a free and open source reference management tool. Zotero was translated into more than 30 languages by its users. The translation process was made significantly easier through BabelZilla. BabelZilla, an online community for developers and translators of Firefox extensions, has a robust community of users that work to localize various extensions. One of the neatest features of this platform is that it strips out the strings of text that need to be localized from the source code and then presents the potential translator with a simple web form where they just type in translations of the lines of text. You can see an image of the translation process below.

This not only makes the process much simpler and quicker, it also means that potential translators need zero programming knowledge to contribute a localization. Without BabelZilla, a potential translator would need to know about how Firefox Extension locale files work, and be comfortable with editing XML files in a text editor. But BabelZilla scaffolds the user over that required knowledge and just lets them fill out translations in a web form.
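A rough sketch of that kind of scaffolding, in Python, might look like the following. This is not BabelZilla’s actual code, which I have not seen; it assumes locale strings stored as DTD-style entity declarations, and the sample keys, the sample strings, and both function names are hypothetical.

```python
import re

# Hypothetical sample of a locale file, assuming DTD-style entity declarations.
SAMPLE_LOCALE = '''
<!ENTITY myext.save.label "Save to Library">
<!ENTITY myext.cancel.label "Cancel">
'''

# Matches one declaration and captures its key and its quoted string.
ENTITY_PATTERN = re.compile(r'<!ENTITY\s+([\w.\-]+)\s+"([^"]*)">')


def extract_strings(locale_text):
    """Pull out just the key/string pairs a translator needs to see."""
    return dict(ENTITY_PATTERN.findall(locale_text))


def apply_translations(locale_text, translations):
    """Rebuild the file with the translator's text swapped in.
    Strings without a translation keep their original wording."""
    def swap(match):
        key, original = match.group(1), match.group(2)
        return '<!ENTITY {} "{}">'.format(key, translations.get(key, original))
    return ENTITY_PATTERN.sub(swap, locale_text)


# The translator only ever deals with plain key/string pairs.
for key, text in extract_strings(SAMPLE_LOCALE).items():
    print(key, "->", text)

print(apply_translations(SAMPLE_LOCALE, {"myext.save.label": "Guardar en la biblioteca"}))
```

The point of the sketch is the division of labor: the file format lives entirely inside the two functions, so the person filling in the form never sees it. A real tool would also need to handle things this ignores, such as translations that contain quotation marks.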

Returning, as I often do, to the example of Galaxy Zoo, we can now think of the classification game as a scaffold which allows interested amateurs to participate at the cutting edge of scientific inquiry. In this scenario, the entire technical apparatus, all of the equipment used in the Sloan Digital Sky Survey, the design of the Galaxy Zoo site, and the work of all of the scientists and engineers that went into those systems are all part of one big hunk of scaffolding that puts a user in the position to contribute to the frontiers of science through their actions on the website.

I like to think that scaffolding is the how of crowdsourcing. When crowdsourcing projects work it is because of a nested set of platforms stacked one on top of the other, that let people offer up their time and energy to work that they find meaningful. The meaningful point there is the central component of the next question. Why do people participate in Crowdsourcing projects?

The Why: A Holistic Sense of Human Motivation

Why do people participate in these projects? Let’s start with an example I have appealed to before from a crowdsourcing transcription project.

Ben Brumfield runs a range of crowdsourcing transcription projects. At one point in a transcription project he noticed that one of his power users was slowing down, cutting back significantly on the time they spent transcribing these manuscripts. The user explained that they had seen that there weren’t that many manuscripts left to transcribe. For this user, the 2-3 hours a day they spent working on transcriptions was such an important part of their day that they had decided to deny themselves some of that experience. For this user, participating in this project was so important to them, contributing to it was such an important part of who they see themselves as, that they needed to ration out those remaining pages. They wanted to make sure that the experience lasted as long as they could. When Ben found that out he quickly put up some more pages. This particular story illustrates several broader points about what motivates us.

After a person’s basic needs are covered (food, water, shelter etc.) they tend to be primarily motivated by things that are not financial. People identify and support causes and projects that provide them with a sense of purpose. People define themselves and establish and sustain their identity and sense of self through their actions. People get a sense of meaning from doing things that matter to them. People find a sense of belonging by being a part of something bigger than themselves. For a popular account of much of the research behind these ideas see Drive: The Surprising Truth About What Motivates Us. For some of the more substantive and academic research on the subject see essays in The Handbook of Competence and Motivation and Csíkszentmihályi’s work on Flow.

Projects that can mobilize these identities (think genealogists, amateur astronomers, philatelists, railfans, etc.) and senses of purpose, and that offer a way for people to make meaningful contributions, are far from exploiting people; they provide us with the kinds of things we define ourselves by. We are what we do, or at least we are the stories we tell others about what we do. The person who started rationing out their work transcribing those manuscripts did so because that work was part of how they defined themselves.

This is one of the places where Libraries, Archives and Museums have the most to offer. As stewards of cultural memory these institutions have a strong sense of purpose and their explicit mission is to serve the public good. When we take this call seriously, and think about what the collections of cultural heritage institutions represent, crowdsourcing stops looking like a kind of exploitation of labor and has the possibility to be a way in which cultural heritage institutions connect with the public and provide meaningful experiences with the past.

Archives as Discovery Zones

I loved Discovery Zone when I was a kid. If you’re unfamiliar, it was this amazingly massive pile of kid sized tubes and ball pits. Like someone took that part of Chuck E. Cheese by the ball pit and multiplied it by a magnitude of awesome. Parents didn’t really fit up in that network of tubes, and there was a giant rope spiderweb suspended in the air. And the slides. So many slides.

"Sacha at Discovery Zone 1996" that is what it felt like

It was its own little world up there. You made the face that Sacha is making over there in that cc licensed picture I found on Flickr. The first time I saw an ad for it I just had to go there. I had to climb up into that wondrous world.

At this point, I think there are substantive parallels between the sense of discovery that happened there, that sense of wonder and exploration, and visiting an archive. Not only do I think that we discover things in archives, I think of archives as discovery zones.

What made Discovery Zone so cool is that they took all of this stuff that was fun to play on and then strategically and systematically organized and arranged it in such a way that it became something new. It became the Discovery Zone. It was an engineered place for exploratory play and imagination. The work that goes into an archive, deciding what to keep, how to arrange and organize it, how to tell folks what you have, is what makes the archive a discovery zone. Now as an adult I don’t really have a desire to climb around a series of tubes. With that said, archives now offer me a similar sense of wonder. A place to encounter strategically organized and engineered traces of the past. Each Hollinger box offers another chance to discover something we didn’t really know, or something we hadn’t thought about in the way I might think about it, or just the opportunity to touch and connect directly with a physical trace of another world.

What does and doesn’t count as discovery

This set of memories of Discovery Zone is brought to you by a recent back and forth around the discovery of an Army surgeon’s first person report from treating President Lincoln immediately after he was shot in Ford’s Theatre. Suzanne Fischer suggested “If You ‘Discover’ Something in an Archive, It’s Not a Discovery.” At the heart of the post is a wish that there might be “more articles headlined ‘Thorough, Accurate Cataloging Pays Off!’” I wholeheartedly agree with the latter statement. To this, Ed Summers responds “Saying that there is no discovery in libraries and archives, because all the discovery has been pre-coordinated by librarians and archivists is putting the case for the work we do too strongly.” Which I also agree with; archives are places to explore and discover, and celebrating moments when we find things we didn’t know in archives helps broadcast the invitation to discover. Lastly, Helena Iles Papaioannou, the discoverer in this case, has declared “Actually, Yes, It *Is* a Discovery If You Find Something in an Archive That No One Knew Was There.” I’m not so much interested in the exact details in this case as I am in the broader question of what kinds of discovery happen in the archive.

Different kinds of historical discoveries
Very rarely, we can discover an object of historical importance in a place no one would have expected it to be. Hidden in the walls, out in someone’s attic, buried deep underground; these are all places where one can discover a thing. These kinds of discoveries are really exciting but most discovery happens in the exploration of materials that have been carefully organized and arranged.

In contrast, there is a kind of discovery that happens when one closely reads documents that we already knew were there but that no one had spent the time to extensively analyze. The tried and true example in this case would be something like Martha Ballard’s diary. People have known about the thing for a long time and kept it around, but not until Laurel Thatcher Ulrich turned her attention to it, with a very different frame of reference, did we all discover how amazing a text it is for understanding her life. I would call this kind of discovery finding/building new knowledge. This kind of discovery in history generally happens when researchers use carefully organized, weeded, and arranged collections which have been processed and taken care of by archivists, librarians and curators.

Lastly, I would add that there is a sense of personal discovery. While there is a lot that is known, we all come to know it individually and it comes to mean something to us as individuals. I have spent a good bit of time in archives and I don’t think anything I’ve done has constituted discovery. Some of what I have found has been interesting enough to publish.

From Dusty Archive to Archive as Discovery Zone
So I love the sentiment behind Suzanne’s post; doing all we can to banish “dusty archive” from our vocabulary is a good idea. This is particularly true in times like this, when budgets are stretched and people are looking for things to cut. It is really important that we try and foreground the critical role that libraries, archives, and museums play in gathering, organizing, exhibiting, preserving, interpreting and providing access to cultural heritage. With that said, I think the message that we want to send is better delivered as “hey guys, come inside, we have all kinds of stuff that we have organized coherently and consciously developed as coherent collections, come in and make discoveries.” Much like Discovery Zone, archives are these amazingly cool places where all kinds of rich historical information is at your fingertips.

This all leaves me with a continuing question. How do we strike the balance between recognizing, paying tribute to, and celebrating the work of those who collect, preserve, exhibit, and organize traces of the past while also making clear that there is so much that could be learned by those who come and explore?

The Crowd and The Library

Libraries, archives and museums have a long history of participation and engagement with members of the public. In a series of blog posts I am going to work to connect these traditions with current discussions of crowdsourcing. Crowdsourcing is a bit of a vague term, one that comes with potentially exploitative ideas related to uncompensated or undercompensated labor. In this series I’ll try to put together a set of related concepts: human computation, the wisdom of crowds, thinking of tools and software as scaffolding, and understanding and respecting end users’ motivation. Together these can help clarify what crowdsourcing can do for cultural heritage organizations while also defining a clearly ethical approach to inviting the public to help in the collection, description, presentation, and use of the cultural record.

This series of posts started out as a talk I gave at the International Internet Preservation Consortium’s meeting earlier this month. I am sharing these ideas here in the hope that I can get some feedback on this line of thinking.

The Two Problems with Crowdsourcing: Crowd and Sourcing

There are two primary problems with bringing the idea of crowdsourcing into cultural heritage organizations. Both the idea of the crowd and the notion of sourcing are terrible terms for folks working as stewards for our cultural heritage. Many of the projects that end up falling under the heading of crowdsourcing in libraries, archives and museums have not involved massive crowds, and they have very little to do with outsourcing labor.

Most successful crowdsourcing projects are not about large anonymous masses of people. They are not about crowds. They are about inviting participation from interested and engaged members of the public. These projects can continue a longstanding tradition of volunteerism and involvement of citizens in the creation and continued development of public goods.

For example, the New York Public Library’s menu transcription project, What’s on the Menu?, invites members of the public to help transcribe the names and costs of menu items from digitized copies of menus from New York restaurants. Anyone who wants to can visit the project website and start transcribing the menus. However, in practice it is a dedicated community of foodies, New York history buffs, chefs, and otherwise self-motivated individuals who are excited about offering their time and energy to help contribute, as volunteers, to improving the public library’s resource for others to use.

Not Crowds but Engaged Enthusiast Volunteers

Far from a break with the past, this is a clear continuation of a longstanding tradition of inviting members of the public in to help refine, enhance, and support resources like this collection. In the case of the menus, years ago, it was actually volunteers who sat at a desk in the reading room to catalog the original collection. In short, crowdsourcing the transcription of the menus is not about crowds at all. It is about using digital tools to invite members of the public to volunteer in much the same way members of the public have volunteered to help organize and add value to the collection in the past.

Not Sourcing Labor but an Invitation to Meaningful Work

The problem with the term sourcing is its association with labor. Wikipedia’s definition of crowdsourcing helps further clarify this relationship: “Crowdsourcing is a process that involves outsourcing tasks to a distributed group of people.” The keyword in that definition is outsourcing. Crowdsourcing is a concept that was invented and defined in the business world, and it is important that we recast it and think through what changes when we bring it into cultural heritage. Cultural heritage institutions do not care about profit or revenue; they care about making the best use of their limited resources to act as stewards and storehouses of culture.

At this point, we need to think for a moment about what we mean by terms like work and labor. While it might be ok for commercial entities to coax or trick individuals to provide free labor, the ethical implications of such trickery should give pause to cultural heritage organizations. It is critical to pause here and unpack some of the different meanings we ascribe to the term work. When we use the term “a day’s work” we are directly referring to labor, to the kinds of work that one engages in as a financial transaction for pay. In contrast, when we use the term work to refer to someone’s “life’s work” we are referring to something that is significantly different. The former is about acquiring the resources one needs to survive. The latter is about the activities that we engage in that give our lives meaning. In cultural heritage we have clear values and missions and we are in an opportune position to invite the public to participate. However, when we do so we should not treat them as a crowd, and we should not attempt to source labor from them. When we invite the public we should do so under a different set of terms. A set of terms that is focused on providing meaningful ways for the public to interact with, explore, and understand the past.

Citizen Scientists, Archivists and the Meaning of Amateur

Some of the projects that fit under the heading of crowdsourcing have chosen very different kinds of terms to describe themselves. For example, the Galaxy Zoo project, which invites anyone interested in astronomy to help catalog a million images of stellar objects, refers to its users as citizen scientists. Similarly, the United States National Archives and Records Administration’s recently launched crowdsourcing project, the Citizen Archivists Dashboard, invites citizens, not members of some anonymous crowd, to participate. The names of these projects highlight the extent to which they invite participation from members of the public who identify with the characteristics and ways of thinking of particular professional occupations. While these citizen archivists and scientists are not professional, in the sense that they are unpaid, they connect with something a bit different than volunteerism. They are amateurs in the best possible sense of the term.

Amateurs have a long and vibrant history as contributors to the public good. Coming to English from French, the term amateur means a “lover of.” The primarily negative connotations we place on the term are a relatively recent development. In other eras, the term amateur simply meant that someone was not a professional, that is, they were not paid for these particular labors of love. Charles Darwin, Gregor Mendel, and many others who made significant contributions to the sciences did so as amateurs. As a continuation of this line of thinking, the various Zooniverse projects see the amateurs who participate as peers, in many cases listing them as co-authors of academic papers published as a result of their work. I suggest that we think of crowdsourcing not as extracting labor from a crowd, but as a way for us to invite the participation of amateurs (in the non-derogatory sense of the word) in the creation, development and further refinement of public goods.

Toward a better, more nuanced, notion of Crowdsourcing

With all this said, fighting against a word is rarely a successful project. From here on out I will continue to use and refine a definition for crowdsourcing that I think works for the cultural heritage sector. In the remainder of this series of posts I will explain what I think of as the four key components of this ethical crowdsourcing, this crowdsourcing that invites members of the public to participate as amateurs in the production, development and refinement of public goods. Each of these four considerations suggests a series of questions to ask of any cultural heritage crowdsourcing project. The four concepts are:

  1. Thinking in terms of Human Computation
  2. Understanding that the Wisdom of Crowds is Why Wasn’t I Consulted
  3. Thinking of Tools and Software as Scaffolding
  4. A Holistic Understanding of Human Motivation

Together, I believe these four concepts provide us with the descriptive language to understand what it is about the web that makes crowdsourcing such a powerful tool, not only for improving and enhancing data related to cultural heritage collections, but also as a way of engaging deeply with the public.

In the next three posts I will talk through and define these four concepts and offer up a series of questions to ask and consider in imagining, designing and implementing crowdsourcing projects at cultural heritage institutions.

Crowdsourcing Cultural Heritage: The Objectives Are Upside Down

Still not the droid… By Stéfan: Our crowdsourcing conversation is upside down, much like how Calculon is holding these stormtroopers upside down.

Some fantastic work is going on in crowdsourcing the transcription of cultural heritage collections. After some recent thinking and conversation on these projects I want to more strongly and forcefully push a point about this work. This is the same line of thinking I started nearly a year ago in Meaningification and Crowdscafolding: Forget Badges. I’ve come to believe that conversations about the objective of this work, as broadly discussed, are exactly upside down. Transcripts and other data are great, but when done right, crowdsourcing projects are the best way of accomplishing the entire point of putting collections online. I think a lot of the people who work on these projects think this way but we are still in a situation where we need to justify this work by the product instead of justifying it by the process.

Getting transcriptions, or for that matter getting any kind of data or work, is a by-product of something that is actually far more amazing than being able to better search through a collection. The process of crowdsourcing projects fulfills the mission of digital collections better than the resulting searches. That is, when someone sits down to transcribe a document they are actually better fulfilling the mission of the cultural heritage organization than anyone who simply stops by to flip through the pages.

Why are we putting cultural heritage collections online again?

There are a range of reasons that we put digital collections online. With that said, the single most important reason to do so is to make history accessible and invite students, researchers, teachers, and anyone in the public to explore and connect with our past. Historians, librarians, archivists, and curators who share digital collections and exhibits can measure their success toward this goal in how people use, reuse, explore and understand these objects.

In general, crowdsourcing transcription is first and foremost described as a means by which we can get better data to enable the kinds of use and reuse that we want people to make of our collections. In this respect, the general idea of crowdsourcing is described as an instrument for getting data that we can use to make collections more accessible. Don’t get me wrong, crowdsourcing does this. With that said, it does so much more than this. In the process of developing these crowdsourcing projects we have stumbled into something far more exciting than speeding up or lowering the costs of document transcription. Far better than being an instrument for generating data that we can use to get our collections more used, it is actually the single greatest advancement in getting people using and interacting with our collections. A few examples will help illustrate this.

Increased Use, Deeper Use, Crowdsourcing Civil War Diaries

Last year, the University of Iowa Libraries crowdsourced the transcription of a set of Civil War diaries. I had the distinct privilege of interviewing Nicole Saylor, the head of Digital Library Services, about the project. From any perspective the project was very successful. They got great transcriptions and they ended up attracting more donors to support their work.

The project also succeeded in dramatically increasing site traffic. As Nicole explained, “On June 9, 2011, we went from about 1000 daily hits to our digital library on a really good day to more than 70,000.” As great as all this is, as far as I’m concerned, the most valuable thing that happened is that when people come to transcribe the diaries they engage with the objects more deeply than they would have if transcription were not an option. Consider this quote from Nicole explaining how one particular transcriptionist interacted with the collection. It is worth quoting her at length:

The transcriptionists actually follow the story told in these manuscripts and often become invested in the story or motivated by the thought of furthering research by making these written texts accessible. One of our most engaged transcribers, a man from the north of England, has written us to say that the people in the diaries have become almost an extended part of his family. He gets caught up in their lives, and even mourns their deaths. He has enlisted one of his friends, who has a PhD in military history, to look for errors in the transcriptions already submitted. “You can do it when you want as long as you want, and you are, literally, making history,” he once wrote us. That kind of patron passion for a manuscript collection is a dream. Of the user feedback we’ve received, a few of my other favorites are: “This is one of the COOLEST and most historically interesting things I have seen since I first saw a dinosaur fossil and realized how big they actually were.” “I got hooked and did about 20. It’s getting easier the longer I transcribe for him because I’m understanding his handwriting and syntax better.” “Best thing ever. Will be my new guilty pleasure. That I don’t even need to feel that guilty about.”

The transcriptions are great, they make the content more accessible, but as Nicole explains, “The connections we’ve made with users and their sustained interest in the collection is the most exciting and gratifying part.” This is exactly as it should be! The invitation of crowdsourcing and the event of the project are the most valuable and precious user experiences that a cultural heritage institution can offer its users. It is essential that the project offer meaningful work. These projects invite the public to leave a mark and help enhance the collections. With that said, if the goal is to get people to engage with collections and engage deeply with the past, then the transcripts are actually a fantastic byproduct that is created by offering meaningful activities for the public to engage in.

Rationing out Transcription

Part of what prompted this post is a talk that Ben Brumfield gave on crowdsourcing transcription at the recent Institute of Museum and Library Services WebWise conference. It was a great talk, and when they get around to posting it online you should all go watch it. There was one particular moment in the talk that I thought was essential for this discussion.

At one point in a transcription project he noticed that one of his most valuable power users was slowing down on their transcriptions. The user had started to cut back significantly in the time they spent transcribing this particular set of manuscripts. Ben reached out to the user and asked about it. Interestingly, the user responded to explain that they had noticed that there weren’t as many scanned documents showing up that required transcription. For this user, the 2-3 hours they spent each day working on transcriptions was such an important experience, such an important part of their day, that they had decided to cut back and deny themselves some of that experience. The user needed to ration out that experience. It was such an important part of their day that they needed to make sure that it lasted.

At its best, crowdsourcing is not about getting someone to do work for you, it is about offering your users the opportunity to participate in public memory.

Crowdsourcing is better at Digital Collections than Displaying Digital Collections

What crowdsourcing does, that most digital collection platforms fail to do, is offer an opportunity for someone to do something more than consume information. When done well, crowdsourcing offers us an opportunity to provide meaningful ways for individuals to engage with and contribute to public memory. Far from being an instrument which enables us to ultimately better deliver content to end users, crowdsourcing is the best way to actually engage our users in the fundamental reason that these digital collections exist in the first place.

Meaningful Activity is the Apex of User Experience for Cultural Heritage Collections

When we adopt this mindset, the money spent on crowdsourcing projects, in terms of designing and building systems, staff time to manage them, and so on, is not something that can be compared to the costs of having someone transcribe documents on Mechanical Turk. Think about it this way: the transcription of those documents is actually a precious resource, a precious bit of activity that would mean the world to someone. It isn’t that any task or obstacle for users to take on will do. For example, if you asked users to transcribe documents that could easily be OCRed, the whole thing would lose its meaning and purpose. It isn’t about Sisyphean tasks; it is about providing meaningful ways for the public to enhance collections while more deeply engaging with and exploring them.

Just as Ben’s user rationed out the transcription of those documents, we might actually think about crowdsourcing experiences as one of the most precious things we can offer our users. Instead of simply offering them the ability to browse or poke around in digital collections we can invite them to participate. We are in a position to let our users engage in a personal way that is only for them at that moment. Instead of browsing through a collection they literally become a part of our historical record.

The Important Difference between Exploitation-ware and Software for the Soul

Slide from Ruling the World

As a bit of a coda, what is tricky here is that there is (strangely) an important but somewhat subtle line between exploiting people and giving people the most valuable kinds of experience that we can offer for digital collections. The trick is that gamification is (for the most part) bullshit. You can trick people into doing things with gimmicks, but when you do so you frequently betray their trust and can ruin the innately enjoyable nature of being a part of something that matters to you, in our case, the way that users could deeply interact with and explore the past via your online collections. What sucks about what has happened in the idea of gamification is that it is about the least interesting parts of games. It’s about leaderboards and badges. As Sebastian Deterding has explained, many times and many ways, the best parts of games, the things that we should actually try to emulate in a gamification that attempts to be more than pointsification or exploitationware, are the parts of games that let us participate in something bigger. It is the part of games that invites us to playfully take on big challenges. Be wary of anyone who tries to suggest we should trick people or entice them into this work. We can offer users an opportunity to deeply explore, connect with and contribute to public memory and we can’t let anything get in the way of that.

Explore and Share Cultural Heritage Collections with Notes for WebWise Talk

This is just a quick post to share the slides and links from the talk I am giving at WebWise today.

The talk starts by explaining the idea behind the tool. Specifically, it covers how making it easy to build interfaces to cultural heritage collections can help librarians, archivists, curators, and historians better understand relationships between objects in a collection, and how the tool can help them communicate those ideas to audiences. After explaining the kinds of interfaces you can make, I walk through a detailed example of what one of these views can do by looking at a prototype interface to the Samuel H. Kress Collection History Database, created by an archivist at the National Gallery of Art.

I wanted to make sure that everyone had links to all the views I mention. So here are all the links.

NDIIPP Partners Collections Interface: (On Viewshare) (Embedded on NDIIPP’s site): This is an interface to a collection of collections. It acts as a kind of directory for digital collections and it was created from a spreadsheet.

Fulton Street Trade Card View: (On Viewshare)
The Fulton Street Trade Card collection features 245 late 19th and early 20th century illustrated trade cards from merchants along the Fulton Street retail thoroughfare in Brooklyn, NY. Using a Viewshare pie chart view, the user is able to run queries and faceted searches on the cards’ metadata in ways a simple catalog or scroll would not allow. Using the facets you can limit the chart to a certain element, such as business type, and then get numbers and percentages about the subjects, format, or other elements of the cards’ content.

History of Fairfax County in Postcards: (On Viewshare): A very simple view from a simple spreadsheet. If you like, you can find the spreadsheet this is based on in the Viewshare documentation and work from it to get a sense of how the tool works.

Cason Monk-Metcalf Funeral Directors View: (On Viewshare): (My View on Viewshare): (Embedded on East Texas Digital Archives & Collections Site) This is one of the most interesting datasets uploaded to Viewshare. It is a set of data transcribed from historic funeral records.

Samuel H. Kress Collection History Database Prototype View: National Gallery of Art (On Viewshare) This view allows users to explore the relationships between purchase information for a work of art and other aspects of the object, including its current location. This data comes from the Samuel H. Kress Collection History and Conservation Database. The relational database documents the art collection’s acquisition, dispersal, and conservation over time and was created by the National Gallery of Art’s Gallery Archives with funding from the Samuel H. Kress Foundation. The data shared here is not complete. Viewshare data and views are intended only for preliminary demonstration of the data and should not be cited in research.

Debating the Digital Humanities Gets Real

My author copies of Debating the Digital Humanities came in today. It’s humbling to have some of my words included in such a hefty tome. I’ve been reading and enjoying it, great stuff. Beyond being a useful volume, it’s also neat to see it incorporate a selection of blog posts. The format of the book is itself an argument for how publish-then-filter can work for humanities scholarship.

It is fun and weird to see things I hadn’t intended for print in print. They have a different kind of materiality to them now. As my words ended up in two of the publish-then-filter parts of the book, I thought it might be slightly interesting to take a moment to reflect on how what I wrote ended up making its way in there.

Blogging about Course Blogging Goes to Print

I teach a digital history course at American University, and this is my second time around at the course. After teaching my first incarnation of the course I wrote a series of reflective blog posts about the experience. The goal of those posts was to distill and refine my thinking about the role that public blogging can play as an instructional tool. It is particularly pertinent to the digital history course as participating in online public dialog is a core goal of the course. I was both excited and flattered when Matt asked if I would be game for including one of my posts on the course in the book. See below.

It is fun and neat to have a post end up in a book, but it is also a bit disorienting. On my blog it was part of a threaded run of posts about my teaching and writing. I like to think that everything I write here always remains a draft. Everything I write here is something I might return to and revise. Undoubtedly there will be typos in this post that someone will point out that I will fix. But now, reading the post on paper in this volume, it feels completely different. Instead of being my informal thinking out loud on my teaching it has become something much more enduring. Just look at those typefaces! Such dignified serifs. It’s no longer some guy’s words on the internet. It’s a stake in the ground about the place of technology in teaching and learning in an emerging field. I love it.

Seeing the post in print helps further validate the point of the post and blogging in the course. It is one thing to stand up in front of a class of students and say, “hey, this blogging thing is important. It changes how power and publishing works. So take it serious, write good stuff and write it in public so you can claim credit.” It’s something completely different to be able to say, “Oh, and when you do blog, sometimes you say something interesting enough that it warrants being included in a really cool book.” When I tell my students about this next Wednesday I will have gone from course, to reflection, to book, to this blog post and back to course in seven months. I for one think that is rather neat.

Day of Digital Humanities Definitions

I have one other small contribution in the book. At the end of the first section is a selection of definitions of the digital humanities that some of us provided for the Day of Digital Humanities. See mine below, again in print, in the book.

What’s funny about this is that it’s a flippant comment, a personal aside. Here is some context. When you sign up to do the Day of Digital Humanities you fill out a web form, more or less a registration form. On the form there was a text box to fill in with your definition. It didn’t say “think about this really carefully, because it might end up in a big thick book.” So what I filled in was just what came off the top of my head at the time. To this end, it is all the more jarring to read something I had to fill in on a registration form printed like this. Jarring in a good way. I’m relatively happy with my definition. I’ll stand behind these jottings. Some of the value in these definitions is that they are not diplomatic. They are the things we had on hand at the moment and there is something that is a bit more direct and honest about those kinds of comments.

Trying to do the Digital Humanities Face

In conclusion, here is my best attempt at doing the debates in the digital humanities face. I should probably have shaved before taking the picture, but therein lie the perils of just being able to hit the publish button before anyone else intervenes to stop you.

Tripadvisor Rates Einstein: Traces of Public Memory and Science on the Web

Arguing with Einstein is one of my favorite photos of the Albert Einstein memorial. It encapsulates how some of the sculptor’s intentions, his argument about Einstein and science, manifest themselves in an invitation to argue with a statue. The seated statue invites us to sit on him, climb him, and argue with him, and it is my contention that sites like Yelp, Tripadvisor, and Flickr offer us the ability to explore and examine our relationship to these kinds of monuments and memorials in unprecedented ways.

Photo: Schmidt, C., 2008. Arguing with Einstein, Available at:

It’s been long in the making, but I am excited to report that my paper Tripadvisor rates Einstein: Using the social web to unpack the public meanings of a cultural heritage site is out in the newest issue of The International Journal of Web Based Communities. I did the primary research for this project back in my master’s program in a great course called Museums, Monuments and Memory. That was in the Fall of 2008. (I know, wow, that was a while ago. My, how time flies in the world of academic publishing.)

The paper is largely an attempt to parse out the different kinds of stories about sites of public memory that we can tell when we draw on traditional archival collections, in this case using materials from the National Academy of Sciences archives, as opposed to the kinds of stories we can tell when we look at traces of experience and interaction with those sites of memory online. In this case, I find it particularly interesting to try to evaluate how some of the intentions in the design of the monument show up in the kinds of things that we create online as a result of experiences with the memorial. My hope is that this can serve both as further validation of the value of preserving public discourse on the web and potentially as an example of how others might use social sites like Yelp, Flickr, and Tripadvisor to explore and interrogate public memory.

Below is the abstract for the paper. I would love to hear any comments or critiques in the comments. Similarly, if you end up using the paper in any way I would also love to hear about it.


Near the US Capitol, in front of the National Academy of Sciences sits a gigantic bronze statue of Albert Einstein. The monument was created to celebrate Einstein and the sense of awe and wonder his work represents. However, while under construction, art critics and some scientists derided the idea of the memorial. They felt the scale of such a giant memorial did not fit the modesty of Einstein. This paper explores the extent to which perspectives of the monument’s public supporters and critics can be seen in how people interact with it as evidenced in reviews and images of the monument posted online. I analyse how individuals appropriate the monument on social websites, including Flickr, Yelp, Tripadvisor, and Yahoo Travel, as a means to explore how the broader public co-creates the meaning of this particular memorial. I argue this case-study can serve as an example for leveraging the social web as a means to understand cultural heritage sites.

If you don’t have access to the official copy, I have my own unofficial personal archival copy that you can take a look at.