Implications for Digital Collections Given Historians’ Research Practices

The new ITHAKA report, Supporting the Changing Research Practices of Historians, is something that everybody working with cultural heritage collections should read. It’s full of good stuff, but in my opinion the key finding is that Google is now (by and large) the first step in historical research. Fred Gibbs and I reported on nearly the same finding in our recent paper on digital tools for historians. The Google search box is the first place historians go when they start their research, and it plays a key role in their discovery process. This is particularly true for idiosyncratic terms, phrases, and people’s names, which often turn up results from Google Books. So, the next time someone tells you that they want to make a “gateway,” a “portal,” or a “registry” of some set of historical materials, you can probably stop reading. It already exists, and it’s Google.

The report makes some suggestions for what libraries and archives should do to make their materials more accessible: namely, that they work to integrate them with discovery tools and that they do what they can to make more finding aids accessible online. Both of these are valuable, but I think both goals fail to fully take on board the finding about Google and Google Books. If a library, archive, or museum wants its resources to be found as part of the discovery process, the initial phase of theory development, it needs to be thinking about how to get its materials (or information about its materials) to show up in Google search results.

Are more and bigger online finding aids really an answer?

The report suggests that we cultural heritage organizations should be getting more finding aids up. That’s great; that would be useful. However, given the finding about Google, I think an even bigger potential lesson here is that if you want your collections to be used by researchers (digital or otherwise), the first thing you need to think about is not finding aids but making web pages about items, boxes, collections, etc. that will be discoverable in Google. In short, I would rather see a well-structured web page with a well-chosen title and persistent URL before one even begins to make a finding aid. This is not about SEO; it’s about doing very simple things that make for better HTML pages. Importantly, if an org makes a single PDF out of a finding aid for a collection and puts it on the web, that finding aid is almost useless as far as Google is concerned.
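To make “a well-structured web page with a well-chosen title and persistent URL” a bit more concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the function name `render_collection_page`, the example collection, and the `archives.example.org` address are all hypothetical, and a real page would carry far more description. The point is only that the title, a short summary, and a canonical link are cheap to produce.

```python
from html import escape


def render_collection_page(title: str, url: str, summary: str) -> str:
    """Return a minimal HTML page describing one archival unit.

    The arguments are hypothetical inputs: a descriptive title, the
    persistent URL the page will live at, and a one-sentence summary.
    """
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '  <meta charset="utf-8">\n'
        # A specific, descriptive title is what shows up in search results.
        f"  <title>{escape(title)}</title>\n"
        f'  <meta name="description" content="{escape(summary)}">\n'
        # The canonical link declares the one persistent address for this page.
        f'  <link rel="canonical" href="{escape(url)}">\n'
        "</head>\n"
        "<body>\n"
        f"  <h1>{escape(title)}</h1>\n"
        f"  <p>{escape(summary)}</p>\n"
        "</body>\n"
        "</html>\n"
    )


# Hypothetical example data, not a real collection.
page = render_collection_page(
    title="Jane Doe Papers, Box 3: Correspondence, 1912-1915",
    url="https://archives.example.org/collections/doe-papers/box-3",
    summary="Letters to and from Jane Doe about the 1913 suffrage parade.",
)
print(page)
```

Nothing here is search-engine trickery: the names, dates, and phrases a historian might type into a search box simply end up as plain text in the title and body of a page with a stable address.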

What would finding aids look like if they assumed the existence of the web and web search?

To me this raises a rather controversial question. If the goal of the finding aid is to help researchers find things, and the way they do that is to search Google (which is really good at looking for particular things in HTML pages), then why is the HTML page a byproduct of the EAD XML finding aid and not the primary thing that the archivist authors? We designed an infrastructure around EAD and found ways to turn that into HTML pages, but in the meantime Google came along, and historians found that Google was so much more useful and powerful a way to search that they only consult the finding aids to round out the ideas they have already started developing. So, what would minimal archival processing for access look like if we thought first about creating an HTML web page for every collection or every box?
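One hedged way to picture “a page for every collection or every box” is to start from nothing more than a collection name and a list of box numbers, derive a stable URL for each unit, and list those URLs in a sitemap so that crawlers can find them. The sketch below is only that, a sketch: `BASE_URL`, `slugify`, `unit_urls`, `build_sitemap`, and the URL pattern are all made up for illustration. The one thing taken from an external standard is the sitemap namespace, which comes from the sitemaps.org protocol.

```python
import re
import xml.etree.ElementTree as ET

BASE_URL = "https://archives.example.org"  # hypothetical site root
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def unit_urls(collection: str, boxes: list[int]) -> list[str]:
    """Return one URL for the collection, then one for each box in it."""
    root = f"{BASE_URL}/collections/{slugify(collection)}"
    return [root] + [f"{root}/box-{number}" for number in boxes]


def build_sitemap(urls: list[str]) -> str:
    """Serialize the URLs as a sitemap: a <urlset> of <url><loc> entries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical inventory: one collection with three boxes.
urls = unit_urls("Jane Doe Papers", [1, 2, 3])
sitemap = build_sitemap(urls)
print(sitemap)
```

The design choice worth noticing is that the URL is derived from the unit itself (collection, then box), so it can be minted at the moment a box is first described, long before a full finding aid exists.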

17 Replies to “Implications for Digital Collections Given Historians’ Research Practices”

  1. Hi Trevor,

    Thank you very much for your kind words about our project. Your observations about the need for libraries and archives to think about the discovery of their holdings in the context of a Google-dominated environment are spot on. At a minimum, they should ensure that they expose digitized collections, finding aids, and so forth, in a way that recognizes the importance of web search, book search, and other types of consumer-oriented discovery tools. This is surely pragmatic in the short term, at least.

    At the same time, the dominance of Google is not necessarily permanent. Libraries are by no means the only parties trying to approach their management of traffic sourcing and routing with greater sophistication online. Lots of parties aim to overturn its dominance. Libraries have restarted their efforts to reclaim the discovery role in recent years through the introduction of new kinds of discovery services, and over time libraries and archives may find mechanisms by which to route users to academic sources and services more systematically. Whether these efforts are likely to succeed, and whether their success matters, are worthwhile issues for consideration.

    Thank you again,

  2. Thanks Roger, good points. While banking on Google is a bad idea, I think there are probably good reasons to bank on something Google-like. That is, something that indexes the web and weights results in different ways. There are some exciting projects looking at better ways of making EAD data and finding aids more accessible, and I think they have a shot, but I still think they would do better if they started thinking about data in the very different kinds of ways that folks are now thinking about exposing and making data available. With that said, our legacy data practices and approaches shouldn’t necessarily block us from coming up with newer (and potentially more lightweight) ways of expressing and exposing data that are created with an assumption of the existence of the web, as opposed to being adapted to the web.

  3. This is a great topic to discuss, particularly given the cost and resource issues for non-tech companies trying to build first-class web experiences. Your post and Roger’s comment also remind me of a conversation I’ve been having with Ed Summers about the idea of making HTML be the “serialization” format for data – basically acknowledging that machine-only services have a strong tendency towards disuse and silent rot and instead embracing the idea of making HTML friendlier for machines, so there is no difference between what the public sees on the web and what a programmer might wish to work with. This also has some great features for routine use: even the technically unskilled have no problem simply viewing a .html file, and there is a wide range of existing tools and utilities, so nothing new needs to be created for minor changes.

    This also plays to Roger’s point: rather than tailoring the data for Google (or any other service), you’re only doing slight additional work on something you already need to provide, adding standard markup attributes that Google happens to support along with Microsoft, Yahoo, the W3C standards board, etc. I’ve done some experiments with microdata on WDL (see e.g. Google’s rich snippets tester) and am hoping to see RDFa mature as well.

  4. Yep, URLs for everything; provide structured representations via RDFa or similar; work with the web; give unto Google what is Google’s, but then think about the meanings you want to create. It’s not that hard, let’s just do it.

    And by we, I don’t mean just institutions and repositories. Every thematic collection, scholarly publication or list of sources should do the same. We don’t need portals because every publication should be a portal, every history blog a finding aid. Let’s ban PDFs and start making better links.

    And where persistent links into a collection don’t yet exist, we can build wrappers and mint our own identifiers. We don’t have to wait, nothing is wasted. Let’s just do it.

  5. Yes, Yes, Yes 🙂 Nice post Trevor. I know SEO has bad connotations for most sane people, but I think this really is largely about Search Engine Optimization. Calling it that might help people learn more about the best way to do it.

    In addition to inline metadata, I think there are some (dare I say) best practices around robots.txt files, sitemaps, what to do when things (inevitably) move, and URL canonicalization that would be useful to document as best practices for the cultural heritage sector. Maybe it would make for a nice inter-organizational report or something…

  6. I absolutely agree that Google is the portal. Apart from the other considerations given here, search engine results are going to get better at a much faster rate than organisational search capabilities – search capabilities are their key competitive advantage, and they have huge resources to throw at the problem.

    Incidentally, Google has absolutely no problem in pulling meaning out of PDF documents (assuming you don’t mean a PDF containing a digitised image).

  7. One of the most important things to recognize about Google, too, is that discovery stretches beyond encouraging search results through things like metadata. A number of open access sites that end up at the top of Google searches become the public’s main access point to other websites. Wikipedia seems to be the one that often gets overlooked in academic circles. Wikipedia nearly always tops Google searches, and people love to click through its internal and external links. Furthermore, a Wikipedia article citing a particular resource can go a long way toward public discoverability and accessibility of information. Of course, leveraging Wikipedia requires doing more than spamming the project with link spam, but clearly we have an obligation to think about it as a research tool about history, per Roy Rosenzweig: “If Wikipedia is becoming the family encyclopedia for the twenty-first century, historians probably have a professional obligation to make it as good as possible. And if every member of the Organization of American Historians devoted just one day to improving the entries in her or his areas of expertise, it would not only significantly raise the quality of Wikipedia, it would also enhance popular historical literacy.”
