The following is a rough transcript of a talk I gave at Fostering the Transatlantic Dialogue on Digital Heritage and EU Research Infrastructures: Initiatives and Solutions in the USA and in Italy, held at the Library of Congress in November 2014 (back when I worked there).
As scholars become increasingly interested in approaching digital collections and digital objects as data for computational analysis, it becomes critical for libraries, archives and museums to rethink some of their paradigms for providing access to materials. Two related concepts from emergent methodologies in the digital humanities, macroscopes and distant reading, provide a point of entry for identifying what digital library platforms need in order to support this kind of scholarship.
Josh Greenberg of the Sloan Foundation described the concept of macroscopes this way: “Telescopes let you see far, microscopes let you see small, a macroscope lets you see big and complex.” That is, it’s about zooming out to visualize and explore relationships and patterns in aggregates and networks. Relatedly, literary scholar Franco Moretti famously coined the term “distant reading” to describe similar kinds of activities. In contrast to close reading, distant reading involves studying trends and patterns in things like graphs, maps and tree diagrams of features of texts. These two neologisms are part of a common trend, a push by scholars to make use of tools to explore and interpret patterns in wholes.
Parts and Wholes: Objects, Items, Aggregates, Collections
By and large, the web has been great for the item and the object in cultural heritage organizations. In hypermedia, every resource is the first resource; every item’s URL is potentially the front door to everything else. As far as Google’s search algorithms are concerned, the page for each of the thousand individual items in a collection is as important as the page about the collection they form part of. This non-hierarchical and rhizomic nature of the web, and of much of digital media more broadly, has been a bit disconcerting to librarians and archivists long committed to the coherence of collections and the importance of the context of fonds.
To this end, the growing interest in macroscopes and distant reading offers a shift in approach to interpretation and analysis that could better respect the value that comes from aggregates. That is, the parts in the whole of a particular archive or collection and their relationship to each other. Importantly, this makes it all the more critical that the structure and completeness of any given archive or collection are front and center for analysis. That is, the pattern in any distant reading of an archive is as much a map of relationships in the content as it is a map of the processes by which records were created, appraised, selected, and organized.
Three Examples for Going Forward
Data Dumps: In the emerging literature on historians’ use of digital collections for data analysis, a common theme is that they try, as quickly as possible, to download data and take it away to use in their own tools on their own systems. Ian Milligan, who works with web archives, has referred to this as “Looking for the big red button.” To this end, whenever possible, the best first step for systems to support this kind of scholarly use is to provide easy ways for someone to export aggregate data. With this noted, for particularly large sets of data, or data which is restricted to certain kinds of use, it’s likely a good idea to also provide smaller sample sets of data.
With this said, it is important to note that data dumps are not the bulk access silver bullet that one might hope for, for three reasons: rights, scale and the skills necessary to make use of them. In terms of rights, many collections, particularly of modern materials, come with rights restrictions that make it impossible to provide direct downloads of full content. In terms of scale, while it is possible to allow someone to download increasingly large sets of data, there are still aggregates of data that require significant resources to provide access to. Importantly, in many humanities cases this kind of analysis is still possible at scales that are modest in comparison to the requirements that scientists have for working with data sets. Lastly, there is a significant skills gap around working with “raw” data. That is, of the possible field of users of a data set in the humanities, only a rather small community has the necessary chops to work at the command line, iron out issues, and process collection and object data into computable information. With that said, there are a range of ongoing projects and initiatives focused on bootstrapping humanities scholars into the competencies required to do this kind of work. To this end, there are two other primary methods for working around these three limitations that I think are promising in a variety of ways.
Sandboxes & Multi-Purpose and Purpose Built Platforms: A tool like the Bookworm, the software that powers the Google Books N-Gram viewer, illustrates the potential for two related approaches to enable scholars with limited command line chops to engage in this kind of analysis. Set up against the derived set of n-grams, a derivative data product created from the Google Books corpus which records the frequency of sequences of words in that corpus, the viewer lets a user search for terms and compare their relative frequency over time. By producing a derivative data set, the n-grams, they have sidestepped the rights issues that would have arisen if they had provided raw full text access to the underlying works. To this end, the n-grams can themselves be downloaded and used with other tools. Along with that, the Bookworm platform provides a way for scholars who do not have any command line expertise to make use of the data.
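To make the idea of a derivative data product a bit more concrete, here is a minimal sketch in Python of how per-year n-gram counts might be derived from full text. It is not how the Google Books n-grams were actually produced; the function names, the simple tokenizer and the two-document toy corpus are all assumptions made up for illustration. The point is only that the counts, not the texts, are what would be published.

```python
from collections import Counter, defaultdict
import re


def ngrams(text, n):
    """Return the n-word sequences in text, lowercased."""
    # Deliberately naive tokenizer: runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts_by_year(documents, n=2):
    """Map each year to a Counter of n-gram frequencies.

    documents is an iterable of (year, full_text) pairs. Only the
    counts are returned, never the text itself.
    """
    counts = defaultdict(Counter)
    for year, text in documents:
        counts[year].update(ngrams(text, n))
    return counts


def relative_frequency(counts, phrase):
    """Share of each year's n-grams that match phrase."""
    return {
        year: (c[phrase] / sum(c.values()) if c else 0.0)
        for year, c in sorted(counts.items())
    }


# Hypothetical toy corpus standing in for a rights-restricted collection.
corpus = [
    (1900, "the steam engine and the steam ship"),
    (1950, "the jet engine and the jet age"),
]
counts = ngram_counts_by_year(corpus, n=2)
trend = relative_frequency(counts, "steam engine")
```

Here `trend` maps each year to the share of that year's bigrams that are "steam engine", which is roughly the kind of series an n-gram viewer plots. A real pipeline would need far more careful tokenization, OCR cleanup and thresholds for dropping very rare n-grams.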
There are a range of tools and platforms that I would put in this category; for example, this is the kind of thing that the HathiTrust Research Center is working to support. With this noted, it is important to recognize the limitations of these kinds of purpose built tools. In cases where one does not provide the data product underlying the tool, there are clear limits to what scholars can do with the underlying data. Furthermore, the reason that the Google n-gram viewer works is that considerable work was put into the preparation of the underlying dataset. In contrast, many digital collections are a bit of a mess, so a researcher who wants to do sophisticated computational work with them would likely need to engage in this kind of data cleanup and processing to get materials into a form fit for analysis.
Analysis as a Service and Onsite Research Facilities: Something like the National Software Reference Library, a project of the US National Institute of Standards and Technology, models a third example of supporting this kind of computational work. The NSRL provides an onsite research environment where researchers can come in to engage in computational analysis of the tens of millions of files from commercial software in the collection. Staff in this research environment can also run algorithms submitted by remote researchers and provide them with the outputs and results. In this case, for a collection of materials at an organization with particularly high concerns about limiting access to the corpus, creating an onsite research space and setting up staff to run the jobs that researchers around the world create provides a solution that ensures that rights are protected while computational scholarship is enabled. In this case, the significant limitations are the resources required to stand up and staff such a research center and the fact that the process is much less immediate than manipulating some platform or interface on the web or directly downloading data.
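As a rough illustration of the "give us your algorithm, we give you the outputs" model, here is a minimal Python sketch. It does not describe how the NSRL actually operates; `run_submitted_job`, the `max_keys` limit and the example job are hypothetical names and choices of my own. The idea it shows is that a researcher's function sees each file, but only summed tallies ever leave the building.

```python
from typing import Callable, Dict, Iterable


def run_submitted_job(
    corpus: Iterable[bytes],
    analyze: Callable[[bytes], Dict[str, int]],
    max_keys: int = 1000,
) -> Dict[str, int]:
    """Run a researcher's per-file function and return summed tallies only."""
    totals: Dict[str, int] = {}
    for blob in corpus:
        for key, value in analyze(blob).items():
            totals[key] = totals.get(key, 0) + int(value)
    # Crude release check (an assumption): refuse very large outputs,
    # which could be a way of smuggling out the content itself.
    if len(totals) > max_keys:
        raise ValueError("output too large to release without review")
    return totals


# Hypothetical researcher-supplied job: tally files by their first two bytes.
def magic_bytes(blob: bytes) -> Dict[str, int]:
    return {blob[:2].hex(): 1}


# Stand-in for files that cannot be handed to the researcher directly.
restricted_files = [b"MZ\x90\x00", b"MZ\x00\x00", b"PK\x03\x04"]
report = run_submitted_job(restricted_files, magic_bytes)
```

In practice staff would also need to review submitted code and run it in an isolated environment; this sketch leaves all of that out.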
- Whenever possible, move toward providing bulk access to data. That means, ideally, exploring ways to offer downloads of arbitrary aggregates of both metadata and digital objects. Given that some of these aggregates could be massive in size, it is likely best to explore ways to queue large requests up and use things like BitTorrent as a way to limit the resources they would consume. Provide persistent identifiers for those aggregates to enable dataset citation.
- Consider deriving intermediary or transformative data products, like n-grams, in cases where one cannot provide access directly to works, and explore ways to create purpose built tools, like the Google Books N-Gram viewer, that can be deployed to enable exploratory analysis of intermediary products.
- In cases with particularly thorny rights situations, consider establishing in-house services whereby researchers can give you their algorithms and you run them against a corpus and provide the outputs back to them.
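For the first recommendation, one simple way to make an arbitrary aggregate citable is to publish a manifest alongside it. The Python sketch below is only an assumption about what that could look like: `build_manifest` and the item identifiers are hypothetical, and the content-derived `dataset_id` is a fingerprint rather than a registered persistent identifier such as a DOI or ARK, which an institution would mint separately.

```python
import hashlib
import json


def build_manifest(items):
    """Describe an aggregate of digital objects.

    items maps an item identifier to the bytes of that object.
    Returns a dict with one entry per item plus a dataset_id derived
    from the sorted entries, so the same aggregate always yields the
    same identifier.
    """
    entries = [
        {
            "id": item_id,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
        for item_id, data in sorted(items.items())
    ]
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    dataset_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return {"dataset_id": dataset_id, "entries": entries}


# Hypothetical aggregate a researcher selected for download.
selection = {"item-002": b"second object", "item-001": b"first object"}
manifest = build_manifest(selection)
```

Two researchers who request the same set of items get the same `dataset_id`, and the per-item checksums let anyone verify a large download that arrived in pieces.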