Quick, drop everything and read All Text Considered: A Perspective on Mass Digitizing and Archival Processing. It helped me think through some of what I was getting into in Implications for Digital Collections Given Historian’s Research Practices.
The abstract of the paper does a great job of explaining its objective: “coupling robust collection-level descriptions to mass digitization and optical character recognition to provide full-text search of unprocessed and backlogged modern collections, bypassing archival processing and the creation of finding aids.” The key point in the piece is that it’s becoming plausible to see digitization costs as being on par with the actual processing costs of a collection. You can read this as an even more extreme take on MPLP, where digitization would potentially replace a significant part of processing itself. That is exciting and intriguing for a number of reasons, one of which is that it prompts thinking through a different kind of future for archival description and access.
The possibility of actual original order and a multiplicity of orders
Most archival original order ends up being its own kind of new order. So, in this model, if and when you do get around to some form of arrangement, it is strictly intellectual arrangement: you do it without actually moving anything. That is, if you still wanted to do processing, you could do it on the digital files and then provide any number of different identifiers that resolve to those files. In essence, the information about original order and any further arrangement would be demoted from the central organizing factor to a relevant and important piece of metadata alongside any other metadata. So you have the order things came in and the order the archivist worked out after processing. One would likely do some coarse weeding and deaccessioning in many cases before digitizing, but once the material is digitized a processing archivist could further decide which of the scanned files should be kept and what the permissions for viewing the images are. From there, you just set different permissions: say, onsite access, reading-room-only access, dark archive for x years, or complete public access. You could then work from a blacklist/whitelist approach at whatever level of granularity an archive decided to process a given collection. Not to mention, with OCRable archival material the OCR itself could be used to set up heuristics for what kinds of materials to show to which users in which circumstances.
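To make the permissions idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the four access levels are just the ones listed above, and the rule table, path names, and the access_for function are hypothetical. The idea is that a rule can be attached at any level (collection, box, or single image) and the most specific rule wins.

```python
from enum import Enum


class Access(Enum):
    DARK = "dark archive"
    READING_ROOM = "reading room only"
    ONSITE = "onsite"
    PUBLIC = "public"


# Hypothetical rules: a path prefix mapped to an access level.
# An archive could set these at whatever granularity it processed to.
RULES = {
    "collection": Access.READING_ROOM,                    # whole collection
    "collection/series1/box2": Access.PUBLIC,             # one cleared box
    "collection/series1/box2/folder3/0007": Access.DARK,  # one restricted image
}


def access_for(path, rules=RULES, default=Access.DARK):
    """Return the access level of the most specific rule matching path."""
    parts = path.strip("/").split("/")
    # Try the full path first, then each shorter prefix in turn.
    for end in range(len(parts), 0, -1):
        level = rules.get("/".join(parts[:end]))
        if level is not None:
            return level
    # Nothing matched: fall back to the most restrictive level.
    return default
```

Defaulting to the dark archive when no rule matches is a design choice in this sketch, not something the paper prescribes; an archive with a different risk posture could default the other way.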
Linked Open Description
If the archive just commits to minting a URL structure, then this process opens an exciting new future for description. That is, if every image has a URL, and the folder and collection are named in the URL (e.g., http://institution.org/division/collection/series/box/folder/image), then you (or anyone else, for that matter) can create a range of descriptions of, and relationships between, those digitized objects. If something comes in substantial disorder, like the Herbert A. Philbrick Papers, many of which came in the trash cans pictured here, then you just make a directory for the trash can and number the images in the order you pull them out of the can. When you do go ahead and arrange the scans, you can do so while retaining the order they were pulled out of the trash can as a parallel, persistent piece of metadata.
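Here is a rough sketch of what keeping both orders might look like, again in Python. The base URL follows the example pattern above, and the container name, the series names, and the mint_url and describe functions are all made up for illustration. The point is only that an arrangement is a new sequence of the same URLs, so nothing gets renamed or moved.

```python
# Hypothetical base, following the example URL pattern in the text.
BASE = "http://institution.org/division/collection"


def mint_url(container, sequence):
    """Mint a URL from a container name and a running image number."""
    return f"{BASE}/{container}/{sequence:04d}"


# Order of receipt: images numbered as they came out of the trash can.
received = [mint_url("trashcan-1", n) for n in range(1, 5)]

# A later intellectual arrangement (made-up series names): the same URLs
# in a different sequence. The URLs themselves never change.
arranged = {
    "correspondence": [received[2], received[0]],
    "clippings": [received[3], received[1]],
}


def describe(url):
    """Report both orderings for one image as parallel metadata."""
    record = {
        "url": url,
        "received_position": received.index(url) + 1,
        "series": None,
        "series_position": None,
    }
    for series, urls in arranged.items():
        if url in urls:
            record["series"] = series
            record["series_position"] = urls.index(url) + 1
    return record
```

Calling describe on any image returns its place in the trash can and its place in the archivist's arrangement side by side, which is all "demoting order to metadata" amounts to in practice.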
The net result is that you are no longer limited by the fact that one atom is stuck in one spot. You just index the content in as many ways as you like. Much like the chaotic storage principles at the heart of how Amazon’s warehouses are organized, you use the logic, structure, and order of the database to transform the order of physical materials into something akin to the random-access nature of a hard drive. The result:
- You get the benefit of not being limited by the fact that a thing can only be in one place at a time.
- You are also not limited to one linear/narrative/sequential way to find things.
- Anyone inside or outside an organization can then set up in-house or third-party services to let stewards/curators add any level of description to any arbitrary set of images. That is, internal and external agents could provide distinct data to organize and structure collection content, which the institution could choose to harvest and display to the extent it was interested. Since you are actually minting URLs, you could then start to watch inbound links to your items from things like citations and pull those links in as a kind of descriptive trackback.
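The last point above can be sketched as a small store of descriptions keyed by image URL. This is a hypothetical illustration, not a description of any existing service: the DescriptionStore class, the source labels, and the example URLs are all assumptions. Each statement about an image records who made it, and the institution harvests only from the sources it chooses. An inbound citation is treated as just another description whose text is the citing URL.

```python
class DescriptionStore:
    """Descriptions from many agents, keyed by the image URL they describe."""

    def __init__(self):
        self._by_url = {}

    def add(self, url, source, text):
        """Record a description of url contributed by source."""
        self._by_url.setdefault(url, []).append({"source": source, "text": text})

    def harvest(self, url, trusted_sources):
        """Return only the descriptions from sources the institution accepts."""
        return [
            d for d in self._by_url.get(url, [])
            if d["source"] in trusted_sources
        ]


# Example URLs and source names below are placeholders.
image = "http://institution.org/division/collection/series/box/folder/0001"
store = DescriptionStore()
store.add(image, "staff-archivist", "Typed letter, two pages")
store.add(image, "external-project", "Mentions a 1951 meeting")
# An inbound citation pulled in as a kind of descriptive trackback.
store.add(image, "trackback", "http://example.org/article-citing-this-image")

displayed = store.harvest(image, {"staff-archivist", "trackback"})
```

Keeping the source on every record is what lets the institution change its mind later about whose descriptions to display without throwing anything away.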
Paralyzing or Paralleling Workflows for Archives
I think this could also help to break up much of the serial nature of workflows for cultural heritage orgs. That is, if you digitize everything and give each image a persistent, meaningful URL, then you could run any number of processes, like arrangement, description, OCR, and even automated description such as topic modeling, against your materials in a much more parallel fashion. If we started giving persistent URLs to these images at the beginning of our workflows instead of at the end, we could reap the benefit of running any number of jobs and processes against them simultaneously. Furthermore, these could happen on a rolling basis; that is, you wouldn’t need to wait for any one process to finish before moving on to another. I wrote a bit about this idea in Paralyzing or Paralleling Workflows for THATcamp leadership, and a lot of these ideas came up and were discussed at CurateCamp Processing: Processing Data/Processing Collections.
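As a toy illustration of the parallel idea, here is a sketch using Python's standard concurrent.futures module. The three job functions are placeholders I made up (a real OCR or topic-modeling step would do actual work), and run_all is a hypothetical name. What matters is that every job takes only a URL as input, so none of them has to wait on another.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder jobs: each needs only the image URL, so they are independent.
def run_ocr(url):
    return f"ocr placeholder for {url}"


def run_topic_model(url):
    return f"topic placeholder for {url}"


def queue_for_arrangement(url):
    return f"arrangement placeholder for {url}"


JOBS = {
    "ocr": run_ocr,
    "topics": run_topic_model,
    "arrangement": queue_for_arrangement,
}


def run_all(urls, jobs=JOBS):
    """Submit every job for every URL at once and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            (name, url): pool.submit(job, url)
            for name, job in jobs.items()
            for url in urls
        }
        # Results are keyed by (job name, URL), whatever order they finish in.
        return {key: future.result() for key, future in futures.items()}
```

In a real archive these jobs would more likely be separate services or batch queues than threads in one program, but the dependency structure is the same: the URL comes first, and everything else hangs off it.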
All Kinds of Cans of Worms Opened
All Text Considered: A Perspective on Mass Digitizing and Archival Processing opens all kinds of different cans of worms. For some kinds of materials, the prospect of digitization and OCR could make material accessible much sooner. With that said, it throws open the doors on several questions: what exactly intellectual control means in those circumstances, what kind of further processing and arrangement one would want to do, and how an archivist might integrate automated techniques for summarizing and describing content to complement and extend their efforts to make an archive’s structure legible to its users.
I’d love to hear your reactions to some of my provocations here, and any other thoughts and reflections the essay prompts, in the comments.
Thanks to Jefferson Bailey, Thomas Padilla, and Ed Summers for comments on a draft of this post. They each had some great ideas and input. I hope they’ll bring some of their more extended comments into the comments here.