Mass Digitization, Archives, and a Multiplicity of Orders & Arrangements

Quick, drop everything and read All Text Considered: A Perspective on Mass Digitizing and Archival Processing. It helped me think through some of what I was getting into in Implications for Digital Collections Given Historian’s Research Practices.

The abstract of the paper does a great job of explaining its objective: “coupling robust collection-level descriptions to mass digitization and optical character recognition to provide full-text search of unprocessed and backlogged modern collections, bypassing archival processing and the creation of finding aids.” The key point in the piece is that it is becoming plausible to see digitization costs as being on par with the actual processing costs of a collection. You can read this as an even more extreme take on MPLP (More Product, Less Process), where digitization would potentially replace a significant part of the processing itself. That is exciting and intriguing for a number of reasons, one of which is as a prompt for thinking through a different kind of future for archival description and access.

The possibility of actual original order and a multiplicity of orders

Most archival original order ends up being its own kind of new order. So if and when you do get around to doing some form of arrangement, it can be strictly intellectual arrangement; you do it without actually moving anything. That is, if you still wanted to do processing you could do it on the digital files and then provide any number of different identifiers that resolve to those files. In essence, information about original order and any further arrangement would be demoted from the central organizing factor to a relevant and important piece of metadata alongside all the other metadata. So you have both the order things came in and the order the archivist worked out after processing.

In many cases one would likely do some coarse level of weeding and deaccessioning before digitizing, but once material is digitized a processing archivist could further decide which of the scanned files should be kept and what the permissions for viewing the images should be. From there, you just set different permission tiers: say, onsite access, reading room only access, dark archive for x years, complete public access. You could then work from a blacklist/whitelist approach to whatever level of granularity an archive decided to process a given collection at. Not to mention, with OCRable archival material, the OCR itself could be used to set up heuristics for what kinds of materials to show to what users in what circumstances, as sketched just below.
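To make that concrete, here is a minimal sketch in Python of what such a per-image permission model might look like, with blacklist/whitelist overrides on top of a collection-wide default and one crude OCR heuristic. The tiers, image IDs, and the pattern being flagged are all hypothetical illustrations, not a description of any existing system.

```python
import re
from enum import Enum

class Access(Enum):
    """Hypothetical access tiers an archive might assign per scanned image."""
    PUBLIC = "complete public access"
    READING_ROOM = "reading room only access"
    ONSITE = "onsite access"
    DARK = "dark archive for x years"

# Default the whole collection to one tier, then override per item with a
# whitelist of images cleared for the open web and a blacklist of images
# restricted after review. The IDs here are made up.
DEFAULT_TIER = Access.ONSITE
WHITELIST = {"image_0042", "image_0043"}
BLACKLIST = {"image_0107"}

def tier_for(image_id: str, ocr_text: str = "") -> Access:
    """Resolve one image's access tier from the lists, falling back to a
    crude OCR heuristic (here, a pattern that looks like a US SSN)."""
    if image_id in BLACKLIST:
        return Access.DARK
    if image_id in WHITELIST:
        return Access.PUBLIC
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", ocr_text):
        return Access.READING_ROOM
    return DEFAULT_TIER
```

The design choice that matters here is that access decisions hang off individual image identifiers rather than off physical containers, so they can be revised at any level of granularity without touching the materials.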

The container list for an archive enforces a single linear hierarchy on the contents of the archive. Each sheet of paper can only be in one folder, in one box, in one series.

Linked Open Description

The Herbert A. Philbrick Papers in unprocessed form. Manuscript Division, Library of Congress

If the archive just commits to minting a URL structure, then this process opens an exciting new future for description. That is, if every image has a URL, and the folder and collection are named in the URL (e.g., http://institution.org/division/collection/series/box/folder/image), then you (or anyone else for that matter) can create a range of descriptions and relationships for those digitized objects. If something comes in substantial disorder, like the Herbert A. Philbrick Papers, many of which came in the trash cans pictured here, then you just make a directory for the trash can and number the images based on the order you pull them out of the can. When you do go ahead and arrange the scans, you can do so while retaining the order they were pulled out of the trash can as a parallel set of persistent metadata elements.
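As a rough sketch of how such minting might work, assuming the hypothetical URL pattern above (no real endpoint or existing API is implied), each scan gets a URL plus its intake order recorded as metadata:

```python
# Base URL follows the hypothetical pattern above; nothing here is a real
# endpoint or an existing service.
BASE = "http://institution.org/manuscripts/philbrick"

def mint(container: str, sequence: int) -> dict:
    """Assign a persistent URL plus intake-order metadata to one scan."""
    image_id = f"image_{sequence:04d}"
    return {
        "url": f"{BASE}/{container}/{image_id}",
        "intake_order": sequence,   # the order it came out of the trash can
        "arranged_order": None,     # assigned later, if an archivist arranges
    }

# Scans from one trash can, numbered as they are pulled out.
records = [mint("trashcan_01", i) for i in range(1, 4)]

# Later intellectual arrangement only annotates; the URL and the intake
# order persist unchanged.
records[2]["arranged_order"] = 1
```

Arrangement then becomes annotation: the identifier and the pull order never change, however many times the intellectual arrangement is revised.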

The net result is that you are no longer limited by the fact that one atom is stuck in one spot. You just index the content in as many ways as you like. Much like the chaotic storage principles at the heart of how Amazon organizes its warehouses, you use the logic, structure, and order of the database to transform the order of physical materials into something akin to the random access nature of a hard drive. The result:

  1. You get the benefit of not being limited by the fact that a thing can only be in one place at a time.
  2. You are also not limited to one linear/narrative/sequential way to find things.
  3. Anyone inside or outside an organization can then set up in-house or third-party services to let stewards/curators add any level of description to any arbitrary set of images. That is, internal and external agents could provide distinct data to organize and structure collection content, which the institution could choose to harvest and display to the extent they were interested. Since you are actually minting URLs, you could then start to watch inbound links to your items from things like citations and pull those links in as a kind of descriptive trackback. (A sketch of what such parallel arrangements might look like follows the figure below.)
If everything is digitized and each image is given an ID, then any number of different modes of arrangement could be minted and maintained referencing the images, making the collection function much more like this distributed network. The Network by @nancywhite, CC-BY
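Here is a minimal sketch of what such parallel orders might look like, with hypothetical image IDs standing in for minted URLs; each arrangement is just another ordered list referencing the same items, with a note about who asserted it:

```python
# Hypothetical image IDs, standing in for minted URLs.
images = ["img_001", "img_002", "img_003", "img_004"]

# Each arrangement is an ordered list of IDs plus provenance for who
# asserted it. None of them moves anything; any agent can add another.
arrangements = {
    "original_order":        {"by": "intake",
                              "sequence": images},
    "archivist_series":      {"by": "processing archivist",
                              "sequence": ["img_003", "img_001", "img_004", "img_002"]},
    "researcher_chronology": {"by": "external researcher",
                              "sequence": ["img_002", "img_004", "img_001", "img_003"]},
}

def positions(image_id: str) -> dict:
    """Where does one image sit in every order at once?"""
    return {name: arr["sequence"].index(image_id)
            for name, arr in arrangements.items()}

print(positions("img_004"))
# {'original_order': 3, 'archivist_series': 2, 'researcher_chronology': 1}
```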

Paralyzing or Paralleling Workflows for Archives

I think this could also help to break up much of the serial nature of workflows for cultural heritage organizations. That is, if you digitize everything and give the images persistent URLs that mean something, then you could have any number of processes, like arrangement, description, OCR, and even automated description techniques like topic modeling, run against your materials in a much more parallel fashion. If we started giving persistent URLs to these images at the beginning of our workflows instead of at the end, we could reap the benefit of running any number of jobs and processes against them simultaneously. Furthermore, these could happen on a rolling basis; that is, you wouldn't need to wait for any one process to finish before moving on to another. I wrote a bit about this idea in Paralyzing or Paralleling Workflows for THATCamp Leadership, and a lot of these ideas came up and were discussed at CurateCamp Processing: Processing Data/Processing Collections.
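As a small illustrative sketch, assuming images already have persistent URLs and using placeholder functions in place of real OCR, description, and topic modeling tools, the parallel fan-out might look like this:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical persistent URLs, minted at the start of the workflow.
image_urls = [f"http://institution.org/collection/image_{i:04d}"
              for i in range(1, 6)]

# Placeholder jobs standing in for actual tools; each takes a URL and
# returns (url, job name, result) so results can be attached to the URL.
def ocr(url):         return (url, "ocr", "...extracted text...")
def describe(url):    return (url, "description", "...draft scope note...")
def topic_model(url): return (url, "topics", ["hypothetical", "topics"])

jobs = [ocr, describe, topic_model]

# Every (job, image) pair runs independently of every other, so nothing
# waits on an upstream step and results arrive on a rolling basis.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(job, url) for job in jobs for url in image_urls]
    results = [f.result() for f in futures]
```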

All Kinds of Cans of Worms Opened

All Text Considered: A Perspective on Mass Digitizing and Archival Processing opens all kinds of different cans of worms. For some kinds of materials, the prospect of digitization and OCR could make material accessible in short order. With that said, it throws open the doors on figuring out what exactly intellectual control means in those circumstances, what kind of further processing and arrangement one would want to do, and how to go about integrating the automated techniques for summarizing and describing content that an archivist might use to complement and extend their efforts to make an archive's structure legible to its users.

I'd love to hear your reactions to some of my provocations here, along with any other thoughts and reflections the essay prompts, in the comments.

Thanks to Jefferson Bailey, Thomas Padilla, and Ed Summers for comments on a draft of this post. They each had some great ideas and input. I hope they’ll bring some of their more extended comments into the comments here.

Responses

  1. Kevin Schlottmann

    One outcome of a “digitize first, process/arrange the data/images later as needed or desired” approach (especially if it is completely transparent, though that’s not historically a strength of the archival profession — but we can change) is that users of archives, historians in particular, would be forced to confront the archival interventions that mediate their encounter with primary sources. And an inkling of this, what Jefferson’s linked piece imagines as the change in the primal archival scene, is already here: in the same issue of the American Archivist, a survey of how historians use digitized material muses that “Indeed, the possibility that certain materials are omitted from an online collection appears to be more of a concern than it is in person in the archives. The appraisal process, in other words, seems to be more transparent to online users.” (Historians and the Use of Primary Source Materials in the Digital Age by Alexandra Chassanoff) I hope that the manipulation of digitized material posited here would then prepare archivist and historian alike to consider born-digital primary sources on their own terms, such as the near-miracle of not being “limited by the fact that a thing can only be in one place at a time.”

  2. Jefferson Bailey

    Nice post! [And thanks for the link]. I also like (and agree with) the Miller article for many reasons: its prepossessing tone while offering a rather radical rethinking of (what many consider) a core archival function; its metrics-driven argumentation; its encouragement to think beyond the precepts of processing and the finding aid; and its advocacy for working backwards from current methods of user search/discovery and user/donor expectations instead of taking pre-existing practices and trying to make them “web-friendly.” And that’s just to name a few. Other themes, like the fair use argument and the potential integration of digitized & born-digital materials, could easily spin off into their own articles too.

    For me, a couple of key quotes emerge from her piece: “… finding aids are out of sync with how many users search,” “[users] could choose to search across all archival collections or within a single collection,” and “we can partner with computer programmers on optimizing search algorithms…” If FAs feel clunky to users, if fonds and order (original or processed) are no longer prioritized elements in representation and discovery, and if algorithmic tools can empower a more familiar and effective means of querying /discovery/navigation/what-have-you, then change seems inevitable.

    The parallelism and user-contribution ideas you mention are interesting ideas. Topic models, frequency counts, word trees, percentage scores of doc/folder/box words against collection-level subject/access/authority terms, NLP/semantic-driven searches for “emotions” or “transactional functions” or something, n-gram stuff — all seem to have potential for how users interact with archival collections; and allowing users to help refine/improve the algorithms certainly adds a level of dynamism currently missing. Easy to get all brainstormy here (and easier said than done), but Miller provides a good argument and blueprint for moving from idea to practice.

    Your point about URLs is key in this process too. Kevin also makes a good point that has cropped up at a number of recent panels — i.e. anxiety about the transparency of selection for digitization — that would be alleviated (hypothetically) by this approach. Much to ponder and discuss in Miller’s paper and in your post — and my comment is too long already!
