For the last few months, I’ve been waist-deep in the Library of Congress’ MARC records — the digital cataloguing files that represent the institution’s vast holdings of books, maps, music, visual materials, computer files and manuscripts.
I’ve been particularly interested in seeing if I can extract data that is not immediately obvious in the records, or at least in using the data contained in the records in non-conventional ways. So far, this has meant looking at historical occurrences of first names, using career data to find polymaths, and reconstructing global migration patterns using birth & death records.
Earlier in the month, software development librarian and current LOC Labs staff member Laura Wrubel made this amazing tool to extract color palettes from the library’s image collections. It’s a nice way to get a sense of a collection from a distance, and it provides a visual way to compare groups of images that works well at scale. You can read more about the tool and how it works on LOC Labs’ Signal Blog.
I wondered if it might be possible to do something similar by extracting colors from the titles of works, rather than from images. More than that, I thought that color might be an interesting instrument to facilitate non-linear search through the library’s catalogue. As Charlie Lloyd put it in his stellar Eyeo 2015 talk: “Sometimes I worry that web surfing has gone out of style.” In his talk, Charlie points to the LOC’s Prints and Photographs Online Catalog as a place where you can still engage in the kind of meandering link-following that defined the early generations of the web, easily falling down rabbit holes like early Russian color photography, early aeronautics, or the Harlem Renaissance. I thought that color might give a curious library visitor a way to explore thousands of images in a serendipitous, if inexact, way.
Using a corpus of 954 color names crowd-sourced by xkcd (yes, that xkcd), I pulled out all of the titles that contained a reference to a color. I then sorted these colors using a Hilbert walk to create a continuous color space. Here are the resulting palettes from Visual Materials (Prints & Photographs), American Literature, Music, and Maps:
There are some strange data-processing artifacts visible here. For example, the large red band in the Maps image is exaggerated by Spanish-language titles in which the word ‘red’ (main) appears. But in large part these color palettes speak to trends in the titles themselves: the large blue region in Music, the common use of the terms ‘black’ and ‘blood’ in American literature.
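For anyone curious about the mechanics, here is a minimal Python sketch of the two steps described above: matching color names in titles, then ordering the colors along a Hilbert curve through RGB space. It is not the code from my repo. The `COLORS` dictionary is a tiny stand-in with illustrative hex values rather than the real xkcd list, the function names are made up for this example, and the curve index uses Skilling's transpose method, which is one of several ways to compute a Hilbert walk.

```python
import re

# Stand-in palette: illustrative names and hex values, NOT the real xkcd list.
COLORS = {
    "red": "#e50000",
    "blue": "#0343df",
    "sky blue": "#75bbfd",
    "black": "#000000",
    "green": "#15b01a",
}


def find_colors(title, colors=COLORS):
    """Return the color names that appear as whole words in a title."""
    found = []
    lowered = title.lower()
    # Try longer names first so "sky blue" is not also counted as "blue".
    for name in sorted(colors, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, lowered):
            found.append(name)
            lowered = re.sub(pattern, " ", lowered)
    return found


def hex_to_rgb(hex_code):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints in 0..255."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def hilbert_index(point, bits):
    """Distance along a Hilbert curve for integer coordinates.

    Each coordinate must fit in `bits` bits. Works for any number of
    dimensions (Skilling's transpose algorithm).
    """
    x = list(point)
    n = len(x)
    top = 1 << (bits - 1)
    # Undo the rotations and reflections applied at each level.
    q = top
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p  # invert the low bits of the first axis
            else:
                t = (x[0] ^ x[i]) & p  # swap low bits of axis 0 and axis i
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    # Gray-encode across the axes.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = top
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    x = [v ^ t for v in x]
    # Interleave the bits, most significant first, into a single integer.
    index = 0
    for bit in range(bits - 1, -1, -1):
        for v in x:
            index = (index << 1) | ((v >> bit) & 1)
    return index


def sort_by_hilbert(names, colors=COLORS):
    """Order color names by their position along the RGB Hilbert curve."""
    return sorted(names, key=lambda n: hilbert_index(hex_to_rgb(colors[n]), 8))
```

Calling `find_colors("The Red Book of Sky Blue Maps")` picks up both `sky blue` and `red`, and `sort_by_hilbert` then places neighboring colors near each other in the final strip. Note that whole-word matching does nothing about the Spanish ‘red’ problem: a title like “Mapa de la red” still counts as red unless titles are also filtered by language.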
You can explore these ‘search palettes’ using this interactive tool, which I built using Glitch. If you want to see the source code, or if you are interested in remixing the tool, you can click on the Glitch Fish at the top right of the window. You can also find the code I used to process the MARC files and extract the colors in my LOC GitHub repo.
I’ll be talking a bit more about this tool, and about some other concepts for non-linear search, in Episode 3 of my podcast Artist in The Archive, which will be released later this week. You can subscribe to it on iTunes, or wherever else fine podcasts are served (the feed is also available here).