Transcript: Artist in the Archive Episode 6 — Be The Bit
Here’s a transcript from the latest episode of Artist in the Archive. You can listen to the full episode here.
David Brunton: 412,175,852 JPEG, 42,324,945 PDF, 11,854,378 TXT, 2,207,725 Word docs, 1,042,624 Excel files, 11,041,834 SWF, 6,722,968 MP4, 1,165,842 MPEG, 1,336,326 GZIP, 323,498 PowerPoint.
Jer Thorp: A year ago in the first episode of Artist in the Archive, I stood in front of one of the library’s huge old card catalogs in the basement of the Madison Building. I marveled both at the size of the catalog itself, I mean it’s a gigantic piece of furniture, and the scale of the library’s holdings. The number is a bit fuzzy and it’s always changing, but the total number of physical objects sits around 165 million.
Jer Thorp: What you just heard was a listing of some much bigger totals courtesy of David Brunton who we’ll hear more from later. 487,693,824 files, and that’s only a small slice of the whole, and that was three weeks before we recorded this in October of 2018. In the time since, many, many, many million more files have streamed in through the door.
Jer Thorp: The file size of all those files and the storage space required to hold them doubles every 32 months. I’m pleased to tell you that that number is going to increase again by one at some point in the near future as this very podcast that you are listening to right now, yes, these very words, these very bits will soon be added to the collection. That’s right, it’s Artist in the Archive in the collection. It’s all very meta.
Jer Thorp: In episode three, our bureaucracy episode, we followed the path of a book as it made its way from the library’s cavernous loading bays to its miles, and miles, and miles, and miles of bookshelves. Today, instead of being the book, we’re going to be the bit. We’re going to explore how digital things like say, I don’t know, a podcast make their way into the library.
Jer Thorp: We’ll talk to some of the library’s many digital specialists about how they do their jobs.
Jacque Brellenthin: Making a catalog record for a podcast.
Trevor Owens: Forming the electronic copyright deposit side of this work.
Ted Westervelt: To help with acquisitions, to help with planning, and to help bring our policies together.
Jer Thorp: And how they wrestle with some of the particular weirdness of things not made out of paper, but stored in tiny electronic signals. I’m Jer Thorp and this is Artist in the Archive. Now, let’s get into all those bits and bytes.
Ted Westervelt: I’m Ted Westervelt. I manage eDeposit for Library Services.
Jacque Brellenthin: I’m Jacque Brellenthin, and I’m a cataloger in the US Arts Humanities Serials Division.
David Brunton: I’m David Brunton, and I work in The Office of the Chief Information Officer. My title is chief of platform services.
Trevor Owens: I’m Trevor Owens, and I’m the head of digital content management in Library Services at the Library of Congress.
Jer Thorp: Here’s what I want to do, because I think those introductions were good, but they tell us nothing. I’m going to get you to tell me what somebody else’s job is.
Jacque Brellenthin: Well, I could start because mine’s a little bit easy. I actually work for Ted. He’s my boss, and he asked me to try my hand at making a catalog record for a podcast, which I had never done before, so it required a lot of research. He let me have the reins, and helps me carve my way here at the library.
Trevor Owens: Well, I can give a different take on Ted. It’s like it’s-
Jer Thorp: Trevor’s going to give a different take on-
Trevor Owens: Six takes on Ted.
Jer Thorp: Six takes on Ted, the podcast.
Trevor Owens: We’re going to go different aspects of Ted that are significant, in that you’re a section head in a unit that does cataloging at the Library of Congress in the US part of catalogs. Ted does manage a team of folks that do cataloging in serials, but then aside from that also has been a key player in forming the electronic copyright deposit side of this work.
Trevor Owens: In part, because Ted’s division a key focus is actually on material that comes in through copyright deposit, that’s increasingly for serials has been a key spot for e-content and so that’s now almost, I think, it’s like eight years or something like that of e-serials coming in through copyright deposit. In that vein, you’ve been involved in a lot of the planning and infrastructure for how that will work.
Trevor Owens: Then, also, that spans out into all these other areas of content for electronic copyright deposit.
Jer Thorp: Trevor, what does David do?
Trevor Owens: What does David do? The core starting point for David, David is the only one of the four of us who is not in Library Services. David is in The Office of the Chief Information Officer, and in that vein he’s part of thinking about infrastructure, systems, and IT. While we tended to think in terms of the stuff and content, how it gets moved around, David is an expert in thinking about infrastructure, and engineering, and systems, platform services.
Trevor Owens: The name of David’s bucket is the frontline of the systems that we depend on, and putting them together.
David Brunton: Jacqui has a MARC record sitting in front of her. When new materials come into the library, they land on the desk of someone like Jacqui, who does the work to make those collections discoverable to patrons. That is what a MARC record ultimately is really for. The cataloging of, for instance, a podcast is a relatively new practice here.
David Brunton: In addition to making it discoverable, Jacqui has to think about how other podcasts are going to be cataloged and what are the general rules for the cataloging of a podcast, and how will other catalogers interact with this process as I’ve defined it.
David Brunton: There are two parts of it that come in. One is, specifically how will this item be found? Two is, what are the general-purpose rules for doing cataloging of this kind of thing? Who gets to introduce Trevor?
Jer Thorp: Ted shall introduce Trevor.
Ted Westervelt: Trevor currently manages the Digital Content Management section, which was newly established to do a number of different things, one of which is to manage the digital general collections, but also to help with acquisitions, to help with planning, and to help bring our policies together, because they’re fairly well scattered hither, thither, and yon across the digital space here.
Jer Thorp: Well, welcome everybody. We’re sitting in the bottom of the Adams Building in an office which is really nicely decorated. There’s lots of photographs and documents and pictures all over the walls. Maybe we can start from the very beginning. If somebody were listening and wanted to gift a digital item, how does that start?
Trevor Owens: Folks in the part of the library that Ted works in manage gifts, and so there’s a gifts email account. People email the gifts account all the time, and say, “Would the library be interested in this thing I made?” Maybe it’s any number of things. Part of the process we’ve been ironing out along with this for digital gifts is that rights are a huge part of this, and what entitlements actually come along with something.
Trevor Owens: In that case as had been done with this theoretical podcast, there’s a form to be filled out. Over the past decade or more a lot of streams of digital content have started coming into the Library of Congress. A lot of them have been set up very much to solve a stream of content, and part of what we’ve been trying to do, part of what’s core to the library’s digital collecting plan is thinking more about routine processes and workflows for acquiring digital materials that would in many ways mirror the same kinds of ways that all of this stuff comes in at scale in all these different processes.
Trevor Owens: The routine gift process I’m talking about right now is still coming together in some ways, but builds on a decade or more of work that has really paved the way for systems and infrastructure to do all these different kinds of things.
David Brunton: Ted and I have worked on copyright e-journals for over 10 years together. That’s more of a river than a stream now. I think we brought in almost 30 million files through that river last year. It would be worth noting that somebody giving the library a one-of-a-kind thing is always a little bit different than the stream.
Jer Thorp: Electronic copyright deposit, is that the biggest stream of flowing in of digital? I just sounded like a marketing person.
Ted Westervelt: It depends on how you measure it.
Trevor Owens: It’s a big … How to count things is hard.
David Brunton: Well, no. How to count things is easy.
Trevor Owens: Okay. Yeah.
David Brunton: How to count items is hard.
Trevor Owens: Yeah.
Jer Thorp: Wait. Unpack that for us.
David Brunton: Things are sensible. There are a set of things on this table, and if we asked to count the number of things on this table, most of us would agree. Most of us would agree that this green pencil is a thing. It’s not that we can’t unpack this thing more.
Trevor Owens: What is that made of? How many pieces are there in the pencil?
David Brunton: If I take the eraser out, now it’s two things.
Ted Westervelt: The lead.
David Brunton: When it comes to counting archived websites or e-journals, e-journals is a great case where at a guess we’ve got somewhere north of 50 million files, and somewhere north of three million articles. Is that somewhere in that range?
Ted Westervelt: That sounds about right. Yeah.
David Brunton: Somewhere around a few thousand titles. Each of those are a way of counting, but they’re vastly different numbers for counting all the same bits. It turns out that when we run some processes against, there are some numbers on the board behind Jer, and these numbers are a result of running processes against a web archive. There are whole numbers ranging from 400 million down to 300,000 of things that are contained in other things.
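Editor's note: David's point that titles, articles, and files are all valid counts of the same bits can be sketched in a few lines of Python. The records below are invented for illustration and are not library data, and the function names are hypothetical; nothing here describes the processes the library actually runs.

```python
from collections import Counter

# Hypothetical records: (journal title, article id, file name). Not real library data.
FILES = [
    ("Journal A", "a-001", "a-001.pdf"),
    ("Journal A", "a-001", "a-001.xml"),
    ("Journal A", "a-001", "a-001-fig1.jpg"),
    ("Journal A", "a-002", "a-002.pdf"),
    ("Journal B", "b-001", "b-001.pdf"),
    ("Journal B", "b-001", "b-001.xml"),
]


def count_levels(files):
    """Count the same records three ways: as titles, as articles, and as files."""
    titles = {title for title, _, _ in files}
    articles = {(title, article) for title, article, _ in files}
    return {"titles": len(titles), "articles": len(articles), "files": len(files)}


def count_by_extension(files):
    """Tally files by extension, in the spirit of the per-format totals read out earlier."""
    return Counter(name.rsplit(".", 1)[-1].upper() for _, _, name in files)


if __name__ == "__main__":
    print(count_levels(FILES))
    print(count_by_extension(FILES))
```

The same six records come out as two titles, three articles, or six files, depending on which level is treated as the item.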
David Brunton: The beauty of a non-rivalrous good, like a bit, is that we might think of it as one thing when we bring it in, but over time it may grow and proliferate into a number of …
Ted Westervelt: Yeah. To a certain extent I think it’s difficult or perhaps unhelpful to try and conglomerate everything together. There are major streams. There are minor ones, and the best way to describe it and think of it really is innate to each of those. I tend to think, if it’s apples and oranges, it’s all fruit, we want all the fruit together, but if you think about it in this way, if you look at, and it could be based on how we’re getting it, too.
Ted Westervelt: Within the e-journals we’re collecting, there are small ones we’re getting from essentially the equivalent of almost gift, the little people sending stuff that are producing one journal and they’re very dedicated to it. There are the major [inaudible 00:11:23] publishers just pouring stuff in through the spigot essentially. Then that’s big and it’s growing, it’s large.
Ted Westervelt: Then you think about web archives in a different way, because it’s a different thing. You’ll think about sound recordings in a different way, because it’s a different thing.
Trevor Owens: I think the weirder thing is it just keeps getting weirder, which that’s a terrible turn of phrase, but I’ll stick to it. I think we’ve already underscored that it’s hard to conceptualize this stuff even with materials that we’ve been working with for a long time. When you get into something like web archives, what’s fascinating is that there’s a lot of audio files in our web archives.
Trevor Owens: Some of those files probably come from podcasts, but they were acquired as part of a website. Even looking at the numbers of files in web archives gets really challenging, because some of them very clearly are … You would say to someone is this CSS file a thing? Then they’d be like, “What is this? I don’t even understand how to …” It doesn’t make sense except as a constituent part of another thing.
Trevor Owens: There’s a bunch of webby stuff that hangs together, but then there’s four-hour-long audio recordings in there. Those are actually made up of a bunch of constituent parts.
Ted Westervelt: I tend to think of, the more actually I talk to Trevor and think about the web archives, it’s sort of like we had this idea that we would just collect en masse, and really the web archiving is collecting en masse, and that you manage or describe or think of it at a very high level, the archive. Then, when you crack it open, then you see what’s in there and see if there are individual parts of it that deserve specialized attention within it. I think that’s the exciting part of it.
Jacque Brellenthin: Well, I just want to add onto that, then the next step from there is once you have all this stuff gathered, you’re going to have people who want to find it, and you have it and you want the people to find it because you have all these great materials.
Jacque Brellenthin: The first question that I asked when I was asked to catalog a podcast was, “At what sort of level are we going to catalog this?” By episode, because every episode has a very different topic, but the podcast as a whole really doesn’t. It’s about different processes, what the Library of Congress does.
Jacque Brellenthin: My first question was, “How are we going to catalog this?” Are we seeing this as one podcast or as many individual episodes? Where will we catalog it?
David Brunton: It’s so wonderful to hear you asking that question, because you’re literally the person I would ask.
Jacque Brellenthin: Because it comes across your desk and you figure people that are looking for this within a library catalog are going to find it at its highest level. I see it as opening the door. They’re going to click into the webpage itself where they will then find and be able to search for different topics within the podcast.
Jacque Brellenthin: It’s which door do you want to open? Do you want to open the door at a very episodic level where they go right to a specific episode, or do you want to open the door to a larger whole, where they can then interact with the podcast, go to the different episodes, maybe link out what other things in the library are pertaining to this specific episode and all that sort of stuff.
Jacque Brellenthin: My first question is always how are we viewing this particular item. Do we want to be very granular or very broad? What does the end user want? What does the patron of the library want?
Jer Thorp: I wonder if the challenge in cataloging digital materials is really similar to the challenge of cataloging manuscript materials.
Male: Yeah. It is.
Jer Thorp: Those manuscript files contain everything else the library holds. They contain photos and maps and books, and this idea that you all are faced with brand new formats and how do you manage those brand new formats.
Trevor Owens: I think one of the challenges with that too is that the new formats are also this continual remediation of the old formats into conceptual models and frames. In that vein, it prompts these interesting conversations about to what extent is something enough like something else that it should go there, and to what extent is it distinct enough that it should be treated differently, and how to connect these things.
Trevor Owens: It also has, I think, interesting aspects that follow things all the way through the end-to-end process, so that there’s selection decisions that get made about material when an offer has been made, or in any of these streams coming in. Those are also similarly done at aggregate levels or at lower item levels, and then similarly how those materials get processed and stored and managed, how they get made available. All those aspects end up getting tied to the conception of the objects.
Jer Thorp: I love the point you’re making that thus far if we talk about digital materials, and specifically things on the web, we’re really remixing. We’re remixing available protocols and stitching them together, such that we might think of a podcast as a new thing, but it’s an audio file, sometimes bound to an XML file that defines its assets and so on, and so on.
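Editor's note: the "audio file bound to an XML file" that Jer describes is, for most podcasts, an RSS feed whose items point at audio through an enclosure element. The sketch below parses a made-up feed with Python's standard library; the feed contents, the example.org URLs, and the function name list_enclosures are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 feed. Titles and URLs are placeholders.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.org/audio/ep1.mp3" length="12345678" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.org/audio/ep2.mp3" length="23456789" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def list_enclosures(feed_xml):
    """Return (episode title, audio URL, media type) for each item that has an enclosure."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            results.append(
                (item.findtext("title"), enclosure.get("url"), enclosure.get("type"))
            )
    return results


if __name__ == "__main__":
    for title, url, media_type in list_enclosures(FEED):
        print(title, url, media_type)
```

Seen this way, the "podcast" is the feed plus the files it points to, which is one reason the question of what to catalog has more than one answer.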
Jer Thorp: Twitter, we think of that as a brand new thing, but it’s just text sometimes bound to images and links. Do you think about a different future where something will arrive on your desk in digital that maybe is a fundamentally new thing? Does that keep you up at night, or does it make you excited?
Jacque Brellenthin: Well, I’m excited about-
David Brunton: I just got goosebumps.
Jacque Brellenthin: I’m excited about that, because in cataloging it’s actually very difficult sometimes when a podcast or something comes across the desk that fits into several different … A sound recording, a serial, and you’re actually attempting to catalog it in a system that was originally designed for tangible, single books, or, as we like to call them, monographs.
Jacque Brellenthin: Then you’re trying to almost fit a square peg into a round hole, because you want to cover multiple bases. There’s not really a great system, at least for cataloging, that you can really fit this into. You have to check many boxes and then go from there. I would love for one day the systems to all come together, where when you come across a piece, be it a web archive or a podcast or a book or a magazine, that you don’t have to live in a world where you have to fit it into these templates that have existed for so long.
Jer Thorp: What’s the biggest challenge when you’re trying to take a born digital object and map it to that?
Jacque Brellenthin: The biggest challenge for me is not what is the podcast about or deciding what is the title. That’s very basic data. It’s about where is it going to live. Will it live with the audio files? Will it live on its own, and how can I put that in a record so people can find it? If they’re using a facet button, what if they facet to sound recordings, but we’ve housed it in serials and they don’t talk to each other, or vice versa?
Jacque Brellenthin: How is it that the decision that I’m making is going to affect people who are specifically looking for a podcast?
Trevor Owens: Where that object gets placed in the grand network map of content makes a difference.
Ted Westervelt: The templates are built that way, but essentially what we want to do is make sure that there’s metadata associated with it that it is a sound recording and it is a serial, and it’s a digital resource. Thereby allowing people to discover it, which is the whole point.
Jacque Brellenthin: From different ways.
Ted Westervelt: From different ways. Right. No matter which angle you’re coming at it, because everyone has their own perspective when they’re coming in on this, is they can still find it. You make sure the paths are all leading to where they need to be.
Jer Thorp: I have this weird somewhat guilt that this podcast is getting so much attention, because I know there are objects that come in that just maybe get one. The catalogers are busy, and they’re like, “Insert small, short description,” and then off it goes into the collection. It only has that one hope. I always think that the more that a cataloger can add to it, the more hope it has of being accessed.
Trevor Owens: I think an interesting aspect of this that is something we’re doing a lot of, which this fits in with in many ways is that the organization is really set up around trying to figure out what the conveyor belts are that things need to fall onto. In this case I think this is one of these, like one of the things that’s going on in conversations about this is trying to tee up what primitives it should be associated with that it can then demonstrate ways forward.
Trevor Owens: One other thing I want to throw out, because you had mentioned earlier is there a thing that doesn’t fit into a lot of the primitives or the frames, and I think one of the ones that Ted and I are in a working group on, datasets.
Jer Thorp: Our audience will like this discussion.
Trevor Owens: The Library of Congress’ digital collecting plan identifies datasets and aggregations of content as one of the focal areas for the future of collecting.
Jer Thorp: This podcast arrives at the library. It’s accepted as a gift. Somebody brings it in on a USB key, and then ultimately electronic copyright. These files come in through a pipeline.
David Brunton: That’s right.
Jer Thorp: Now, you hold the digital file. Where does it go?
David Brunton: On that-
Jer Thorp: You can get really nerdy.
David Brunton: On that USB key, that USB key has represented this all in bits, and I think that’s one thing that it’s easy to lose sight of when we’re talking about all the different forms of a thing. It is actually the very first step that we take is to do a checksum of the bits and to fix those. We call it a fixity. We fix them and we keep a digital signature of those bits in order to keep track of them hopefully for the life of the bits, even when we have a derivative later.
David Brunton: Somebody plugs that USB key into a machine, and then they start using software. In the software we scan it for viruses and then we make a copy of it. Some, turns out to be a small percentage, but some percentage of the time when we make that copy an error has already occurred. Every single time for the life of those bits at the Library of Congress, every single time we do that we check the copy against the fixity that we previously saved.
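Editor's note: the record-then-verify routine David describes can be sketched with Python's standard library. This is a minimal illustration, not the library's software: the function names are hypothetical, SHA-256 is used because it is mentioned later in the conversation as the current default, and the virus scan is left out.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_fixity(source, destination, expected=None):
    """Copy a file, then check the copy against a recorded digest.

    If no digest arrived with the file, one is recorded from the source at receipt.
    """
    recorded = expected or sha256_of(source)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != recorded:
        raise ValueError(f"fixity mismatch for {destination}")
    return recorded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        original = Path(workdir) / "episode.mp3"
        original.write_bytes(b"be the bit")
        fixity = copy_with_fixity(original, Path(workdir) / "copy.mp3")
        print("recorded fixity:", fixity)
```

Passing expected covers the case where the digest arrives with the file, so an error made before receipt would also be caught.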
David Brunton: If the error occurred when the patron, when the giver of the gift was copying it on to the USB key and they didn’t include a checksum when they gave it to us, then it’s possible that there was an error that we didn’t know about, but at the time we received it we have a record of the bits. It is notable to point out that a lot of the larger streams of content coming into the library do have the fixity arrive with them.
David Brunton: We can actually push the provenance even a little further upstream than our receipt.
Jer Thorp: They come with a certificate in the way that if you were to donate a Picasso to the library, then you would accompany it with a certificate.
Male: Authenticating it.
Trevor Owens: Yeah. It’d be like an MD5 hash for the Picasso.
David Brunton: Did Picasso do that, MD5 hashes?
Jer Thorp: We’re pretty sure.
David Brunton: Yeah. That’s [crosstalk 00:22:55].
Trevor Owens: He may have been even into SHA, like 256 or something like that.
David Brunton: Which is kind of our default now. We do a higher level of check summing.
Jer Thorp: Because the amount of content that you’re consuming, you do get into a hash collision space, where things could actually.
David Brunton: Which is extraordinary, so the overwhelming majority of our hash collisions are intentional, meaning that the-
Trevor Owens: It’s the same thing.
David Brunton: Yeah. Trevor has some large number of transparent GIF images in the web archiving collection, and many of them are bit-for-bit identical. Those are intentional collisions, but an unintentional collision is that the bits are different, but the MD5 hash is the same.
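Editor's note: the distinction David draws can be made concrete by comparing the bytes whenever two checksums match. The sketch below uses a deliberately weak toy checksum (the sum of the bytes modulo 256) so that a collision between different contents is easy to produce; it is a stand-in for MD5, not MD5 itself, and the function names are hypothetical.

```python
def toy_checksum(data: bytes) -> int:
    """A deliberately weak checksum: the sum of all bytes, modulo 256."""
    return sum(data) % 256


def classify_pair(a: bytes, b: bytes, checksum=toy_checksum) -> str:
    """Label a pair of byte strings using the terms from the conversation."""
    if checksum(a) != checksum(b):
        return "different checksums"
    if a == b:
        return "intentional: identical content"
    return "unintentional: different content, same checksum"


if __name__ == "__main__":
    # Identical bytes always share a checksum.
    print(classify_pair(b"GIF89a", b"GIF89a"))
    # Reordered bytes have the same sum, so the toy checksum cannot tell them apart.
    print(classify_pair(b"ab", b"ba"))
    print(classify_pair(b"ab", b"ac"))
```

A matching checksum on its own does not say which case applies; only the byte-for-byte comparison settles it.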
Jer Thorp: Do you want to try a succinct explanation for what a hash collision is for our listeners?
David Brunton: John H Conway had a column back in the 60s about it was mathematical diversions. He described this technique called casting out the nines, which was a very, very old form of check summing, where you look at a number and you take all the nines out, and then you add the digits, and then you take the nine out of that, and then you add the digits again, and you take the nines out of that until there are no more nines left.
David Brunton: That number at the end is a signature for the line of numbers. Check summing is very much like that. If you imagine that you’ve reduced every line down to one of 10 digits, 0 through 9 at this point, every single one of them has this checksum of the line. There are only 10 possible checksums even though there are a very large combinatoric number of lines. The idea is that this helps you catch unintentional errors, not intentional errors.
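Editor's note: casting out the nines is small enough to write down in full. The sketch below is one common formulation, in which digits are summed repeatedly and a lone nine is cast out to zero; in this version the signature is the remainder after dividing by nine, so there are nine possible values. The function name is hypothetical.

```python
def cast_out_nines(number: int) -> int:
    """Sum the decimal digits repeatedly until one digit remains, treating 9 as 0."""
    n = abs(number)
    while n > 9:
        n = sum(int(digit) for digit in str(n))
    return 0 if n == 9 else n


if __name__ == "__main__":
    recorded = cast_out_nines(12345)
    # One digit changed: the signature changes, so the error is caught.
    print(cast_out_nines(12346) == recorded)
    # Two digits swapped: the digit sum is unchanged, so the error slips through.
    print(cast_out_nines(12354) == recorded)
```

The second comparison is a collision in miniature: two different numbers, one signature.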
David Brunton: You could construct a line that fooled the check summing algorithm, and it’s very easy to do with an MD5 checksum. Unintentionally though, usually if you check the hash against the hash that you’ve stored, the odds are that if they match there’s no one bit that has flipped. In the checksum collisions that we’ve had, we’ve had a couple of times where a very large file and a very small file had the same MD5 checksum, which is mathematically possible, just quite improbable.
David Brunton: Once you get into the hundreds of millions of files, it becomes more and more probable, and now it’s like certain, and in the past. When that happens, one of the things that we have to do is have a mechanism to identify, “This collision is an unintentional collision, not an intentional one. These two files aren’t the same.” The check summing algorithms have evolved over time, as computational power has increased, to have more possibilities.
David Brunton: The numbers themselves are longer. It would be as if instead of just having a one-digit number at the end of a line, you had a two-digit number, and now there are 99 possibilities before you collide instead of just nine.
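Editor's note: the effect of a longer checksum can be seen by truncating a real digest. The sketch below hashes 1,000 made-up file names with SHA-256, keeps only the first few hexadecimal digits, and counts how many names land on a value that is already taken. The names and the function count_collisions are assumptions for illustration.

```python
import hashlib


def count_collisions(items, hex_digits):
    """Count items whose truncated SHA-256 digest has already been seen."""
    seen = set()
    collisions = 0
    for item in items:
        short = hashlib.sha256(item).hexdigest()[:hex_digits]
        if short in seen:
            collisions += 1
        else:
            seen.add(short)
    return collisions


# Made-up, distinct inputs standing in for files.
ITEMS = [f"file-{i}".encode() for i in range(1000)]

if __name__ == "__main__":
    for digits in (1, 2, 4, 64):
        print(digits, "hex digits:", count_collisions(ITEMS, digits), "collisions")
```

With one hexadecimal digit there are only 16 possible values, so almost every item collides; each added digit multiplies the space by 16 and the collision count can only fall.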
Jer Thorp: I had a real-life hash collision the other day.
David Brunton: Oh no.
Jer Thorp: Well, no. It’s not that bad, because we were kind of …
David Brunton: You said collision, it sounds bad.
Jer Thorp: … poorly hashed in that the names that we use have a very high likelihood of a collision.
David Brunton: Did you find another Jer Thorp?
Jer Thorp: You’re David, David is a very common …
David Brunton: [crosstalk 00:26:41].
Jer Thorp: You probably get hash collisions a lot.
David Brunton: Many times per day.
Jer Thorp: I ordered a bubble tea, and I ordered this matcha horchata bubble tea, which I really love. I gave them my name. Then they called it, and I went to reach for it, and it was another, my full name is Jeremy, so it was another Jeremy who had ordered exactly the same thing I had. We had a physical hash collision. Our hands met at the matcha.
David Brunton: Who-
Jer Thorp: If we as a society decided that was a problem we wanted to fix, then we could come up with a naming system where there were more names. With the amount of things the library holds, your naming system has to be very, very … There has to be a lot of space for names, otherwise those collisions are going to happen more often.
David Brunton: That’s right.
Trevor Owens: Well, it’s worth underscoring too that name authority is a whole universe at work in cataloging, and that they’re adding dates to people’s names. It goes in every direction.
Jer Thorp: I’ve done some work with the name authority files, and I think we’ve even talked about them with the podcast before, and that’s another question.
Jacque Brellenthin: I made your name authority file. I might have to actually have you look at it and see if you’d like to add anything to it.
Jer Thorp: Awesome.
Jacque Brellenthin: Like I need to add the name Jeremy because I called you Jer in it. If you’d like to go by Jeremy, there’s still time to fix it before-
Jer Thorp: The MARC name authority and my mother will be the people that call me Jeremy. Now, we’re going to jump back into that conversation in a minute. We’ll dig a little bit deeper into how digital things are actually stored and served up to library users. We’ll also wade into some of the murky ethical waters of digital collection. We’ll talk specifically about the challenges that the people at the library face when they set out to collect materials from social media.
Jer Thorp: Before that, I want to talk about a new and extra cool tool that the team at LoC Labs released in October 2018. If you’re a regular listener, you probably remember an interview I did in episode two with Michelle Krowl about an unusual set of documents the library holds from just after the Civil War.
Michelle Krowl: Today I have to show you two entries from a left-hand penmanship contest. It’s part of the William Oland Bourne papers here in the manuscript division. William Oland Bourne was a chaplain at Central Park Hospital during the Civil War, and noticed people who were naturally right-handed were having to retrain themselves to write with their left hands. He would have them autograph his autograph books and note that they were using their left hands.
Michelle Krowl: When he was the editor of The Soldier’s Friend newspaper towards the end of the Civil War and then into the postwar period, he had the newspaper run two series of a left-hand penmanship contest, so that Union soldiers who had been wounded during the war and lost the use of their right hand or their right arm, and had naturally been right-handed before the war, they could send in samples of their left-hand penmanship and compete for cash prizes.
Jer Thorp: These letters written by hand by Civil War vets who’d lost a limb are so full of humanity. Often they tell stories of war experience and other times they just speak about how hard it has been to adjust back to non-war life, the day-to-day lives of people who have been dramatically changed by the battlefield.
Jer Thorp: Up until now these letters have only been accessible to those of us who’ve had the time and patience to read the physical documents or the scans of them which are made available. Now, the information in the letters is still coded in long-hand script, which makes them kind of useless to machines which have a hard time with handwritten text, or to the visually impaired, and, quite honestly, to a lot of us who maybe haven’t read a longhand letter in a while. That’s changing fast.
Jer Thorp: On October 24th, 2018, the library released crowd.loc.gov, a platform to crowdsource transcription of documents from the collections. It’s a really cool platform. It’s open source, yay, and it promises to liberate a whole lot of information in the archive that has until now been hidden away, bound up in cursive scripts or in strange paper formats. One of the collections we’re just starting with is the William Oland Bourne papers and the left-handed penmanship contest.
Jer Thorp: Crowd.loc.gov is a kind of bridge between the physical and the digital. It’s a way to connect deep histories to new technologies. I’ve personally already spent hours getting lost in the stories from the Bourne papers, and I suspect that if you give it a try you’ll find plenty of wondrous things that you can help bring to the surface.
Jer Thorp: Now, let’s get back to our conversation with Ted, Jacqui, David, and Trevor, and we’ll learn more about how the library manages its remarkable digital holdings.
Jer Thorp: Let’s round the picture, because we’ve talked about an object arriving and we talked about it getting a catalog record, and we talked about it coming off that USB drive and going into storage with this checksum that makes sure that it’s the right thing. Now, if I go onto the library’s website, and I request this item or any other item, what happens?
Jer Thorp: I want to preface that by saying what I know doesn’t happen, which is that the library does not actually have a webpage for everything that it holds, like sitting in storage somewhere waiting for you, and that when I request an item, just like what happens if I walk into the part of the library with books and I ask for a book, they don’t bring me to a table where that book just happens to be out there. Somebody goes and gets it on a cart and brings it through the tunnels. Some time later your book arrives to you.
Jer Thorp: The process is faster, but sort of similar, right? Do you want to talk a little bit about that?
Trevor Owens: There’s a few different modes by which access is enabled to digital collections of the library. The e-journals that we were talking about, that are part of copyright deposit, can be accessed by only two simultaneous on-site users. If you do searches that find your way towards material that is only available that way at the library, then you’ll end up seeing a note that you should go to the physical location where you can access those materials.
Trevor Owens: In a lot of cases those are e-journals that we also have access via publisher platforms, and so those if you’re doing searches in the e-resources catalog you’ll just resolve immediately to where you can read the document when you’re here on site, because we know. IP range-related things is how those work. The openly available digitized material is the bulk of the way that people think about, I think, library’s digital collections around the country.
Trevor Owens: It’s not necessarily the bulk of the way that researchers think about it who come here on site to do research with the collections. The website is the means by which people get access to those sorts of things. A good example would be the archive websites that we have. If you do a search for Giphy or something like that, you will find the Library of Congress record page about Giphy, and then you can click into it, and then go into our instance of the Wayback Machine, where you can then browse through the archived copy of the site as presented online.
Trevor Owens: Increasingly there’s a lot of work on trying to get those discovery systems to be better connected to each other and integrated to make it as easy as possible for a user to find what we have. I think as you’ve seen in your own explorations here, there’s so much and there’s so many differences in how these materials function and to express that in a meaningful way is a challenge. That, I think, is one of them.
Trevor Owens: There are a few different sets of discovery systems that people use to find things and get access to them.
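Editor’s note: Trevor mentions that on-site access to publisher platforms works through “IP range-related things.” As a rough illustration only, and not a description of the library’s actual systems, here is a minimal Python sketch of that kind of check. The network below is a reserved documentation range standing in for a hypothetical on-site network, and the function name is invented for this example.

```python
import ipaddress

# Hypothetical on-site network. 192.0.2.0/24 is a reserved documentation
# range, used here only as a stand-in; it is not a real library network.
ON_SITE_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_on_site(client_ip: str) -> bool:
    """Return True if the client address falls inside any on-site network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ON_SITE_NETWORKS)


print(is_on_site("192.0.2.17"))    # address inside the stand-in range
print(is_on_site("198.51.100.5"))  # address outside it
```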
Jer Thorp: I think there’s thinking about more ways, right?
Trevor Owens: Yeah.
Jer Thorp: We talked about datasets before, and I think there’s a lot of thinking in this institution about how the digital materials are available in batches for people who might want to compute across them. Is that something you’re thinking about directly right now?
Ted Westervelt: Part of the challenge of this institution is we’re balancing the needs of our patrons with the needs of the rights holders. It makes access something that we want to maintain while also balancing that out. Since, let’s be honest, the business processes are in flux and a lot of people are freaking out, and we have to be very careful and considerate about it. It’s very nice when someone gifts to us and we don’t have to worry as much about it, but not everyone’s in this same boat.
Ted Westervelt: I’m a little more hopeful that the metadata about an item, or high-level data mining/text mining, may be something we could provide more of across a collection, or whatever we’re calling a collection, than for an individual item. This is just my thinking. In any case, I’m hoping.
David Brunton: I think Ted’s point is a good one, specifically that secure access to rights-restricted content is hard. For the case of whether it be a journal that somebody’s deposited, or a gift that somebody has given us with some restrictions attached to it, which is something that happens frequently, the telling someone that we have it, providing some capability of authenticating the thing that someone’s looking at, providing some capability to understand what rights the using of the thing at the library confers with it is really more complicated.
David Brunton: Even what are you giving? There’s the delivery of the bits, but you are also delivering bits that you are not giving to the library, frequently. You probably don’t own the rights to the codec in your MP3 file that the library’s receiving. You probably don’t own the rights to any associated software that might be coming on that. Learning to be explicit about what the gift is, is a process that I’d say we’re in the beginning of.
David Brunton: We talked a little bit before about it’s a non-rivalrous good. If we weren’t careful and we said, “We’re going to put this thing on the internet,” and we either didn’t have the entitlement to do it, or the giver of the gift hadn’t considered the possible outcome of that, the Library of Congress can inadvertently usurp the role of the publisher, and inadvertently usurp the role of the original, where that item has originated.
David Brunton: Could inadvertently usurp the role of a local public library, who may also hold the item. That’s not a good place for the library to sit. We don’t want to suck the air out of the room. Ted and I are exchanging a meaningful glance, because many of the collections that we bring in through copyright deposit, we take a copy of it. We get our MD5 checksums. We get our list of articles out of it, but when you come to the Library of Congress and look at the article, you actually might look at it in the system that is built by the publisher. That’s a combination of what you’re used to, what you’re accustomed to.
David Brunton: It comes in as a special relief agreement that the publishers agree to grant us this access as a part of the copyright deposit process.
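Editor’s note: David mentions taking MD5 checksums of deposited copies. As a minimal sketch of what a checksum comparison looks like in general, assuming nothing about the library’s actual tooling, the Python below records a checksum for some bytes and later confirms that a copy still matches. The function names and sample bytes are made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def verify(data: bytes, expected_md5: str) -> bool:
    """Check that the bytes still match a previously recorded checksum."""
    return md5_hex(data) == expected_md5.lower()


# Illustrative stand-in for a deposited file's contents.
deposited = b"example article bytes"
recorded = md5_hex(deposited)  # checksum recorded at deposit time

print(verify(deposited, recorded))         # an unchanged copy matches
print(verify(deposited + b"!", recorded))  # an altered copy does not
```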
Trevor Owens: In that vein, I think a core point with that, that I think comes through really well in the new strategic plan is this idea that creators are one of the central user bases that we serve, and that that means people who make work that gets published, and it means people that make work that’s based on our collections, and that there’s this relationship between materials in those two cases, but that that’s a great cycle that we’re excited to support.
Trevor Owens: I think it’s also the central thing to be thinking about how are we supporting creator and creator communities in those cases who are served by the copyright office and by being a part of the collection.
Jer Thorp: You just opened the door to a topic that I want to talk about, which is social media. I think there’s something very interesting about the idea of social media existing in collections, and it connects very directly to what I think about, about rights. For a long time this institution collected tweets, and that data for you and I and a lot of us includes our tweets.
Jer Thorp: Although we did click an “okay” in the Twitter terms of service, I think there maybe is an argument that I never understood that my tweets would become part of the record of an institution. When you’re collecting anything where you have social media, what are you thinking there now, and is there a delineation to be made between archival and collection, and what does that mean?
David Brunton: I think this is just the kind of sticky issue that we should be spending a lot of time on. I don’t know if I agree that it’s particularly different than the sticky issues that we previously ignored. I don’t know that it’s different than the holding of a newspaper that has a record of your actions as a juvenile, or holding a …
Jacque Brellenthin: An opinion column.
David Brunton: An opinion column.
Trevor Owens: Letters from children to scientists in the manuscript collection.
David Brunton: Right. I think newspapers are a great example of this, because a newspaper is an example of something where even the holding of the copyright, it’s a special case. Just because a logo of a corporation was published in a newspaper, doesn’t mean that the newspaper publisher owns the rights to that logo. When the newspaper publisher transfers it to the Library of Congress, that right to republish that logo in a particular context doesn’t transfer with it.
David Brunton: I think social media … In 2010, when I first got involved in the collecting of the Twitter data, it wasn’t clear to me or to the profession, or to, I think, the public, writ large, what kind of thing it was. Is Twitter an item? Is Twitter a collection of items? Is it a serial? Is it a way of talking to each other?
Trevor Owens: Was it published?
David Brunton: Was it published?
Ted Westervelt: Mm-hmm (affirmative). [crosstalk 00:41:18].
Jer Thorp: Jacqui’s imagining right now [crosstalk 00:41:19] having to catalog [crosstalk 00:41:20] Twitter.
Jacque Brellenthin: By person, by thing, by company.
David Brunton: Is it …
Jacque Brellenthin: Just Twitter.
David Brunton: Interestingly, even at the time in 2010 the decision to collect the Twitter archive was actually not made in the collecting part of the library. It was made in the now defunct Office of Strategic Initiatives by a person really on the technology side of the library, who viewed it as an opportunity for the library to be forward-thinking. As with all opportunities to be forward-thinking, it was a mixed bag.
David Brunton: I’ll give an example of a way in which I think it was really positive. When we brought in the historic archive, which is the 2006–2010 gift from Twitter, it was all public tweets. I think there were 20 billion tweets that were spread across a whole range of files. Subsequent to that, the tweets that we got from Twitter were sent to us on an hourly basis, one file per hour.
David Brunton: We decided to reprocess the old tweets so that we’d have a one-hourly file for each. There were 30,000 hours. We ran 30,000 of our automated collections workflows on these. It broke everything, because it was 30,000, and we were used to doing 12 a day. It was a phenomenal way to learn, “Oh, hey. We’ve got some scale problems that are in our present, not just in our distant future when we add three orders of magnitude onto the scale of something.”
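Editor’s note: the reprocessing David describes, regrouping the historic tweets so there is one file per hour, amounts to bucketing records by the hour in which they were created. The Python below is a minimal sketch of that idea under invented assumptions: tweets are represented as (timestamp, text) pairs, and the key format and function names are hypothetical, not the library’s workflow.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_key(ts: datetime) -> str:
    """Name the hourly bucket (in UTC) that a timestamp belongs to."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def bucket_by_hour(tweets):
    """Group (created_at, text) pairs into one list per hour."""
    buckets = defaultdict(list)
    for created_at, text in tweets:
        buckets[hour_key(created_at)].append(text)
    return dict(buckets)


# Invented sample data: two tweets in one hour, one in the next.
sample = [
    (datetime(2009, 5, 1, 13, 5, tzinfo=timezone.utc), "first"),
    (datetime(2009, 5, 1, 13, 59, tzinfo=timezone.utc), "second"),
    (datetime(2009, 5, 1, 14, 0, tzinfo=timezone.utc), "third"),
]
hourly = bucket_by_hour(sample)
for key in sorted(hourly):
    print(key, len(hourly[key]))
```

For a sense of the scale problem: 30,000 workflow runs at the usual 12 a day would be 2,500 days of ordinary workload, close to seven years.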
David Brunton: I think the thinking about an item is another great example, because when you talk about how you’re going to do it.
Jacque Brellenthin: Cataloging Twitter sounds like a nightmare. Maybe I’ll be at the point someday where I can let someone else try.
David Brunton: If you think about a tweet as a thing, there are a trillion of them. There are fewer files than that, because we put lots of tweets into a file, and there are fewer deliveries than that because we put lots of files into a delivery. There are fewer datasets than that because all of these deliveries make up a dataset. How somebody is going to access it, the library’s made the decision that these have gone into essentially a dark archive here, which means that scholars and patrons don’t have access to them at the Library of Congress.
David Brunton: When we think about the future, the question about how and whether people will want to access these materials is … I’m glad-
Jacque Brellenthin: We’d need subject headings for hashtags.
David Brunton: I’m glad you have to think about it.
Jer Thorp: I want to bring this full circle. I have a lot of friends who make really great Twitter bots. For example, my friend Allison Parrish made the every word bot, which for years listed literally every word in the dictionary, and got a lot of attention. Allison could gift that to the library?
Trevor Owens: Well, it could be offered as a gift. The process by which determinations are made if something should be added to the collection involves, as a central component of it, recommending officers who are subject specialists who weigh in on potential things that come through. Even when you’re looking at some of the stuff even with journals and books there’s a selective process involved in, “Is this material relevant to these subject-based areas of whatever collection?”
Jer Thorp: It’s fascinating to me that there’s a much different tilt in intention there though, that if [inaudible 00:45:15] were to say, “Hey, I actually want to donate those tweets.” There they are in a file. We’re going to give them to you. Then they would go through a similar process that we describe with a much different intentionality. To me it’s an interesting difference, and it speaks to some of, again, this difference in my mind between archival and collection.
Jer Thorp: Where Twitter archives its whole thing, and for a while you all were essentially mirroring that archive. To change it to collection, it is like let’s think about what parts of Twitter are particularly important. Let’s approach the people who made those parts and ask them.
David Brunton: I love that approach, and I think it mirrors the way that the Library of Congress collects web archives.
Ted Westervelt: I think it gets back to there are a lot of things we probably could do, but it’s also we have to remind ourselves we are the federal government, and we have to be very concerned about the people of the United States, at least, and really the world, so that we don’t seem like we’re just lumbering across a landscape sweeping stuff up, and doing it because we know what’s right.
Ted Westervelt: It’s very much engaging with them. [inaudible 00:46:24] Trevor have talked to this point before, is that, and that’s the challenge with the Twitter archive is that, you’re right, we all use social media, and we all on some level know that they belong to Mark Zuckerberg, or Jack, or whomever the thing, but on the other hand we don’t really think that. We don’t behave like we’re doing that. We need to consider that when we’re doing it.
Ted Westervelt: If we do want, if some social media is important to us, then we do need to reach out. We need to be good partners and explain to individuals [inaudible 00:46:58]. I think that’s important as we go forward, and I think everyone agrees on that.
Jer Thorp: One of the beautiful things about this process here has been to talk to people about the objects that live in the library that they feel a particular affection for, and listeners have felt this, this kind of love for it. I wonder if you can talk about something that lives in the digital archive that you feel an affection for, whether it be because it’s an object that’s meaningful for you, or because of the experience you’ve had with it. Trevor.
Trevor Owens: I, a few years ago, gifted my video games to the Library of Congress. I loaded them up in a bin, and I drove down to the Packard Campus in Culpeper. I’d arranged this. They were interested in them, so all of my childhood Super Nintendo games and Nintendo cartridges are in a bin on a shelf. The cartridge of Earthbound at the Library of Congress has my saved game files in it, presuming that the battery hasn’t run out of power or something like that. That would be one.
Jer Thorp: David.
David Brunton: My introduction to the library came via a finding aid for the personal papers of John von Neumann, which are held in the manuscript division here, and that are largely not available digitally. All of my favorite little tidbits come out of the manuscript collections of luminaries in the history of information science. Probably my current favorite is that the papers of Lorenz are here, which include Lorenz’s implementation of the Lorenz attractor.
David Brunton: Down in the preservation division, I spent an afternoon staring at it, running on a screen inside of an emulator, which I thought was a, I don’t know, it felt like a Back to the Future kind of thing. I wasn’t sure if I was looking at something from 30 years ago, or if I was looking at something 30 years in the future.
Jer Thorp: Jacqui.
Jacque Brellenthin: I have a very brand new favorite, actually, that I’m just starting to dig into. The library just released the Gershwin home videos. A lot of my study back in college was early American musical theater and early American music at the turn of the century. It’s been really fun for me to start watching these videos and pick out these faces that I would write a lot about, but actually just see them just interacting and in such a simple way, all these home videos.
Jacque Brellenthin: I’ve just watched a few, and I found a few metadata issues that I want to note. That’s going on too, but I really have enjoyed pulling them up and then following the trail where I’m going as I watch through these videos, not on work time, Ted.
Jer Thorp: A cataloger’s work is never done.
Ted Westervelt: It’s work-related. It’s fine. It’s fine. It’s other duties as assigned.
Jer Thorp: Ted.
Ted Westervelt: I think one of the things I love most is when you find stuff in the collection, in the sort of act of acquiring, that’s about hobbies, especially what people become very passionate about. Just to underline, there was a print journal, I hope it’s digital, but, in any case, I remember seeing years ago. It was something like the society of brick collectors. These were people that collected bricks and they were fascinated by them.
Ted Westervelt: On the back page there’d be someone with a picture of this special brick, their favorite brick, and describe it. That is just so touching. It’s so wonderful that people care about these sort of things. Now, we’ve collected the back files of not bricks, but people, I think it’s the North America Jaguars Owners Society.
Ted Westervelt: We’ve got the past 50 years’ worth of it, and it’s wonderful to see people who are so passionate about their hobbies and interests that they’re willing to produce this, and then give it to us, in any case, so that we can preserve this wonderful snapshot on what just drives people in their personal lives [inaudible 00:50:53].
David Brunton: Apparently in this case what drives them is a Jaguar.
Ted Westervelt: Very good. Thank you.
Jer Thorp: A dad joke from David.
Trevor Owens: The animals or the cars?
Ted Westervelt: The cars. Sorry. No, not the animals, the cars.
Jer Thorp: That’s a really nice linkage, because I was a teenager in a time when the internet was very much a hobbyist thing. I ran a bulletin board that I would run in the evenings. We talked about mailing diskettes back and forth together. There’s something nice to wrap the idea of your gamer hobby becoming a piece of this digital archive.
Jer Thorp: I want to say thank you to all of you for this discussion. It was really great.
Various: Thank you.
Jer Thorp: All right.
Male: This is great.
Jacque Brellenthin: 245, Artist in the Archive. 264, Washington, DC, Library of Congress 2017. 300, one online resource (audio files). 336, spoken word, spw, rdacontent. 337, audio, s, rdamedia. 337, again, computer, c, rdamedia. 338, online resource, cr, rdacarrier. 344, digital. 347, audio file, rda. 347, MP3.
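Editor’s note: the numbers Jacqui reads out are MARC field tags, and several of them (337, 347) repeat. As a minimal sketch of how such a read-out could be held as data, the Python below stores the fields as (tag, value) pairs and groups repeated tags together. The values are simplified from what was read aloud, and the structure and names are illustrative only, not the library’s cataloging system.

```python
from collections import defaultdict

# Fields as read aloud, simplified to (tag, value) pairs for illustration.
FIELDS = [
    ("245", "Artist in the Archive"),
    ("264", "Washington, DC, Library of Congress 2017"),
    ("300", "One online resource (audio files)"),
    ("336", "spoken word"),
    ("337", "audio"),
    ("337", "computer"),
    ("338", "online resource"),
    ("344", "digital"),
    ("347", "audio file"),
    ("347", "MP3"),
]


def group_by_tag(fields):
    """Collect values under their tag, keeping repeated fields in order."""
    grouped = defaultdict(list)
    for tag, value in fields:
        grouped[tag].append(value)
    return dict(grouped)


record = group_by_tag(FIELDS)
for tag in sorted(record):
    print(tag, record[tag])
```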
Jer Thorp: Artist in the Archive is recorded by Jer Thorp and produced by Margaret Kelly. The music you heard was composed by [Roll 00:52:21] Music. You can find out more about some of the things we’ve talked about today by visiting our finding aid at artistinthearchive.tumblr.com. As always, we’d love to hear from you. If you have any questions or comments, you can leave us a note on the Tumblr, or you can email me directly: jer, J-E-R@ocr.nyc. I’ll see you next time.