ACA Preparation Part 3: Preservation (continued)

Hi all! Let’s finish up a few remaining points from our discussion of preservation in the last post…

Emergency Preparedness

For most of us, it is easier to identify and solve issues that are happening right now. Let’s look at an example:

Problem: “I am processing a collection, and I ran out of archival folders.”

Solution: “I need to order more ASAP.”

It’s a simple example, but you get the point. In contrast, when an issue is merely “potential” or seems nonthreatening at the moment, it can be harder to find the motivation to do something about it. This is why many archives either lack or struggle to maintain a viable emergency preparedness and response plan.

Emergency preparedness for an archives is not a frame of mind or an abstract concept; it is an action plan. It is a written document that identifies various types of potential emergencies and their sources (natural disasters, human/mechanical errors…), explains the procedures for dealing with emergencies, and explains each staff member’s role in protecting the archives and its resources in the event of an emergency. In addition, these plans usually contain lists of first responders, necessary supplies, and diagrams of the building and the physical location of all archival materials. Here are a few nuggets about emergency preparedness courtesy of Mary Lynn Ritzenthaler’s book Preserving Archives & Manuscripts:

  • Emergency preparedness plans should include…
    • List of potential disasters and their sources
    • Physical location of all records
    • Mapped routes of escape
    • Supply lists
    • Written, step-by-step procedures for handling an emergency
    • Clearly delineated roles among staff members
    • List of first responders
    • List of exits
    • Preventative measures to help lessen the effects of a disaster or even prevent its occurrence altogether
  • All staff members should read and know the emergency preparedness plan.
  • Copies of the emergency preparedness plan should exist in various locations, and staff members should know all of these locations.
  • Know your vital records and note their locations in the plan.
  • HUMAN SAFETY COMES FIRST. I know, it seems obvious, but when you’re an archivist, it is easy to forget that!
  • Have conversations with your first responders (security guards, firefighters, local police, etc.) in order to forge positive relationships with them and to keep them informed about archives and the type of protection they need. Remember, first responders may not know much about archives, so it’s good to educate them.
  • Consider options for 24/7 alarm systems.
  • Push for regular facility checkups so potential building issues can be identified early on.

Remember, an emergency preparedness plan is not the same as a disaster recovery plan. The latter is a plan for rebuilding once a disaster has already occurred. The former is about being ready for a disaster that may come one day and making plans that will hopefully mitigate its effects.

Digital Preservation

Digital preservation means taking action to ensure that digital objects persist into the future. Identifying and implementing the proper steps for digital preservation is quite a challenge for archivists, largely because technology is constantly evolving. This blog post will not even come close to scratching the surface when it comes to discussing digital preservation, but hopefully it will provide a few things to consider.

Digital objects come in many forms. Here are a few of the most common…

  • Digital photos
  • Digital documents
  • Harvested web content
  • Digital manuscripts
  • Static and dynamic data sets
  • Digital art
  • Digital media publications
  • Social media content
  • Digital video and audio

Digital objects are unique because their structure is so complex. In order to render digital objects and interpret them, we have to understand them as physical objects, logical objects, and conceptual objects. The physical object is what we can hold in our hand. It is the inscription of signs on a physical medium (ex. a file written on a flash drive or DVD). The logical object refers to the bitstream that is recognized and processed by software. The conceptual object is the object as it is understood by humans, usually in a GUI-type environment. In order to experience a digital object, we must use a physical object to express a logical one. This involves both hardware and software.

Main methods for handling digital preservation:

Migration – changing a digital object that currently depends on an obsolete environment into a new object suited to a new environment

  • Media migration – moving bits from one medium to another (no change in bits)
  • File format migration – taking a set of bits and transforming them into a new representation
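To make media migration a little more concrete, here is a minimal Python sketch of my own (not taken from any particular preservation tool). It copies a file to a new location and compares SHA-256 checksums before and after, since media migration should leave the bits unchanged. The function names `sha256_of` and `migrate_media` and the file names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_media(source: Path, destination: Path) -> str:
    """Copy a file to new storage and confirm the bits did not change."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"Checksum mismatch after copying {source}")
    return after


if __name__ == "__main__":
    # Hypothetical demo: a temporary folder stands in for the old and new media.
    with tempfile.TemporaryDirectory() as tmp:
        old_copy = Path(tmp) / "old_media.bin"
        new_copy = Path(tmp) / "new_media.bin"
        old_copy.write_bytes(b"example bitstream")
        print(migrate_media(old_copy, new_copy))
```

If the two checksums ever differ, the copy cannot be trusted as a media migration and should be redone.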

Emulation – creating a new process in hardware and software that mimics an obsolete process to render a digital object without modification

  • Original system emulation – creator of content provides the original environment (ex. obsolete software) and the archivist recreates this environment using disk images
  • Emulation as a service – creating a more generalized emulated environment using vintage data (ex. FreeDOS provides an MS-DOS-compatible environment for programs that used to run on MS-DOS)

Metadata is the last thing I’m going to mention. Metadata takes on a whole new meaning and importance when we are dealing with digital items because digital material is extremely vulnerable to losing context. Metadata may be created by the record creator, the donor, the archivist, and sometimes the machine or software rendering the digital material. There are many metadata standards and schemas for preserving digital material. Some include Dublin Core, METS, MODS, and PREMIS.
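Here is a small, hedged Python sketch of what a simple Dublin Core description might look like when generated by a script. The `dc` namespace URI is the standard Dublin Core elements namespace; the `record` wrapper element, the function name `dublin_core_record`, and all of the field values are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> str:
    """Build a tiny XML record with one dc: element per supplied field."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values describing a single digital photograph.
    print(dublin_core_record({
        "title": "Family road trip photos",
        "creator": "Unknown donor",
        "date": "1998",
        "format": "image/tiff",
    }))
```

A real repository would follow its own application profile, and standards like METS or PREMIS carry much more structure than this.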

Management/Education

As I close my discussion on preservation, I’d like to share some great materials that might be of use when studying for the ACA exam, or in the future, managing preservation in an archival program. A few of these I’ve mentioned already, but some I have not…

  • Preserving Archives and Manuscripts by Mary Lynn Ritzenthaler
  • Northeast Document Conservation Center leaflets
  • Photographs: Archival Care and Management by Mary Lynn Ritzenthaler and Diane L. Vogt-O’Connor
  • A Glossary of Archival and Records Terminology by Richard Pearce-Moses
  • “Defining ‘Born Digital’” by Ricky Erway
  • OAIS Model
  • Becoming a Trusted Digital Repository by Steve Marks
  • Countless articles in the American Archivist, Archival Issues, Archivaria, Archival Science, and others that talk about preservation

Again, these are just a few of the many resources available regarding preservation!

My Walkman Obso “left me”

Let’s play a game. What do all of the items in these photos have in common?

Aside from them all being intense triggers of nostalgia, the items in these photos share a special characteristic. They are all … obsolete! In other words, they suffer from an incurable disease that frequently haunts pieces of hardware and software in the world of electronic and digital preservation. Obsolescence, or the condition of being obsolete, means that a resource is “no longer in use” or has “fallen into disuse.”[1] Car phones, cassette tapes, U-matic tapes, VCRs, portable CD players (aka Walkman), Apple Macintosh software, and WordPerfect are just a few examples of the many obsolete pieces of hardware and software that are fading further into history with each passing day. Though obsolescence is inevitable in a technologically advancing world, it presents a challenge for archivists attempting to preserve electronic and digital media into the future. This is because electronic objects are “performances.” They require a source which is the content (ex. a data file), a process that can read and interpret the content (ex. a piece of hardware), and a performance of the object that makes it understandable and accessible to humans (ex. a specific software).[2] So, we can save a source all we want, but if we do not have the hardware we need to process it and/or the software we need to perform it, the content is as good as lost.

Oh geez … I’m getting emotional. Am I the only one who gets emotional during discussions of obsolescence??

However, there is hope. Well, hope for saving the content anyway. In order to preserve information that requires obsolete hardware or software, archivists often engage in migration, emulation, or a combination of both. Migration is the most common preservation approach. It means copying or converting digital objects from one technology to another while preserving the object’s significant properties.[3] Emulation is a bit different because instead of manipulating the object, it manipulates the object’s environment. Emulation involves “preserving the bitstream of the object and creating an access version by using current technology to mimic some or all of the environment in which the original was rendered.”[4]
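As a tiny illustration of migration that preserves a significant property, here is a Python sketch of my own. It treats re-encoding a text file from Latin-1 to UTF-8 as a stand-in for a file format migration: the bits change, but the text a reader sees should not. The function name `migrate_text_encoding` and the sample bytes are hypothetical, and real format migrations (say, between word processing formats) are far more involved.

```python
def migrate_text_encoding(data: bytes,
                          old_encoding: str = "latin-1",
                          new_encoding: str = "utf-8") -> bytes:
    """Re-encode text bytes and confirm the decoded text is unchanged."""
    text = data.decode(old_encoding)
    migrated = text.encode(new_encoding)
    # The "significant property" in this sketch is the text itself.
    if migrated.decode(new_encoding) != text:
        raise ValueError("Text changed during migration")
    return migrated


if __name__ == "__main__":
    original = b"caf\xe9"  # the word "café" stored as Latin-1 bytes
    migrated = migrate_text_encoding(original)
    print(original != migrated)               # the bitstream is different
    print(original.decode("latin-1") == migrated.decode("utf-8"))  # same text
```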

While there’s no doubt that saving the content of digital objects is crucial, it is interesting to think about what we lose when an object’s original form becomes obsolete. For instance, I distinctly remember sitting in the car on family road trips listening to Brandy’s Never Say Never CD on my mom’s Walkman. Thanks to the advent of digital audio files, I still enjoy listening to the music from the CD on my iPhone. However, I can’t experience it the way I used to on the Walkman. Is this meaningful? Well, perhaps not in this situation, but in others, it may be.[5]

Check out this video from the National Archives (skip to 3:23) to see a cool collection of obsolete hardware that NARA is collecting!

What are some of your thoughts on obsolescence and ways archivists can manage it?

[1] Dictionary.com, s.v. “Obsolete,” accessed March 27, 2016, http://www.dictionary.com/browse/obsolete?s=t.

[2] Seth Shaw, “Representations and Performances” (presentation to distance learning ARST 5300 Digital Preservation course, August 31, 2015).

[3] “Selecting the Right Preservation Strategy: Migration,” JISC, accessed March 27, 2016, http://www.paradigm.ac.uk/workbook/preservation-strategies/selecting-migration.html.

[4] “Selecting the Right Preservation Strategy: Emulation,” JISC, accessed March 27, 2016, http://www.paradigm.ac.uk/workbook/preservation-strategies/selecting-emulation.html.

[5] For a more in-depth discussion of this topic, see Margaret Hedstrom et al., “‘The Old Version Flickers More’: Digital Preservation from the User’s Perspective,” The American Archivist 69, no. 1 (2006): 159-187.

Archivists: solving crime bit by bit

What could an archivist and a private investigator possibly have in common? “Not much,” might have been the answer thirty years ago. Nowadays, that answer has changed dramatically. We especially see this change manifesting itself through the emergence of digital forensics technology in archival contexts. Broadly speaking, digital forensics “is a branch of forensic science encompassing the recovery and investigation of material found in digital devices.”[1] At the 2001 Digital Forensics Research Workshop (DFRWS), digital forensics was defined as “the use of scientifically derived and proven methods toward the preservation, collection, validation, identification, analysis, interpretation, documentation, and presentation of digital evidence … for the purpose of facilitating or furthering the reconstruction of events found to be criminal” (qtd. in Gengenbach 2012).[2] Whew. That’s a mouthful, isn’t it? This is starting to sound like an episode of CSI or Law and Order! That’s because, well…it kind of is. Digital forensics has its roots in law enforcement, and according to Martin Gengenbach, “it is only in the recent past that practitioners and researchers in digital forensics and digital preservation have recognized the overlap in their respective fields.”[3]

When it comes to digital preservation, we see concepts like evidence, proof, and trustworthiness (things we may normally associate with law enforcement and crimes) take on a new meaning. Obsolescence, bit rot, bit flips, unauthorized access, and questions of custodianship are just a few of the many threats to digital materials that archivists must work against. Hence, being able to prove the integrity of digital objects and the integrity of the archives managing the digital objects is crucial.

As digital material becomes more and more prevalent in archival acquisitions, the ability of archival institutions to preserve and manage digital materials in a trustworthy manner becomes more imperative. Thankfully, there are some tools to help us do this. BitCurator is one example. The BitCurator project began through a grant from the Andrew W. Mellon Foundation in October 2011. The principal developers of the project hail from the School of Library and Information Science at the University of North Carolina at Chapel Hill and from the Maryland Institute for Technology in the Humanities at the University of Maryland, College Park.[4] The purpose of the BitCurator project is to provide “a stack of free and open source digital forensics tools and associated software” to help libraries, archives, and museums (LAMs) incorporate digital forensics into the archival workflow.[5] BitCurator helps archivists transfer and extract digital materials to and from repositories with integrity and trustworthiness; it facilitates access by helping users to make sense of digital data within the appropriate context; it also helps identify and protect sensitive data that may exist in digital collections.[6]

If you ask me, all of this sounds like a big job for one tool, doesn’t it? Well, that’s why BitCurator isn’t just “one tool.” It’s actually a suite of tools working together to accomplish a variety of functions. This is what we call a modular tool, meaning that it is built from many different parts as opposed to being one, monolithic architecture. Let’s break this down a bit further by exploring a few of BitCurator’s functions and naming some of the tools it uses to complete these functions…

  • Function: Creating disk images
    • Example Tool(s): Guymager, dcfldd, cdrdao
  • Function: Forensic analysis and metadata generation
    • Example Tool(s): fiwalk, bulk_extractor, The Sleuth Kit
  • Function: Fixity checking and validation
    • Example Tool: GtkHash
  • Function: Facilitating access
    • Example Tool: BitCurator Disk Image Access Tool
  • Function: Locating and removing duplicate files
    • Example Tool: FSlint
  • Function: Indexing
    • Example Tool: Recoll

These are just a few functions and tools, but if you are interested in viewing the entire list, visit the BitCurator wiki at this page: http://wiki.bitcurator.net/index.php?title=Software#Tools_in_the_BitCurator_Environment.
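To give a feel for one of these functions, here is a minimal Python sketch of locating duplicate files by comparing checksums. To be clear, this is my own illustration of the general idea, not how FSlint or BitCurator actually work internally; the function name `find_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicates(root: Path) -> List[List[Path]]:
    """Group files under root that share the same SHA-256 digest."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reading whole files is fine for a sketch; large files
            # would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests that more than one file shares.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    # Scans the current folder; each printed group holds identical files.
    for group in find_duplicates(Path(".")):
        print([str(p) for p in group])
```

The same digests double as fixity information: if a stored file's checksum changes later, its bits have changed.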

I do not have personal experience using BitCurator, but I’m curious to know if anyone else does. What was your original goal when you began using BitCurator? Were you able to accomplish this goal? What do you like/not like about BitCurator? In your opinion, how user friendly is it?


[1] “Digital Forensics,” Wikipedia, last modified October 1, 2015, https://en.wikipedia.org/wiki/Digital_forensics.

[2] Martin J. Gengenbach, “The Way We Do It Here: Mapping Digital Forensics Workflows in Collecting Institutions,” (MLIS master’s paper, University of North Carolina at Chapel Hill, 2012), 6, http://digitalcurationexchange.org/system/files/gengenbach-forensic-workflows-2012.pdf.

[3] Ibid., 10.

[4] Christopher A. Lee et al., “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions” (a product of the BitCurator Project, September 30, 2013), 1, http://www.bitcurator.net/docs/bitstreams-to-heritage.pdf.

[5] “BitCurator: About the Project,” UNC School of Library and Information Science, last accessed December 15, 2015, http://www.bitcurator.net/bitcurator/.

[6] Christopher A. Lee et al., “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions” (a product of the BitCurator Project, September 30, 2013), 3, http://www.bitcurator.net/docs/bitstreams-to-heritage.pdf.

A stint in Cleveland, a lifetime of memories

Thanks to my amazing internship supervisor at the Internet Archive, Jefferson Bailey, I had the opportunity to attend the Archive-It partner meeting at the 2015 Society of American Archivists annual meeting in Cleveland, Ohio. It was so much fun! I really enjoyed getting to know the Archive-It staff members face-to-face. During my time at the meeting, I gave a presentation about my work with the Archive-It K-12 Web Archiving Program, learned about the current status of the Internet Archive and Archive-It, and found out about some exciting web archiving efforts that various institutions are engaging in through partnerships with Archive-It.

Me (JoyEllen) presenting at the 2015 Archive-It Partner Meeting–part of the 2015 Society of American Archivists annual meeting
Archive-It goodies!

Now I’ll share some of the most interesting info I learned from the meeting:

Some Archive-It stats:

  • Archive-It already has 64 new partners this year
  • Since 2006, Archive-It partners have created over 530 terabytes of data
  • Archive-It partners are responsible for preserving 12 billion URLs
  • Archive-It had seven K-12 partners this year

Some Archive-It partnerships I found particularly interesting:

  • The National Library of Medicine is currently collecting born-digital resources to document the 2014 Ebola outbreak
  • The Kansas state archives along with various universities in Kansas have partnered together to create the Kansas Archive-It Consortium (KAIC). Yes, it is pronounced like “cake.” The goal is to create a web archiving “documentation strategy,” if you will, involving multiple institutions and providing increased access to preserved web content.
  • The University of Scranton uses DuraCloud, an on-demand storage space powered by DuraSpace, to back up its Archive-It WARC files. Better safe than sorry!

Some updates on the Internet Archive:

  • About 490 billion URLs collected
  • Receives about 3 million visits per day
  • Currently collecting millions of hours of television (though not all of this is available)
  • Forward-focus goals:
    • Shifting focus from collecting toward accessibility
    • Working on changing the Wayback Machine to make searches easier (i.e. you won’t have to know the exact URL of a site to search its web archives)
    • Focus on improving book and text searches

Lastly, I’d like to give a huge shout-out to Archive-It for hosting a lovely dinner on Tuesday night! We enjoyed dinner at Hodge’s restaurant in downtown Cleveland. OMG. If you’re ever in Cleveland, you must go. It is yummy!

Hold up, wait a minute!

It recently came to my attention that I may have jumped the gun a little bit. In my eagerness to talk about web archiving, particularly Archive-It’s web archiving initiative for K-12 students, I failed to explain how this process actually works. After all, managing born-digital content is a relatively new venture in itself, and sometimes it can be hard to conceptualize. Hence, we’re going to take a closer look at how Archive-It renders its web archiving services. And yes, I’m sure a few inquiring minds out there want to know how in the world K-12 students are able to do this whole web archiving thing at such a young age. Don’t worry…we’ll talk about that too 🙂

What is a web archive?

Archive-It defines a web archive as a “collection of archived URLs grouped by theme, event, subject area, or web address.”[1] The goal of a web archive, however, isn’t just to make web content available. A web archive strives to recreate the same (or as close to the same) web experience a user would have gotten the day that site was archived. Basically, if you’re visiting an archived site, and it has the same background you remember, the right title, those fonts you used to love…yet when you scroll down you see a bunch of broken image icons everywhere, that’s not good.

What is Archive-It?

Archive-It is a web archiving service powered by the Internet Archive. It was established in 2006 with the goal of working alongside partner organizations to archive their specific web content. I know what you might be thinking… “Doesn’t the Internet Archive do all of this same web archiving already?” Well, yes and no. The Internet Archive does archive the web, just on a broad scale. It’s called the “General Archive.” It’s automated, free, and it takes one snapshot of the web roughly every two months, capturing about 3 billion web pages per snapshot.[2] The downside is that there is no guarantee that any specific web content will be preserved on a regular basis. Individual organizations don’t have control over when these snapshots take place, and the General Archive collection is not easily searchable. Archive-It, on the other hand, is a user-friendly subscription service that allows organizations or individuals to maintain control over when and how their web content is preserved. The record creator becomes the archivist, selecting sites for preservation, capturing them, and creating metadata to facilitate the future use of these sites.

How does Archive-It actually archive it?

Archive-It is a web-based application, meaning it does not require you to install any software. It uses open-source technology developed at the Internet Archive.[3] Here are the worker bees:

Heritrix: The web crawler that crawls and captures web pages

Umbra: Assists the crawler

Wayback: The access tool that renders sites and lets us view them

NutchWAX: Facilitates full-text searching

Solr: Facilitates metadata searches
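To make the Wayback piece a bit more concrete, here is a small Python sketch of how a replay address on the public Wayback Machine is put together: a 14-digit capture timestamp followed by the original URL. As far as I know, the Wayback Machine then serves the capture closest to that timestamp. The function name `wayback_url` and the example site are hypothetical, and Archive-It collections use their own Wayback addresses, which this sketch does not cover.

```python
from datetime import datetime

# Address prefix of the public Wayback Machine replay service.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, captured: datetime) -> str:
    """Build a replay URL from a capture time and the original address."""
    timestamp = captured.strftime("%Y%m%d%H%M%S")  # e.g. 20150618120000
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://example.com/", datetime(2015, 6, 18, 12, 0, 0)))
```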

As an Archive-It partner, what’s my job?

All Archive-It partners get a login account. Within this account, partners can manage how their collections are preserved and accessed. Once a partner is logged in, the home site will look something like this:

Archive-It home site pic

Within a login account, partners can…

  • Create a new collection
    • Choose a name for the collection
    • Choose the frequency of crawls
    • Write metadata
    • Select topics/categories
    • Add “seeds”—seeds are the starting point URLs for the crawler (the seeds determine which sites are “in scope” for your crawl)
    • Manage the extent of the crawl
  • View summaries and reports about archived collections and other data that has been captured
  • Run test crawls
  • Start crawls
  • Manage collections and seeds
    • Can change the previous settings chosen in “create a new collection”
  • Manage access settings and searches
    • Can make access public or private for entire account, certain collections, or even specific URLs, crawls, pages, or IP addresses
    • Can browse archived content 24 hours after capture is complete through the Wayback machine
    • Full-text search is available after 7 days
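The idea of seeds deciding what is “in scope” can be sketched in a few lines of Python. This is a deliberately simplified assumption on my part (same host, and a path that starts with the seed’s path), not Archive-It’s actual scoping rules, which are more nuanced. The function name `in_scope` and the example URLs are hypothetical.

```python
from typing import List
from urllib.parse import urlsplit


def in_scope(url: str, seeds: List[str]) -> bool:
    """Return True if url sits on a seed's host and under the seed's path."""
    target = urlsplit(url)
    for seed in seeds:
        start = urlsplit(seed)
        same_host = target.hostname == start.hostname
        # Simplification: a plain prefix check on the path.
        under_path = target.path.startswith(start.path)
        if same_host and under_path:
            return True
    return False


if __name__ == "__main__":
    seeds = ["http://example.org/news/"]
    print(in_scope("http://example.org/news/2015/story.html", seeds))
    print(in_scope("http://example.org/about/", seeds))
```

A crawler would run a check like this on every link it discovers before deciding whether to capture the page.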

Where is all of this data stored?

I’m glad you asked. Archive-It uses multiple methods of storage. When you create a collection through Archive-It, two copies of the archived data are stored at the San Francisco Data Center. Collections are then transferred to the General Archive as a third copy. You also have the option of obtaining a copy of the archived data on a hard drive, and you have the ability to download files from the Internet Archive server, aka the “PetaBox” storage system (shown below).[4]

IA petabox

Archive-It is also working with other digital preservation initiatives including the Stanford University-based program Lots of Copies Keep Stuff Safe (LOCKSS) and DuraCloud, a service of DuraSpace.

How on earth can K-12 students do web archiving? Isn’t it too complicated for them?

You’d be surprised. While most of the students involved in the program are in middle and high school, students as young as fifth graders have been involved in the Archive-It K-12 web archiving program and have succeeded wonderfully. Of course, their participation requires guidance from a teacher who has attended the necessary Archive-It training sessions and who is committed to helping them through the process. The students’ main responsibilities in the web archiving program include:

  • Deciding on themes or topics for collections
  • Evaluating and selecting websites they want to preserve
  • Working with their teacher to use the Archive-It web application and enter URLs for preservation
  • Writing descriptive metadata about their collections
  • Using the online access interface to review their crawls and see what worked and what didn’t
  • Completing a short survey[5]

It’s really amazing to see what these students can do: https://archive-it.org/explore?fc=organizationType%3Ak12ProjectSchools

What if I want more information?

In fact, you should want more information, because I just scratched the surface here. I highly recommend that you sign up for the free informational webinar that Archive-It hosts via WebEx every few weeks. I took the webinar twice as a part of my internship, and it was wonderful both times. That’s where I got most of the information I just shared with you. The next webinar is scheduled for July 14, 2015 at 11:30 a.m. Pacific Daylight Time. If you’re interested, visit this link: https://archive-it.org/contact-us.

I hope this helps a bit!

[1] Lori Donovan and Scott Reed, “Archive-It Archiving and Preserving Web Content,” (webinar presentation, June 2, 2015).

[2] Jefferson Bailey, “Educational Partnership Training,” January 24, 2015.

[3] Donovan and Reed, “Archive-It Archiving and Preserving Web Content,” 2015.

[4] Donovan and Reed, “Archive-It Archiving and Preserving Web Content,” 2015; “Petabox,” Internet Archive, accessed June 18, 2015, https://archive.org/web/petabox.php.

[5] Archive-It, K-12 Web Archiving Program, accessed June 18, 2015, http://aitlearnmore.archive.org/files/2014/07/k12_webarchiving_overview.pdf