Document Archiving Practices

Bogo Vatovec, Bovacon

Introduction

Discussing an archiving system is, well, somewhat old-fashioned. In the times of Web 2.0, social networks, paperless documentation, and documentation on demand, archiving simply doesn’t sound very interesting. Unfortunately, interesting or not, it is becoming more important and more complex than ever.

So why is archiving more important and complex than ever? The first reason is simply that there has never been more data produced and stored than now, and the amount is increasing. The second reason is that data is being stored in more formats and on more media than ever. The third reason is that for most new formats and media we don’t yet have data about their reliability over time. These reasons compound one another: because there are so many formats and media, and because we don’t know much about their long-term reliability, most critical data needs to be archived using several formats and media.

Before I go on, let me define what kind of archiving I will be writing about: archiving and storage of printed or electronic documents. In traditional archiving terminology, only storage of printed documents would be considered archiving; storage of electronic documents would be in a digital library or a digital archive. In this article, I’ll use archiving as a term to cover all possible usages and will differentiate between printed and electronic formats where needed.

My goal is not to cover all aspects of archiving and to go in depth in all areas. Instead, I aim to provide an overview and some practical experience from a government agency project we recently completed.

Starting to think about archiving procedures

The major challenge behind archiving is to know what, where, and how often to archive. Unfortunately, these questions are, in most organizations, not easy to answer.

A project to establish archiving procedures is similar to establishing knowledge management – they both require some tedious internal organizational and structuring work in data analysis. But the good news is that there is no need to have these tasks 100% completed. Just about any success rate is a success – it is better to archive 10% of all critical information than none at all.

One thing you will need to clarify soon is the difference between backing up and archiving. Here is a basic definition to help you further analyze your situation:

  • Data is backed up while still actively in use. Data in an electronic format may be backed up daily to another disk or tape. Data on paper may be copied and stored on a shelf or scanned in and stored electronically. Backed-up data may be overwritten depending on the backup policies; for example, a weekly backup may overwrite the last one. While you are still working on documentation, the working set should be backed up regularly.
  • Data is archived when not actively in use anymore. The data set is cleaned up of temporary versions, and data that should be archived is stored in a safe place. Typically, this work is done at the end of the project or after a major release.
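To make the distinction concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the procedure from our project: the function names, the slot and label names, and the "*.tmp" and "*.bak" patterns for temporary versions are all hypothetical. A backup slot may be overwritten; an archive is cleaned up and then never overwritten.

```python
import shutil
from pathlib import Path


def back_up(work_dir: Path, backup_root: Path, slot: str) -> Path:
    """Copy the active working set into a named slot, replacing the last copy."""
    target = backup_root / slot
    if target.exists():
        shutil.rmtree(target)  # e.g. this week's backup overwrites last week's
    shutil.copytree(work_dir, target)
    return target


def archive(work_dir: Path, archive_root: Path, label: str,
            temp_patterns=("*.tmp", "*.bak")) -> Path:
    """Store a cleaned-up copy under a new label; refuse to overwrite."""
    target = archive_root / label
    if target.exists():
        raise FileExistsError(f"archive {label!r} already exists")
    # Temporary versions (hypothetical patterns) are left out of the archive.
    shutil.copytree(work_dir, target,
                    ignore=shutil.ignore_patterns(*temp_patterns))
    return target
```

You would call back_up() regularly during the project with the same slot name, and archive() once at the end of the project or after a major release with a new label.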

This definition may seem obvious and trivial, but I recommend defining the difference clearly at the beginning. Everybody will have his or her own opinion, and the different definitions will cause great misunderstandings later in the project. Still, in practice the lines are often blurred.

As with almost any project, the archiving project consists of the following phases:

  1. Define requirements for the archiving system/procedures.
  2. Design the system and procedures.
  3. Implement the system and procedures.
  4. Test the system and procedures.
  5. Roll-out.

A project like this one is almost bound to be done iteratively.

Define the requirements for the archiving system

The majority of time and effort will go into the requirements phase. Depending on the scope of the project and the size of the company, this phase can easily take several months. I recommend breaking this phase down into smaller chunks so that you can show measurable progress and practical results.

What should you know and have at the end of this phase?

  • Have a data classification system and cluster data accordingly.
  • Know the archiving requirements for each data cluster.
  • Know the capabilities or requirements for the supporting IT-infrastructure.

Before you start analyzing and clustering data, look for the following information in your organization:

  • Are there backup policies and procedures in place?
  • Are there already some archiving policies and procedures in place?
  • Are there some known legal or functional requirements for data archiving?
  • Does your company already have some sort of a data and documentation classification system?
  • What is the existing infrastructure that is used to back up and archive data now?

This information will help you in the next steps.

During the data clustering activity, consider data as discrete data objects (document, electronic file, piece of paper) with various properties. Some knowledge and experience with information architecture, object-oriented modeling, metadata, and XML will be helpful, but there is no need to be an XML expert.
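If it helps to think in code, a data object with properties can be sketched as follows. This is a hypothetical model, not the classification system from our project: the class name, the property names, and the cluster names are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataObject:
    """One discrete data object; all property names here are illustrative."""
    name: str
    kind: str                      # "document", "electronic file", "piece of paper"
    formats: tuple = ()            # formats the object exists in
    cluster: str = "unclassified"  # tag assigned during clustering


def cluster_objects(objects):
    """Group object names by their cluster tag."""
    clusters = defaultdict(list)
    for obj in objects:
        clusters[obj.cluster].append(obj.name)
    return dict(clusters)
```

For example, two paper contracts tagged "legal" and one XML manual tagged "product documentation" end up in two clusters, and each cluster can then get its own archiving requirements.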

Start looking at the data objects and ask the following questions:

  • When you restore data from the archive, in which format should it be? Electronic? Paper? Microfilm?
  • How often will the archive have to be accessed?
  • What is the consequence if the archived data is lost due to medium failures or external catastrophes?
  • What are typical formats for this data object? For example, a document may be printed but the original is in an electronic form.
  • Is there a legal requirement to keep the data object for a specific time and/or in a specific format (electronic, printed)?
  • Is there a need to archive electronic documents in a paper form or paper documents in an electronic form (scanned)?
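The answers to these questions can be recorded per data cluster. The sketch below shows one possible way to do that; the field names and the three-level loss impact scale are my assumptions, not a standard. The only logic in it is the calculation of the date until which an object must be kept.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivingRequirement:
    """Answers for one data cluster; all field names are illustrative."""
    cluster: str
    restore_format: str                    # "electronic", "paper", "microfilm"
    accesses_per_year: int
    loss_impact: str                       # assumed scale: "low", "medium", "high"
    retention_years: Optional[int] = None  # None means no legal requirement
    required_format: Optional[str] = None  # None means the format is free

    def keep_until(self, archived_on: date) -> Optional[date]:
        """Return the last day of the retention period, if there is one."""
        if self.retention_years is None:
            return None
        target_year = archived_on.year + self.retention_years
        try:
            return archived_on.replace(year=target_year)
        except ValueError:
            # 29 February archived, target year is not a leap year.
            return archived_on.replace(year=target_year, day=28)
```

Keeping retention_years and required_format as separate fields makes it visible when a format is a real requirement and when only the retention period is.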

Do not take all the answers for granted. For example, in our project a requirement was to keep some documents in printed form for 15 years. After looking closely, we realized that this was not a requirement but an old solution for a requirement that the information be kept for 15 years – the paper format was added at a time when no other solution was available. We were able to change the requirement and reduce the costs significantly.

Design the system and procedures

Based on the gathered requirements you need to make the following design decisions:

  • The format in which data objects will be stored. If they are to be stored in a printed form, in which format and on which paper? If electronic, which electronic format do you choose? You should be especially careful with the electronic format. Many will not be available in 15 or 20 years. See http://www.digitalpreservation.gov/formats/ for more information.
  • On which medium will electronic data be stored? Here you need to be even more careful than with formats: a hardware failure can make data permanently unusable, and most CD-ROMs and DVDs don’t meet the longevity requirements for archiving. Consider always creating several archive copies or renewing the archive media every couple of years. I strongly advise you to contact an expert before making a final decision for critical data.
  • Define the risk-management policies and procedures. For mission-critical data objects, you might have to look into more sophisticated procedures. For example, we had to make both electronic and paper copies of such documents and store each version at two different locations in fireproof safes.
  • Define the archiving policies and procedures and differentiate them from the backup procedures. Clear policies and procedures will ensure data safety, but you also need to make sure they are followed later.
  • Define a visible data object classification system. It is important to know which data object should be handled in which way in the archiving process. Depending on the number of objects, you may need to consider tagging data objects for easier recognition.
  • Define storage for the archive media, and design a search and indexing mechanism for the archives. Depending on your data, archiving structure, and risk requirements, these designs may involve a huge amount of work, and an experienced archivist can be helpful.
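If you keep several archive copies on different media, you need a way to check that a copy is still complete and readable. One common technique, which I am adding here as an illustration and which was not necessarily part of our project, is a checksum manifest: record a SHA-256 checksum for every archived file once, then compare each copy against that record whenever you renew or inspect the media. The function names below are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict:
    """Map each file's relative path to its checksum."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def verify_copy(manifest: dict, copy_dir: Path) -> list:
    """Return the relative paths that are missing or altered in a copy."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = copy_dir / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel_path)
    return problems
```

An empty list from verify_copy() means the copy matches the manifest. The manifest itself should of course be stored separately from the media it describes.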

Note that with the recent increase in the amount of data, storage media have become significantly cheaper. Very often, it is cheaper simply to archive more data than to have data cleaned up and pre-selected for archiving by a human being. Full-text search engines can also often replace tedious indexing and archive structuring work. On the other hand, there is always some manual work involved in archiving – at the very least you need to be able to find all the media your data has been archived on.
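To show why full-text search can stand in for manual indexing, here is a toy inverted index. It is a deliberately simplified sketch with hypothetical function names; a real search engine adds ranking, stemming, and persistent storage. The index maps each word to the documents that contain it, and a query returns the documents that contain all query words.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Nobody had to decide in advance under which keywords a document should be filed; the index is built from the text itself.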

Where to look for more information

Although archiving is one of the essential procedures in every organization, there is not much recent literature about it. Here are some personal recommendations:

Managing Archives: Foundations, Principles and Practice
Caroline Williams
Chandos Publishing (Oxford) Limited, 2006
xvii + 248 pp., ISBN 1-84334-112-3

Web Archiving
Julien Masanès (Ed.)
2006, VII, 234 p., 28 illus., Hardcover
ISBN-10: 3-540-23338-5
ISBN-13: 978-3-540-23338-1

Blog: http://hurstassociates.blogspot.com/

The Library of Congress maintains a great website with very valuable information:
http://www.digitalpreservation.gov/

To learn more about various file formats for electronic archiving and their sustainability:
http://www.digitalpreservation.gov/formats/