
Library Hi Tech

Archiving in the networked world: authenticity and integrity


Michael Seadle

Article information:

Downloaded by Universiti Teknologi MARA At 11:28 03 November 2014 (PT)

To cite this document:


Michael Seadle, (2012),"Archiving in the networked world: authenticity and integrity", Library Hi Tech, Vol.
30 Iss 3 pp. 545 - 552
Permanent link to this document:
http://dx.doi.org/10.1108/07378831211266654
Downloaded on: 03 November 2014, At: 11:28 (PT)
References: this document contains references to 5 other documents.
To copy this document: permissions@emeraldinsight.com
The fulltext of this document has been downloaded 610 times since 2012*

*Related content and download information correct at time of download.

The current issue and full text archive of this journal is available at
www.emeraldinsight.com/0737-8831.htm

REGULAR PAPER

Archiving in the networked world: authenticity and integrity
Michael Seadle

Berlin School of Library and Information Science, Humboldt-Universität zu Berlin, Berlin, Germany

Received June 2012
Revised June 2012
Accepted June 2012

Abstract
Purpose: This article aims to discuss how concepts from the analog world apply to a purely digital
environment, and to look in particular at how authenticity needs to be viewed in the digital world in
order to make some form of validation possible.
Design/methodology/approach: The article describes authenticity and integrity in the analog
world and looks at how to measure them in a digital environment.
Findings: Authenticity in the digital world generally means, in a purely technical sense, that a
document's integrity has been checked using mathematical algorithms against other copies on
independently managed servers, and that provenance records show that the document has a clearly
established succession from a clearly defined original. Readers should recognize that this is different
than how one defines authenticity and integrity in the analog world.
Originality/value: Most of the key issues surrounding digital authenticity have not yet been tested,
but they will be when the economic value of an authentic digital work reaches the courts.
Keywords Archiving, Digital preservation, Digital documents, Electronic publishing,
Information technology, Preservation, Computer networks, Digital libraries
Paper type Research paper

Introduction
Authenticity and integrity are concepts that lie at the heart of long-term archiving,
whether digital or analog. Essentially every long-term archiving system claims to pay
attention to guaranteeing the authenticity of content. Portico, for example, lists
authenticity as one of the key goals of digital preservation on its web site:
Authenticity: the provenance of the content must be proven and the content an authentic
replica of the original (Portico, 2012).

While the goal of maintaining authenticity is clear, the steps for achieving the goal are
not, despite long years of discussion. This article does not pretend to a systematic
analysis of the literature in our field about authenticity, but begins with highlights
from a few key figures to show the progression of thought on the topic.
A number of authors refer to Clifford Lynch's (2000) report about authenticity as the
defining work. In fact, he is careful to point out the problems with our attempts at a
precise definition and why they tend to fail:
This distrust of the immaterial world of digital information has forced us to closely and
rigorously examine definitions of authenticity and integrity, definitions that we have
historically been rather glib about, using the requirements for verifiable proofs as a
benchmark. As this paper will demonstrate, authenticity and integrity, when held to this
standard, are elusive properties. It is much easier to devise abstract definitions than testable
ones. When we try to define integrity and authenticity with precision and rigor, the
definitions recurse into a wilderness of mirrors, of questions about trust and identity in the
networked information world (Lynch, 2000).

Seamus Ross (2002) recommends that the community "needs ... to investigate the role
trust plays in authenticity and integrity of digital objects." He goes on to raise serious
questions about the role of trust in establishing authenticity:

In many instances users and preservers establish authenticity on the grounds of trust in the
organization involved or technology used in the preservation of the digital object. The current
understanding of the major factors that drive trust decisions in the digital world, as well as
the risks involved with having and implementing this sort of trust is limited (Ross, 2002).

Instead of relying on trust, Reagan Moore and MacKenzie Smith (2007) link claims of
both integrity and authenticity to a verification process:
A trustworthy preservation environment is one in which all assertions on integrity and
authenticity have been verified within a specified time period. The verification process must
be repeated periodically. Preservation environments are live systems that require constant
appraisal and validation.

This is also the position that David Rosenthal (2011) takes:


Note that because each LOCKSS box collects content independently, and then audits the
content against the other LOCKSS boxes with the same content, we have the kind of
automated authenticity check discussed earlier by Howard Besser and Seamus Ross, albeit
only for static Web content (Rosenthal, 2011).

This article discusses how concepts from the analog world apply to a purely digital
environment, and looks in particular at how authenticity needs to be viewed in the
digital world in order to make some form of validation possible without resorting
either to trust or to Cliff Lynch's "wilderness of mirrors".
The meaning of authenticity
Authenticity in the physical world implies genuineness, and with physical objects a
sense that it is the actual original, but the concepts of authentic, genuine, and original
grow less clear the more closely an object is examined. Is, for example, a contemporary
printed copy of Charles Dickens's novel Oliver Twist authentic? Likely it contains
many of the original words, but some words and phrases from the initial publication
were corrected in later editions. The format and type-fonts would be different than the
original, which appeared in serial form in periodicals. Notes may have been added.
In one sense, the true authentic version of the novel might be the one that Dickens
wrote by hand. In another sense, the genuine original might be the one first made
available to the public. In a third sense, any subsequent edition is arguably authentic if
it faithfully reproduces the author's intent in so far as that is known. Even in the
physical world the concept of authenticity rapidly becomes open-ended once it is
divorced from a specific object: this work in this version at this time.
Mutability and provenance
Authenticity in the analog world relies, in part, on the relative immutability of physical
materials such as print on paper or paint on canvas. This immutability is only partially
reliable. The block of a book could be taken apart and individual pages replaced with
altered texts printed on old paper using old ink and a hand-operated press. This does
not happen in part because many copies of most important works exist, and because no
obvious market for faked or altered versions exists. The situation for paintings is
different because of the high market value of works by certain artists.
Authenticity in the physical world generally relies on the chain of provenance, for
example on evidence that a particular printed book passed from its creator (in this case
the publisher, not the author) to the vendor (a reliable bookstore or, in the case of
libraries, a company like Harrassowitz or Yankee) to the library itself. For most
contemporary library books this chain can usually be documented with little trouble.
Older and more unique materials in special collections departments may have less
clear provenance. The source of a valuable book picked up in a reputable used
bookstore may be traceable. The provenance of a book bought at an auction or from an
individual collector may be harder to establish. In fact, however, few librarians worry
about the authenticity of even their most valuable works. A strong assumption exists
that a printed book is what it claims to be and thus far there is little evidence to suggest
that this assumption is wrong.
The situation with medieval or ancient manuscripts is far more complex, because
provenance is much harder to establish. For many manuscripts no original exists and
paleographers must often struggle to reconstruct a virtual original by comparing
manuscripts. Copying by hand was prone to error and subject to willful changes on the
part of the person doing the copying.
Authenticity matters whenever there is a market or other (for example, intellectual)
value in having a clearly defined original. The situation for paintings shows that
clearly, because most paintings exist in only one physical original, which may have
such high market value that the effort to create imitations becomes worthwhile. The
market for undiscovered paintings by famous artists has a long history, and
temptation among collectors often overcomes their desire for reliable proofs of the
provenance; even proofs can, of course, also be faked. Nonetheless, an original
painting with a well-established provenance may not be authentic, in the sense that
time has aged its colors and well-meaning restorers may have introduced changes not
present in the original. In this case the physical object may be authentic, but perhaps
not the image as the painter saw or intended it.
Style has proven to be a particularly unreliable basis for authenticity judgments,
because style can be imitated. The general rule of thumb is that fakes become more
obvious over time, because subtle anachronisms that are culturally invisible at the
time of creation grow increasingly apparent. Provenance has been
more reliable, but the actual chain of provenance for older works can be lost for
legitimate reasons: wars, thefts, fires, and the like.
Authenticity problems with books tend to involve plagiarism or mass copying. An
example comes from the market in China for illegal copies of English-language
textbooks from US publishers. Sometimes the content is genuinely what the US edition
has, and the main problem is that the rights owner does not get paid. More often the
copies contain inaccuracies or are missing features. Sometimes the cover claims to be a
current edition, but the contents come from earlier versions. In these cases the
librarian's usual authenticity tests (title page, for example) fail. This problem is neither
new nor limited to Asia. In the nineteenth century US publishers notoriously copied
British (and other) best-selling works, often with inaccuracies that embarrassed the
original authors. Some of these works are in libraries today, though generally labeled
as fakes.
Digital authenticity
In the digital world, there are no originals, only copies, and the mutability of digital
objects makes authenticity especially challenging. The web pages that readers see on
their screens are not precisely what sits on the server. They are rather copies of code
sent in packets via the internet and rendered according to sets of rules by the browsers
on the client computers. Different browsers may render web pages differently and
different computer and screen types will almost certainly render colors differently than
the original, unless someone has taken the trouble to calibrate the colors. For
text-based works these minor variations probably matter little, but for full-color art the
issue is significant. Some variations in paper-based works may exist too, but generally
fewer.
This type of mutability is not even a matter of deliberate changes to content, but it
goes to the heart of the question: what is authentic? If an authentic original looks
different to different users because of their browsers and screens, what remains that is
measurable? The answer, from a computing perspective, is the code.
The idea of authenticity in the digital world is more closely related to integrity than
in the analog world. A digital work that exists measurably unchanged in multiple
independent copies possesses a form of integrity that can support its claim to
authenticity. A digital object whose integrity is lost through changes could
theoretically still be authentic in some meanings of the word in an analog
environment, but its authenticity becomes harder to prove.
Provenance is also a measure of authenticity in the digital world, but one that needs
to be judged carefully on the basis of controlled conditions. A digital object on a secure
server with a proven record of resistance to viruses and external attacks can make a
reasonable claim to authenticity when it comes (with appropriate measures for
integrity checking) from another secure server. A digital object sitting on a server that
had previously been hacked or infected with viruses could transfer an authentic copy, but
it could also transfer an infected or altered one. A problematic provenance raises
questions, as it would in the analog world, but an integrity test could allay doubts, as
long as a secure and genuine comparison copy could be found.
Authenticity in the digital world can mean an exact copy of a text or image. It could
equally well mean that text and image in the original context. An example of this
problem comes from a copyright case in which the defendant used an inline link to
display an authentic version of a Dilbert cartoon from the publisher's web site, but
deliberately provided a different context that he felt was more appropriate to the work.
Was this the equivalent of hanging an authentic painting in a different museum, or
more like altering a portion of a published work? Both answers could be reasonable in
the digital world, but the original publisher argued that the altered context created an
illegal and thus inauthentic copy (www.cs.rice.edu/~dwallach/dilbert/).
While no clear and established measures of authenticity exist for digital objects, a
reasonable argument can be made that digital objects have a claim to authenticity
when their integrity can be measured and can be shown to be the same as other digital
content on a secure server, for example that of the original publisher. The same need
for integrity should apply to digital proofs of provenance. It also seems reasonable to
claim that digital content cannot be considered authentic if its integrity is questionable.
Integrity is, however, also no simple concept. For this reason it is worth looking at
how integrity is judged, first in the analog and then in the digital world.

Integrity in the analog world


The integrity of a published book is easy to take for granted. Librarians tend to assume
that all of the books in their collections have uncompromised integrity unless they
discover that pages are missing or have been damaged beyond readability. Even then
the integrity loss is generally considered to be limited to the specifically damaged area.
Agreements with other libraries allow them to get copies of the missing or damaged
pages, and libraries generally have staff who know how to insert ("tip in") the new
pages so that the work is whole again.
In practice, these integrity checks occur only during use: a reader or a librarian
wants to read a work and finds pages missing. No systematic integrity checks take
place for works that sit unread in the stacks. If the threat to integrity were to come only
from external users, and not library staff or from environmental factors like insects or
mold, this lack of systematic checking might raise no problems. In fact staff do
sometimes steal valuable pictures or articles from works in the library, and insects and
mold are major problems in moist, warm climates, especially in libraries with limited
climate control. Problems with insects and mold can also occur in well-managed
Northern libraries, though they are rarer.
Larger but more infrequent disasters such as fire, water, and building collapse
damage the integrity of physical works. Fire is rarely as much of a problem as water,
since books do not burn readily, but their paper does quickly absorb water. Modern
quick freezing methods can recover damaged works if action is taken fast enough.
Some visible damage is still likely, but not generally so much that the integrity of an
analog work is considered to be compromised. A lost or missing volume is an integrity
problem for a series and for a collection. When a work is misshelved, it can be missing
for a long time and may be hard to find. In a large collection, misplaced books become a
significant problem. This is especially true in open stacks libraries where readers
sometimes deliberately put volumes in the wrong place in order to hide them for their
own use.
While these integrity problems are serious, other libraries generally duplicate the
holdings. Truly unique works (medieval manuscripts, for example) tend to be held in
special locations under relatively tight security. They are still vulnerable to major
disasters, such as fire, flood, or building collapse (such as occurred in Cologne some
years ago). More insidious is the damage to works where librarians assume plenty of
other copies exist. Early in JSTOR's history it discovered that some articles had been
stolen from so many libraries that it became difficult to find an undamaged copy.
Incautious assumptions about the long-term integrity of print-on-paper works may in
fact endanger them. Paper may sit undamaged on a shelf for hundreds of years, but
only if the right environmental and usage conditions are in place. Banning readers
from ever touching a book may be the most important way to protect the work, since
users do most of the intentional and unintentional damage, but that fits poorly with the
library mission of making information available.

Digital integrity
Integrity in the digital world receives considerable attention, since digital objects are in
their nature highly mutable and can change in unexpected ways due to conditions in
the storage medium that alter the bitstream. A bitstream is what defines a digital
object. It is the sequence of binary 0s and 1s that computers interpret to create images
of letters, numbers, and other formatting characteristics on a computer screen. A single
change in a bitstream may be as harmless as a printing error in a book. If an ASCII
hexadecimal 3A becomes hexadecimal 3B, the visual rendering only changes from ":"
to ";". Such changes may not always be harmless or meaningless, though. Bit rot
(bits changing due to environmental or media changes) can damage parts of a
bitstream in ways that make it impossible for normal programs to render the file
correctly on the computer screen. Such changes may appear to destroy a bitstream
completely (though modern tools may still be able to recover damaged files) or they
may just make it not work in a particular program, or they may only inflict minor,
hardly visible, damage.
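
The 3A to 3B example above can be reproduced in a few lines. The following is a minimal sketch in Python; the helper name flip_bit and the sample text are illustrative assumptions, not part of any archiving system discussed here. It inverts a single bit in a byte string and shows that the rendered text still looks plausible.

```python
def flip_bit(data: bytes, byte_index: int, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 is the least significant)."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit_index
    return bytes(damaged)


# Hypothetical sample text; byte 4 is ":" (hexadecimal 3A).
original = b"Note: see page 3"

# Inverting the lowest bit of 3A gives 3B, which renders as ";".
damaged = flip_bit(original, 4, 0)

print(original.decode("ascii"))
print(damaged.decode("ascii"))
```

A reader skimming the damaged text might never notice the change, which is why detection cannot rely on visual inspection alone.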
When a digital object loses its integrity, a sequence of problems arises. The simplest
is that the object has undergone some change that makes it different than before. This
may be as innocent as a marginal comment in a PDF document, the equivalent of a
reader writing a pencil note in a book and theoretically equally reversible. The change
could also represent some deliberate tampering to censor or change meanings. In some
contemporary societies this could be a real problem. There are also hackers who would
willingly alter digital versions of documents to disprove the facts of history: those in
Iran who claim the Holocaust never happened, for example. Digital objects with these
integrity problems remain readable.
Integrity loss may also make a digital object unreadable. Many librarians view this
as the most serious problem, though deliberate changes in meaning, if undetected,
are arguably more serious. Readability is not a simple binary state in which a file is or
is not readable, but a broad range of conditions and problems that are often solvable
with enough time and resources.
The simple solutions for integrity loss of the latter sort fall into two kinds. The first
is to look for a medium that promises stability over long periods such as decades, but
there is good reason to doubt that any physical medium is a good long-term carrier for
digital information and the concern about having reading devices for contemporary
storage devices in 100 years is real, though of course the devices could be reengineered
at need (cost permitting). A second kind of simple solution is to believe the assurance of
commercial firms that say their professional storage systems address all the key
integrity problems. The commercial firms do not necessarily lie when they give such
assurances, and commercial operations have considerable experience with digital
archiving. Nonetheless, the time perspective of commercial firms tends to be shorter
than that of librarians (five or ten years rather than 100) and their tolerance for error
greater. As long as the integrity problems remain within an acceptable margin, it is
generally not worthwhile for businesses to push for greater assurance.
The fact is that data left on any contemporary storage medium can rot, that is, can
lose integrity. Systematic and frequent checking against multiple copies can detect and
avoid loss by replacing damaged files, but this cannot effectively be done for offline
storage and it cannot be done without appropriate algorithms. Since proprietary
archiving systems tend not to disclose their back-end storage mechanisms in full,
librarians who choose them have no choice except to trust their assurances.
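
The idea of checking multiple copies against one another and replacing damaged files can be sketched as follows. This is only an illustration under stated assumptions: the function audit_and_repair, the server names, and the simple majority rule are hypothetical, and the sketch is not the protocol that LOCKSS or any commercial system actually uses.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 digest of a copy, as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Compare all copies by digest and overwrite any copy that disagrees
    with the strict majority. Returns the names of the repaired copies."""
    if not copies:
        raise ValueError("no copies to audit")
    digests = {name: digest(data) for name, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        # Without a clear majority the sketch cannot say which copy is intact.
        raise ValueError("no majority agreement; manual inspection needed")
    good = next(data for name, data in copies.items() if digests[name] == winner)
    repaired = []
    for name in copies:
        if digests[name] != winner:
            copies[name] = good
            repaired.append(name)
    return repaired


# Three hypothetical, independently managed servers; one copy has rotted.
copies = {
    "server_a": b"article text",
    "server_b": b"article text",
    "server_c": b"article tex\x00",
}
print(audit_and_repair(copies))
```

The sketch makes the dependence on online storage visible: a copy that cannot be read on demand cannot take part in the comparison.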
Testing standards, or at least insisting on detailed technical answers to questions
about integrity checking, could address this problem. It would also help to force
commercial firms to prove to what degree they can really guarantee integrity. At the
same time everyone should realize that no system can provide 100 percent assurance,
only a probability that content will be unchanged. Analog media, such as paper, cannot
provide 100 percent integrity guarantees over time: every object is liable to threats.
Digital measures of integrity
Digital integrity needs an appropriate measure. Generally systems use checksums or
hashes to give a reasonable approximation of whether two files are identical.
Checksums, as the name suggests, add up the values of the bytes or words in a file or part of
a file. The checksum of a file ought to be identical to that of its copy. Any change
indicates an integrity loss. The internet uses checksums to detect packet damage. Not
all checksum algorithms will necessarily detect a simple situation where two bits have
flipped, but most bit-rot problems and almost any deliberate alteration of the digital
object tend to create changes on a larger scale, making checksums an effective means
of integrity assurance.
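
A short sketch may make the two-flipped-bits limitation concrete. It assumes a deliberately simple additive checksum (the function additive_checksum below is an illustration, not the algorithm used by the internet protocols or by any particular archive) and compares it with a SHA-256 hash from the Python standard library.

```python
import hashlib


def additive_checksum(data: bytes) -> int:
    """Sum of all byte values, truncated to 16 bits (a deliberately simple checksum)."""
    return sum(data) & 0xFFFF


def sha256_hex(data: bytes) -> str:
    """Cryptographic hash of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


original = b"AB"  # hexadecimal 41 42
damaged = b"@C"   # hexadecimal 40 43: the lowest bit of each byte has flipped

# The two changes cancel out in the sum, so the simple checksum sees no difference.
print(additive_checksum(original) == additive_checksum(damaged))

# The hash values differ, so the damage is detected.
print(sha256_hex(original) == sha256_hex(damaged))
```

Stronger algorithms cost more computation per file, which is one reason a system might still use a cheap checksum for frequent checks and a hash for periodic audits.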
The major problem with using checksum and hash algorithms as integrity
measures is that they are either/or in character: a file either has integrity according to the
measure, or it does not. This means that integrity in the digital world has none of the
gray area uncertainties and flexibilities that are a common feature in judging analog
integrity, which offers advantages (clarity) but also problems (rejecting files with
minor problems).
Waiting to see whether a standard program can open a digital file is another means
of integrity checking at a crude level. Today virtually any word processing program
can open a pure ASCII file and would detect changes only if the ASCII encoding itself
changed into non-ASCII characters. In other words, deliberate additions or deletions
with correct encoding would go undetected, whereas a checksum would catch them. Minor
damage may also go undetected if a human did not scan the whole file on the screen.
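
The difference between the two kinds of check can be illustrated with a small sketch. The function opens_as_ascii is a hypothetical stand-in for "a standard program can open the file", and the sample sentences are invented for the example.

```python
import hashlib


def opens_as_ascii(data: bytes) -> bool:
    """Crude check: does the file decode as pure ASCII?"""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


original = b"The treaty was signed in 1648."
altered = b"The treaty was signed in 1684."      # deliberate change, still valid ASCII
corrupted = b"The treaty was signed in 16\xff8."  # one byte outside the ASCII range

for label, data in [("altered", altered), ("corrupted", corrupted)]:
    print(label,
          "opens:", opens_as_ascii(data),
          "hash matches:", sha256_hex(data) == sha256_hex(original))
```

Only the corrupted file fails to open, while the hash comparison flags both files, including the one whose meaning was changed on purpose.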
One of the considerations for new measures of integrity would be to separate the
equivalent of a penciled note in a margin from more substantial changes. This is
conceptually difficult, since it involves an element of human judgment about the
amount of the change. Potentially, a system could isolate the changed area and
compare other parts of a file to check integrity. Also potentially, an intelligent system
could attempt to judge the significance of a change to give a qualified measure of
integrity.
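
One way to isolate a changed area, offered here only as a sketch under stated assumptions, is to hash a file in fixed-size blocks and compare the blocks against a reference copy. The function names and the 64-byte block size are illustrative choices. The approach suits in-place changes such as bit rot; an insertion or deletion shifts every later block, so it would need a smarter alignment step that is not attempted here.

```python
import hashlib


def block_digests(data: bytes, block_size: int = 64) -> list[str]:
    """SHA-256 digest of each consecutive block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(reference: bytes, candidate: bytes, block_size: int = 64) -> list[int]:
    """Indexes of blocks that differ, or that exist in only one of the two files."""
    ref = block_digests(reference, block_size)
    cand = block_digests(candidate, block_size)
    longest = max(len(ref), len(cand))
    return [
        i for i in range(longest)
        if i >= len(ref) or i >= len(cand) or ref[i] != cand[i]
    ]


# Hypothetical 256-byte file with one damaged byte at position 130.
reference = bytes(range(256))
candidate = bytearray(reference)
candidate[130] ^= 0xFF

damaged = changed_blocks(reference, bytes(candidate))
print("damaged blocks:", damaged)
print("share of blocks intact:", 1 - len(damaged) / len(block_digests(reference)))
```

The share of intact blocks is a crude quantity, but it is the kind of calculable, non-binary measure that a qualified integrity judgment could start from.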
Conclusion
Authenticity in the digital world generally means, in a purely technical sense, that a
document's integrity has been checked using mathematical algorithms against other
copies on independently managed servers, and that provenance records show that the
document has a clearly established succession from a clearly defined original. Readers
should recognize that this is different than how one defines authenticity and integrity
in the analog world. Digital integrity is not therefore better or worse, but it is different
and the differences need to be understood.

Librarians, archivists, museum staff and others who deal regularly with
authenticity issues in the analog world may well want digital definitions of
authenticity and integrity to approximate analog practices more closely. This is not
realistic. The circumstances in the analog and digital worlds are simply different and
new authenticity standards and practices need to be established for digital content.
This does not mean that the digital definition of authenticity needs to remain
inflexibly binary: yes, authentic, no, not authentic. Shades of meaning are possible
where certain kinds of integrity flaws or possibly provenance losses occur, but these
conditions need a measurable (calculable) digital definition, where, for example, one
could explain time-gaps or compare copies by stripping back the layers of notes and
comments. Whether this is possible depends on the software and the nature of the
change.
It may also be that in the digital world shades of authenticity and integrity do not
matter to the extent that they do in the analog world, because a genuine document with
the digital equivalent of a pencil note simply becomes a new document that is
authentically itself, and not inauthentically something older. Most of the key issues
surrounding digital authenticity have not yet been tested, but they will be when the
economic value of an authentic digital work reaches the courts.
References
Lynch, C. (2000), "Authenticity and integrity in the digital environment: an exploratory analysis
of the central role of trust", Council on Library and Information Resources, Washington,
DC, available at: www.clir.org/pubs/reports/pub92/lynch.html (accessed May 28, 2012).
Moore, R. and Smith, M. (2007), "Automated validation of trusted digital repository assessment
criteria", Journal of Digital Information, Vol. 8 No. 2, available at:
http://dspace.mit.edu/bitstream/handle/1721.1/39091/Moore-Smith.htm?sequence=1
Portico (2012), "Preservation approach: digital preservation defined", available at:
www.portico.org/digital-preservation/services/preservation-approach
Rosenthal, D.S.H. (2011), "How few copies?", DSHR's Blog, available at:
http://blog.dshr.org/2011/03/how-few-copies.html (accessed May 28, 2012).
Ross, S. (2002), "Position paper on integrity and authenticity of digital cultural heritage objects",
DigiCULT: Integrity and Authenticity of Digital Cultural Heritage Objects, Vol. 1, available
at: www.digicult.info/downloads/thematic_issue_1_final.pdf
Corresponding author
Michael Seadle can be contacted at: seadle@ibi.hu-berlin.de
