
Data deduplication is a technique for reducing the amount of storage space an organization needs to save its data.

In most
organizations, the storage systems contain duplicate copies of many pieces of data. For example, the same file may be
saved in several different places by different users, or two or more files that aren't identical may still include much of the
same data. Deduplication eliminates these extra copies by saving just one copy of the data and replacing the other copies
with pointers that lead back to the original copy. Companies frequently use deduplication in backup and disaster recovery
applications, but it can be used to free up space in primary storage as well.
In its simplest form, deduplication takes place on the file level; that is, it eliminates duplicate copies of the same file. This kind
of deduplication is sometimes called file-level deduplication or single instance storage (SIS). Deduplication can also take
place on the block level, eliminating duplicated blocks of data that occur in non-identical files. Block-level deduplication frees
up more space than SIS, and a particular type known as variable block or variable length deduplication has become very
popular. Often the phrase "data deduplication" is used as a synonym for block-level or variable length deduplication.
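
To make the file-level case concrete, here is a minimal Python sketch (purely illustrative, not how any particular product is built): it hashes whole files and keeps a single stored copy per unique hash, recording every other occurrence as a reference back to that copy.

import hashlib

def dedupe_files(paths):
    # File-level (single instance storage) sketch: keep one stored copy per
    # unique whole-file hash; every other path becomes a pointer to that copy.
    store = {}       # file hash -> path of the single stored copy
    references = {}  # original path -> hash of the copy it points to
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest not in store:
            store[digest] = path
        references[path] = digest
    return store, references

Block-level deduplication applies the same idea to pieces of files rather than whole files, which is why it finds redundancy that file-level deduplication misses.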


Deduplication Benefits
The primary benefit of data deduplication is that it reduces the amount of disk or tape that organizations need to buy, which in
turn reduces costs. NetApp reports that in some cases, deduplication can reduce storage requirements up to 95 percent, but
the type of data you're trying to deduplicate and the amount of file sharing your organization does will influence your own
deduplication ratio. While deduplication can be applied to data stored on tape, the relatively high costs of disk storage make
deduplication a very popular option for disk-based systems. Eliminating extra copies of data saves money not only on direct
disk hardware costs, but also on related costs, like electricity, cooling, maintenance, floor space, etc.
Deduplication can also reduce the amount of network bandwidth required for backup processes, and in some cases, it can
speed up the backup and recovery process.

Deduplication vs. Compression


Deduplication is sometimes confused with compression, another technique for reducing storage requirements. While
deduplication eliminates redundant data, compression uses algorithms to save data more concisely. Some compression is
lossless, meaning that no data is lost in the process, but "lossy" compression, which is frequently used with audio and video
files, actually deletes some of the less-important data included in a file in order to save space. By contrast, deduplication only
eliminates extra copies of data; none of the original data is lost. Also, compression doesn't get rid of duplicated data -- the
storage system could still contain multiple copies of compressed files.
Deduplication often has a larger impact on backup file size than compression. In a typical enterprise backup situation,
compression may reduce backup size by a ratio of 2:1 or 3:1, while deduplication can reduce backup size by up to 25:1,
depending on how much duplicate data is in the systems. Often enterprises utilize deduplication and compression together in
order to maximize their savings.
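
As a rough illustration of how the two techniques combine (a simplified Python sketch with arbitrary choices of chunk size and compressor, not a description of any vendor's product), duplicate chunks are removed first and only the unique chunks are then compressed:

import hashlib
import zlib

def dedupe_then_compress(data, chunk_size=4096):
    # Store each unique fixed-size chunk once, compressed; record the sequence
    # of chunk hashes needed to rebuild the original data.
    stored = {}   # chunk hash -> compressed unique chunk
    recipe = []   # ordered chunk hashes describing the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in stored:
            stored[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return stored, recipe

def rebuild(stored, recipe):
    # Reverse the process: decompress each referenced chunk in order.
    return b"".join(zlib.decompress(stored[digest]) for digest in recipe)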

Dedupe Implementation
The process for implementing data deduplication technology varies widely depending on the type of product and the vendor.
For example, if deduplication technology is included in a backup appliance or storage solution, the implementation process
will be much different than for standalone deduplication software.

In general, deduplication technology can be deployed in one of two basic ways: at the source or at the target. In source
deduplication, data copies are eliminated in primary storage before the data is sent to the backup system. The advantage of
source deduplication is that it reduces the bandwidth requirements and time necessary for backing up data. On the
downside, source deduplication consumes more processor resources, and it can be difficult to integrate with existing systems
and applications.
By contrast, target deduplication takes place within the backup system and is often much easier to deploy. Target
deduplication comes in two types: in-line or post-process. In-line deduplication takes place before the backup copy is written
to disk or tape. The benefit of in-line deduplication is that it requires less storage space than post-process deduplication, but
it can slow down the backup process. Post-process deduplication takes place after the backup has been written, so it
requires that organizations have a great deal of storage space available for the original backup. However, post-process
deduplication is usually faster than in-line deduplication.

Deduplication Vendors
Many enterprise storage and backup solution vendors include data deduplication technology in their products, and some
companies also offer standalone deduplication software. Industry leaders include the following:
Barracuda Networks

CA Technologies
CommVault
Data Ladder
Dell
EMC
Exagrid
FalconStor
HP
IBM
NetApp
Quantum
Sepaton
Open source data deduplication projects include the following:
BlackHole

lessfs
Opendedup

Deduplication Technology
Data deduplication is a highly proprietary technology. Deduplication methods vary widely from vendor to vendor, and many of
those methods are patented. For example, Microsoft has a patent on single instance storage. In addition, Quantum owns
a patent on variable length deduplication. Many other vendors also own patents related to deduplication technology.
Next Steps
http://www.enterprisestorageforum.com/backup-recovery/8-deduplication-products-you-must-check-out.html
http://www.enterprisestorageforum.com/backup-recovery/more-deduplication-tools-to-check-out.html
http://www.infostor.com/storage-management/data-de-duplication/5-dedupe-products-worth-a-second-look.html
http://www.eweek.com/c/a/Data-Storage/How-to-Implement-a-Successful-Data-Deduplication-Strategy/

http://www.enterprisestorageforum.com/backup-recovery/taneja-group-research-deduplication-and-the-innovation-race-it-isntover.html

Data deduplication (often called "intelligent compression" or "single-instance
storage") is a method of reducing storage needs by eliminating redundant data.
Only one unique instance of the data is actually retained on storage media, such
as disk or tape. Redundant data is replaced with a pointer to the unique data copy.
For example, a typical email system might contain 100 instances of the same one
megabyte (MB) file attachment. If the email platform is backed up or archived, all
100 instances are saved, requiring 100 MB storage space. With data
deduplication, only one instance of the attachment is actually stored; each
subsequent instance is just referenced back to the one saved copy. In this
example, a 100 MB storage demand could be reduced to only one MB.
Data deduplication offers other benefits. Lower storage space requirements will
save money on disk expenditures. The more efficient use of disk space also allows
for longer disk retention periods, which provides better recovery time objectives
(RTO) for a longer time and reduces the need for tape backups. Data
deduplication also reduces the data that must be sent across a WAN for remote
backups, replication, and disaster recovery.
Data deduplication can generally operate at the file or block level. File
deduplication eliminates duplicate files (as in the example above), but this is not a
very efficient means of deduplication. Block deduplication looks within a file and
saves unique iterations of each block. Each chunk of data is processed using a
hash algorithm such as MD5 or SHA-1. This process generates a unique number
for each piece which is then stored in an index. If a file is updated, only the
changed data is saved. That is, if only a few bytes of a document or presentation
are changed, only the changed blocks are saved; the changes don't constitute an
entirely new file. This behavior makes block deduplication far more efficient.
However, block deduplication takes more processing power and uses a much
larger index to track the individual pieces.
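
The following Python fragment is a simplified sketch of that mechanism (fixed-size blocks and an in-memory index, chosen here only for illustration): each file is recorded as an ordered list of block hashes against a shared index, so a second version of a file that changes only a few bytes adds only the changed blocks.

import hashlib

BLOCK_SIZE = 4096
block_index = {}   # block hash -> the single stored copy of that block

def store_version(data):
    # Record a file as an ordered list of block hashes; blocks already in the
    # index are not stored again.
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()   # SHA-1, as mentioned above
        block_index.setdefault(digest, block)
        recipe.append(digest)
    return recipe

v1 = store_version(b"A" * 20000)
v2 = store_version(b"A" * 19999 + b"B")   # only the final, changed block is added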
Hash collisions are a potential problem with deduplication. When a piece of data
receives a hash number, that number is then compared with the index of other
existing hash numbers. If that hash number is already in the index, the piece of
data is considered a duplicate and does not need to be stored again. Otherwise
the new hash number is added to the index and the new data is stored. In rare
cases, the hash algorithm may produce the same hash number for two different
chunks of data. When a hash collision occurs, the system won't store the new data
because it sees that its hash number already exists in the index. This is called a
false positive, and can result in data loss. Some vendors combine hash algorithms
to reduce the possibility of a hash collision. Some vendors are also examining
metadata to identify data and prevent collisions.
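
A common safeguard, sketched below in Python (a simplified illustration, not a specific vendor's method), is to treat a matching hash only as a candidate and confirm the match byte for byte, falling back to a second hash when the contents actually differ:

import hashlib

index = {}   # key -> stored chunk bytes

def store_chunk(chunk):
    # A hash match alone could be a collision (a false positive), so the bytes
    # are compared before the new chunk is discarded as a duplicate.
    key = hashlib.sha1(chunk).hexdigest()
    existing = index.get(key)
    if existing is not None:
        if existing == chunk:
            return key                   # true duplicate: nothing new to store
        # Collision: different data with the same SHA-1; combine with a second
        # hash so the new chunk is kept under a distinct key.
        key = key + ":" + hashlib.sha256(chunk).hexdigest()
    index[key] = chunk
    return key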

In actual practice, data deduplication is often used in conjunction with other forms
of data reduction such as conventional compression and delta differencing. Taken
together, these three techniques can be very effective at optimizing the use of
storage space.

Data deduplication
From Wikipedia, the free encyclopedia

In computing, data deduplication is a specialized data compression technique
for eliminating duplicate copies of repeating data. Related and somewhat
synonymous terms are intelligent (data) compression and single-instance
(data) storage. This technique is used to improve storage utilization and can
also be applied to network data transfers to reduce the number of bytes that must
be sent. In the deduplication process, unique chunks of data, or byte patterns, are
identified and stored during a process of analysis. As the analysis continues, other
chunks are compared to the stored copy and whenever a match occurs, the
redundant chunk is replaced with a small reference that points to the stored
chunk. Given that the same byte pattern may occur dozens, hundreds, or even
thousands of times (the match frequency is dependent on the chunk size), the
amount of data that must be stored or transferred can be greatly reduced.[1]
This type of deduplication is different from that performed by standard file-compression tools, such as LZ77 and LZ78. Whereas these tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections, such as entire files or large sections of files, that are identical, in order to store only one copy of them. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.


Benefits
Storage-based data deduplication reduces the amount of storage needed for a
given set of files. It is most effective in applications where many copies of very
similar or even identical data are stored on a single disk, a surprisingly common
scenario. In the case of data backups, which routinely are performed to protect
against data loss, most data in a given backup remain unchanged from the
previous backup. Common backup systems try to exploit this by omitting (or hard
linking) files that haven't changed or storing differences between files. Neither
approach captures all redundancies, however. Hard-linking does not help with
large files that have only changed in small ways, such as an email database;
storing differences only finds redundancies in adjacent versions of a single file (consider a
section that was deleted and later added in again, or a logo image included in
many documents).
Network data deduplication is used to reduce the number of bytes that must be
transferred between endpoints, which can reduce the amount of bandwidth
required. See WAN optimization for more information.
Virtual servers benefit from deduplication because it allows nominally separate
system files for each virtual server to be coalesced into a single storage space. At
the same time, if a given server customizes a file, deduplication will not change
the files on the other servers, something that alternatives like hard links or
shared disks do not offer. Backing up or making duplicate copies of virtual
environments is similarly improved.

Deduplication overview
Deduplication may occur "in-line", as data is flowing, or "post-process" after it has
been written.

Post-process deduplication

With post-process deduplication, new data is first stored on the storage device
and then a process at a later time will analyze the data looking for duplication.
The benefit is that there is no need to wait for the hash calculations and lookup to
be completed before storing the data thereby ensuring that store performance is
not degraded. Implementations offering policy-based operation can give users the
ability to defer optimization on "active" files, or to process files based on type and
location. One potential drawback is that you may unnecessarily store duplicate
data for a short time, which is an issue if the storage system is near full capacity.
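
A very rough Python sketch of that idea follows (illustrative only; real products operate on block devices and keep persistent indexes rather than in-memory dictionaries): the data is written untouched, and a later pass reduces duplicate blocks to references.

import hashlib

def post_process_dedupe(volume):
    # volume: dict of block number -> block bytes, already written in full.
    # A later pass keeps one copy of each unique block and maps every block
    # number to the hash of its content, so later copies become references.
    unique_blocks = {}   # block hash -> block bytes (first copy kept)
    block_map = {}       # block number -> hash of that block's content
    for number, block in volume.items():
        digest = hashlib.sha256(block).hexdigest()
        unique_blocks.setdefault(digest, block)
        block_map[number] = digest
    return unique_blocks, block_map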

In-line deduplication
This is the process where the deduplication hash calculations are created on the target device as the data enters the device in real time. If the device spots a block that it already stored on the system, it does not store the new block; it just references the existing block. The benefit of in-line deduplication over post-process deduplication is that it requires less storage, as data is not duplicated. On the negative side, it is frequently argued that because hash calculations and lookups take so long, data ingestion can be slower, thereby reducing the backup throughput of the device. However, certain vendors with in-line deduplication have demonstrated equipment with performance similar to that of their post-process deduplication counterparts.
Post-process and in-line deduplication methods are often heavily debated.[2][3]

Source versus target deduplication


Another way to think about data deduplication is by where it occurs. When the
deduplication occurs close to where data is created, it is often referred to as
"source deduplication," whereas when it occurs near where the data is stored, it is
commonly called "target deduplication."
Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system.[4][5] The file system will periodically scan new files, creating hashes, and compare them to the hashes of existing files. When files with the same hashes are found, the file copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered to be separate entities; if one of the duplicated files is later modified, then a system called copy-on-write creates a copy of that file or changed block. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur, resulting in the backups being bigger than the source data.
Target deduplication is the process of removing duplicates of data in the
secondary store. Generally this will be a backup store such as a data repository or
a virtual tape library.

Deduplication methods

One of the most common forms of data deduplication implementations works by
comparing chunks of data to detect duplicates. For that to happen, each chunk of
data is assigned an identification, calculated by the software, typically using
cryptographic hash functions. In many implementations, the assumption is made
that if the identification is identical, the data is identical, even though this cannot
be true in all cases due to the pigeonhole principle; other implementations do not
assume that two blocks of data with the same identifier are identical, but actually
verify that data with the same identification is identical.[6] If the software either
assumes that a given identification already exists in the deduplication namespace
or actually verifies the identity of the two blocks of data, depending on the
implementation, then it will replace that duplicate chunk with a link.
Once the data has been deduplicated, upon read back of the file, wherever a link
is found, the system simply replaces that link with the referenced data chunk. The
deduplication process is intended to be transparent to end users and applications.
Chunking. Between commercial deduplication implementations, technology varies
primarily in chunking method and in architecture. In some systems, chunks are
defined by physical layer constraints (e.g. 4KB block size in WAFL). In some
systems only complete files are compared, which is called Single Instance
Storage or SIS. The most intelligent (but CPU-intensive) method of chunking is
generally considered to be sliding-block. In sliding block, a window is passed along
the file stream to seek out more naturally occurring internal file boundaries.
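
A toy Python version of the sliding-window idea appears below (a sketch only; production systems use true rolling hashes such as Rabin fingerprints, and the window, divisor, and size limits here are arbitrary assumptions): a hash of the trailing window is evaluated at every byte offset, and a chunk boundary is declared wherever that hash satisfies a fixed condition, so boundaries follow the content rather than fixed offsets and an insertion early in a file does not shift every later boundary.

import hashlib

def content_defined_chunks(data, window=48, divisor=1 << 12,
                           min_size=2048, max_size=65536):
    # Recompute a hash of the trailing window at each offset (slow, but it
    # shows the principle); cut a chunk where the hash value meets the
    # boundary condition or the chunk grows too large.
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        tail = data[max(i - window + 1, 0):i + 1]
        fingerprint = int.from_bytes(hashlib.sha1(tail).digest()[:8], "big")
        if size >= max_size or (size >= min_size and fingerprint % divisor == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks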
Client backup deduplication. This is the process where the deduplication hash
calculations are initially created on the source (client) machines. Files that have
identical hashes to files already in the target device are not sent; the target device just creates appropriate internal links to reference the duplicated data. The benefit of this is that it avoids data being unnecessarily sent across the network, thereby reducing traffic load.
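
A minimal Python sketch of that exchange follows (the class and method names are made up for illustration; real products batch the lookups, authenticate clients, and persist the index): the client sends only chunk hashes first, the target answers with the hashes it is missing, and only those chunks cross the network.

import hashlib

class BackupTarget:
    # Stand-in for the target device's hash index and chunk store.
    def __init__(self):
        self.store = {}                      # chunk hash -> chunk bytes

    def missing(self, hashes):
        return [h for h in hashes if h not in self.store]

    def upload(self, chunks_by_hash):
        self.store.update(chunks_by_hash)

def client_backup(chunks, target):
    # Hash locally, ask which chunks the target lacks, and send only those.
    by_hash = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    needed = target.missing(list(by_hash))
    target.upload({h: by_hash[h] for h in needed})
    return list(by_hash)                     # recipe kept to restore this backup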
Primary storage and secondary storage. By definition, primary storage systems
are designed for optimal performance, rather than lowest possible cost. The
design criterion for these systems is to increase performance, at the expense of
other considerations. Moreover, primary storage systems are much less tolerant
of any operation that can negatively impact performance. Also by definition,
secondary storage systems contain primarily duplicate, or secondary copies of
data. These copies of data are typically not used for actual production operations
and as a result are more tolerant of some performance degradation, in exchange
for increased efficiency.
To date, data deduplication has predominantly been used with secondary storage
systems. The reasons for this are two-fold. First, data deduplication requires
overhead to discover and remove the duplicate data. In primary storage systems,
this overhead may impact performance. The second reason why deduplication is applied to secondary data is that secondary data tends to have more duplicate data. Backup applications in particular commonly generate significant portions of duplicate data over time.
Data deduplication has been deployed successfully with primary storage in some cases where the system design does not require significant overhead or impact performance.

Drawbacks and concerns


Whenever data is transformed, concerns arise about potential loss of data. By
definition, data deduplication systems store data differently from how it was
written. As a result, users are concerned with the integrity of their data. The
various methods of deduplicating data all employ slightly different techniques.
However, the integrity of the data will ultimately depend upon the design of the deduplicating system and the quality of the implementation of the algorithms. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.
One method for deduplicating data relies on the use of cryptographic hash
functions to identify duplicate segments of data. If two different pieces of
information generate the same hash value, this is known as a collision. The
probability of a collision depends upon the hash function used, and although the
probabilities are small, they are always non-zero. Thus, the concern arises
that data corruption can occur if a hash collision occurs, and additional means of
verification are not used to verify whether there is a difference in data, or not.
Both in-line and post-process architectures may offer bit-for-bit validation of
original data for guaranteed data integrity.[7] The hash functions used include
standards such as SHA-1, SHA-256 and others. These provide a far lower
probability of data loss than the risk of an undetected and uncorrected hardware
error in most cases and can be on the order of 10⁻⁴⁹% per petabyte (1,000
terabytes) of data.[8]
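
The order of magnitude can be sanity-checked with the standard birthday bound (a back-of-the-envelope estimate under an assumed chunk size, not a figure from any vendor). With $n$ stored chunks and a $b$-bit hash, the probability of any collision is approximately

\[ P(\text{collision}) \approx \frac{n(n-1)}{2^{b+1}} \approx \frac{n^2}{2^{b+1}}. \]

For one petabyte split into 4 KiB chunks, $n \approx 2.5 \times 10^{11} \approx 2^{38}$; with SHA-256 ($b = 256$) this gives

\[ P \approx \frac{2^{76}}{2^{257}} = 2^{-181} \approx 3 \times 10^{-55}, \]

far below the rate of undetected hardware errors, consistent with the claim above.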
The computational resource intensity of the process can be a drawback of data
deduplication. However, this is rarely an issue for stand-alone devices or
appliances, as the computation is completely offloaded from other systems. This
can be an issue when the deduplication is embedded within devices providing
other services. To improve performance, many systems utilize weak and strong
hashes. Weak hashes are much faster to calculate but there is a greater risk of a
hash collision. Systems that utilize weak hashes will subsequently calculate a
strong hash and will use it as the determining factor as to whether it is actually the
same data or not. Note that the system overhead associated with calculating and
looking up hash values is primarily a function of the deduplication workflow. The
reconstitution of files does not require this processing and any incremental
performance penalty associated with re-assembly of data chunks is unlikely to
impact application performance.
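
A small Python sketch of that two-tier scheme follows (illustrative only; the choice of adler32 as the weak hash and SHA-256 as the strong hash is an assumption, not a statement about any product): the cheap hash groups candidate matches, and the strong hash is the deciding factor.

import hashlib
import zlib

index = {}   # weak hash -> {strong hash -> stored chunk}

def is_duplicate(chunk):
    # The weak hash groups candidate matches cheaply; the strong hash is the
    # deciding factor for whether a candidate really is the same data.
    weak = zlib.adler32(chunk)
    strong = hashlib.sha256(chunk).hexdigest()
    candidates = index.setdefault(weak, {})
    if strong in candidates:
        return True                  # confirmed duplicate: keep a reference only
    candidates[strong] = chunk       # new data (or a weak-hash collision): store it
    return False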
Another area of concern with deduplication is the related effect
on snapshots, backup, and archival, especially where deduplication is applied
against primary storage (for example, inside a NAS filer). Reading files out of a storage device causes full reconstitution of the
files, so any secondary copy of the data set is likely to be larger than the primary
copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the

post-deduplication snapshot will preserve the entire original file. This means that
although storage capacity for primary file copies will shrink, capacity required for
snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although
deduplication is a version of compression, it works in tension with traditional
compression. Deduplication achieves better efficiency against smaller data
chunks, whereas compression achieves better efficiency against larger chunks.
The goal of encryption is to eliminate any discernible patterns in the data. Thus
encrypted data cannot be deduplicated, even though the underlying data may be
redundant. Deduplication ultimately reduces redundancy. If this is not expected and planned for, it may ruin the underlying reliability of the system. (Compare
this, for example, to the LOCKSS storage architecture that achieves reliability
through multiple copies of data.)
Scaling has also been a challenge for deduplication systems because ideally, the
scope of deduplication needs to be shared across storage devices. If there are
multiple disk backup devices in an infrastructure with discrete deduplication, then
space efficiency is adversely affected. A deduplication shared across devices
preserves space efficiency, but is technically challenging from a reliability and
performance perspective.[citation needed]
Although not a shortcoming of data deduplication, there have been data breaches
when insufficient security and access validation procedures are used with large
repositories of deduplicated data. In some systems, as typical with cloud storage,
an attacker can retrieve data owned by others by knowing or guessing the hash
value of the desired data.[9]
