
Data De-Duplication

How effectively you manage important data is critical to your business success. This technology helps by eliminating redundant data.

Swapnil Arora

Tuesday, October 06, 2009


No matter what kind of business you are in, your data is bound to grow, which means continually adding storage space. Often, though, multiple copies of the same data occupy your storage pool. For instance, a presentation about your company's products might be stored by various users across several departments, wasting storage space. On top of this, every time you back up your primary storage device, those copies are duplicated again on the backup storage, whether it is tape or disk based. Data de-duplication technologies let you do away with this duplication.

Data de-duplication refers to the removal of redundant data. In the de-duplication process, a single copy of the data is kept, along with an index that maps it back to the original data, so that the data can be easily retrieved when required. Besides saving disk space and reducing hardware costs (storage hardware, cooling, backup media, etc.), another major benefit of data de-duplication is bandwidth optimization.
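The idea of one stored copy plus an index can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: it assumes whole files are the unit of de-duplication and that a SHA-256 hash of the content identifies a copy. The function name single_instance and the sample paths are made up for the example.

```python
import hashlib

def single_instance(files):
    """Keep one copy of each distinct content plus an index for retrieval.

    files: dict mapping a path to its content (bytes).
    Returns (store, index); both are hypothetical structures for this sketch.
    """
    store = {}   # content hash -> the single stored copy
    index = {}   # original path -> content hash
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)   # only the first copy is kept
        index[path] = digest
    return store, index

# The same presentation saved by two departments, plus one unrelated file.
files = {
    "sales/products.ppt": b"product deck v1",
    "marketing/products.ppt": b"product deck v1",
    "hr/policy.doc": b"leave policy",
}
store, index = single_instance(files)
copies_kept = len(store)
restored = store[index["marketing/products.ppt"]]
```

Three files go in, but only two copies are kept, and every original path can still be resolved through the index.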

Data de-duplication can be deployed in two ways: source based and target based. Source based de-duplication is done before the backup, i.e., at the primary storage, such as a NAS, while in the target based method de-duplication is done on the backup storage. Target based de-duplication can run after the backup has been written (post-process de-duplication) or during the backup (inline de-duplication). The benefit of inline over post-process de-duplication is that it requires less storage: duplicates are discarded before they are written, whereas post-process de-duplication must first store the full backup, duplicates included.
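The storage difference between inline and post-process de-duplication can be shown with a toy model. The sketch below assumes a backup is a list of equally sized blocks and simply counts how many blocks the backup target must hold at its peak and after de-duplication has finished. The function name storage_needed is hypothetical.

```python
import hashlib

def storage_needed(blocks, inline):
    """Return (peak_blocks, final_blocks) held on the backup target.

    A toy model: blocks are identified by their SHA-256 hash.
    """
    unique = {hashlib.sha256(block).digest() for block in blocks}
    if inline:
        # Duplicates are dropped as they arrive, so only unique blocks land.
        return len(unique), len(unique)
    # Post-process: the full backup lands first and is reduced afterwards.
    return len(blocks), len(unique)

backup = [b"A", b"B", b"A", b"A", b"C", b"B"]
inline_peak, inline_final = storage_needed(backup, inline=True)
post_peak, post_final = storage_needed(backup, inline=False)
```

Both approaches end with the same three unique blocks, but the post-process run needs room for all six blocks before it can shrink the data.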

Source based data de-duplication is usually deployed in environments such as file systems, remote branch offices and virtualization environments. In a remote backup scenario, source based de-duplication also means that less data travels through the WAN pipe, resulting in more effective bandwidth utilization. Target based de-duplication is a good option where bandwidth is not an issue, such as SAN or LAN backup environments.
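One way to picture the WAN saving is a source that first asks the backup target which blocks it already holds and then ships only the missing ones. The sketch below is a simplified assumption of such an exchange, not a real protocol: the BackupTarget class and its missing and put methods are hypothetical, and the small cost of sending the hashes themselves is ignored.

```python
import hashlib

class BackupTarget:
    """Hypothetical remote backup store, keyed by block hash."""
    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        # Tell the source which hashes the target has never seen.
        return [d for d in digests if d not in self.blocks]

    def put(self, digest, data):
        self.blocks[digest] = data

def source_side_backup(blocks, target):
    """Send only blocks the target lacks; return payload bytes sent."""
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    sent = 0
    for digest in target.missing(list(by_digest)):
        target.put(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent

target = BackupTarget()
monday = [b"x" * 1024, b"y" * 1024, b"x" * 1024]
tuesday = monday + [b"z" * 1024]
sent_monday = source_side_backup(monday, target)
sent_tuesday = source_side_backup(tuesday, target)
```

Monday's backup holds 3 KB of data but sends 2 KB, because one block is repeated. Tuesday's backup holds 4 KB but sends only the single new 1 KB block.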

How it works
Broadly, data de-duplication vendors use three techniques: file level, block level and byte level. File level de-duplication, also known as single instance storage (SIS), searches for identical files on the disk and keeps only one copy of each. Its biggest drawback is coarseness: a solution that identifies files by name and attributes rather than by content will not eliminate the same file stored under two different names, and a file that differs from another by even a small edit is stored again in full.

Block level de-duplication works at a more granular level than file level de-duplication. Here, data is broken down into blocks, which can be logical (variable length) or fixed length, and the de-duplication solution checks whether each block is unique (most solutions do this by calculating a hash of the block). When a unique block is stored, its identifier is added to the index. Whenever a repeated block comes along, a pointer to the existing block is placed in the index instead of storing the entire block again, thus saving storage space.

Block level de-duplication offers several advantages over file level de-duplication. The same file stored under two different names, which a file based solution may miss, is easily removed by block level de-duplication. Also, if only a part of a file is modified, only the modified part is stored as new data rather than the entire file.
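A minimal sketch of block level de-duplication follows. It assumes fixed length blocks and SHA-256 as the hash; the BlockStore class and the tiny 8-byte block size are chosen only to keep the example readable, and real products differ in how they chunk and index data.

```python
import hashlib

class BlockStore:
    """Toy block level de-duplicating store (fixed length blocks)."""

    def __init__(self, block_size=8):
        self.block_size = block_size
        self.blocks = {}   # block hash -> block bytes, stored once
        self.files = {}    # file name -> ordered list of block hashes

    def write(self, name, data):
        pointers = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block   # unique block: store it
            pointers.append(digest)           # repeated block: pointer only
        self.files[name] = pointers

    def read(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.blocks[d] for d in self.files[name])

store = BlockStore()
original = b"AAAAAAAA" + b"BBBBBBBB" + b"CCCCCCCC"
modified = b"AAAAAAAA" + b"XXXXXXXX" + b"CCCCCCCC"
store.write("report.doc", original)
store.write("report_copy.doc", original)   # same content, different name
store.write("report_v2.doc", modified)     # only the middle block changed
unique_blocks = len(store.blocks)
```

Nine blocks are written across the three files, but only four are stored: the renamed copy adds nothing, and the modified version adds just its one changed block. With fixed length blocks, inserting bytes near the start of a file shifts every later block, which is one reason some solutions use logical (variable length) blocks instead.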

Byte level data de-duplication is mostly used in post-processing scenarios. Here, new data is compared byte by byte with existing data and only the changes are stored, so byte level de-duplication can deliver accurate backups. Because byte-by-byte comparison is time consuming, this de-duplication is done after the backup, but before the data is finally written. The catch is that this requires extra disk space to hold the data while de-duplication is carried out.

For block level de-duplication to work effectively, data needs to be broken into very small chunks, mostly around 8 KB. The drawback is that the smaller the block size, the more entries there are in the hash table, and handling the table can itself become a challenge. In comparison, byte level de-duplication stores data in large segments, mostly around 100 MB.
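To get a feel for how chunk size drives index size, the arithmetic can be worked through for an illustrative 1 TB of data. The 1 TB figure is an assumption chosen for the example, and the count is a worst case in which every chunk is unique and needs its own entry.

```python
KB = 1024
MB = 1024 ** 2
TB = 1024 ** 4

def index_entries(data_bytes, chunk_bytes):
    """Worst-case number of index entries: one per chunk, rounded up."""
    return -(-data_bytes // chunk_bytes)   # ceiling division

entries_8kb = index_entries(1 * TB, 8 * KB)       # 134,217,728 entries
entries_100mb = index_entries(1 * TB, 100 * MB)   # 10,486 entries
```

At 8 KB per block the index may need to track over 134 million entries for each terabyte, against roughly ten thousand at 100 MB per segment.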

Benefits
In an enterprise, most redundant data comes from backing up the same data again and again. Depending on the environment and the type of de-duplication technology used, vendors claim that enterprises can achieve de-duplication ratios of 50:1 in a source based scenario and 20:1 in a target based scenario. However, before choosing a particular solution, it is important to find out which technologies suit your current environment and what your priorities are. For instance, are you looking to cut your WAN costs along with your storage costs? Also, target based de-duplication can come as an add-on to your existing data backup solution, whereas for source based de-duplication you might need to deploy the solution from scratch. Last but not least, data de-duplication is also considered a green technology, as a reduction in storage space also means lower power consumption and reduced carbon emissions.
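A de-duplication ratio translates into space saved as 1 - 1/ratio. The short sketch below applies that to the vendor-claimed ratios quoted above; the 10 TB of backup data is an assumed figure used purely for illustration.

```python
def space_saved(ratio):
    """Fraction of storage saved for a de-duplication ratio of ratio:1."""
    return 1 - 1 / ratio

def stored_size(logical_size, ratio):
    """Physical storage needed for logical_size of data at ratio:1."""
    return logical_size / ratio

saved_target = space_saved(20)    # about 0.95, i.e. 95% saved at 20:1
saved_source = space_saved(50)    # about 0.98, i.e. 98% saved at 50:1
on_disk_tb = stored_size(10, 20)  # 10 TB of backups need 0.5 TB at 20:1
```

Note how little the saving grows between 20:1 and 50:1: most of the benefit is already there at the lower ratio.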
