How much storage do businesses need for their data and backups? Four of the biggest online storage companies (Google, Amazon, Microsoft, and Facebook) store at least 1,200 petabytes (PB), or 1.2 million terabytes (TB). Even smaller companies manage a remarkable amount of data.
The rising costs of data storage
According to the IDG Data and Analytics Survey, the average volume of data managed by company size is:
- Enterprise company: 350 TB of data
- Mid-level company: 160 TB of data
- Small business: 50 TB of data
Let’s translate that into actual cost. Companies today are paying more than ever for data storage. One TB of cloud storage runs approximately $21 per month from Amazon AWS, Google, and Microsoft Azure. Multiplying that rate by the average volume of data managed gives an estimated annual storage cost by company size (a quick back-of-the-envelope calculation follows the list below):
- Enterprise: $88,200
- Mid-level company: $40,320
- Small business: $12,600
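To make the math explicit, here is a minimal sketch of that calculation. The $21 per TB per month rate and the average volumes come from the estimates above; actual pricing varies by vendor, tier, and region.

```python
# Rough annual storage cost estimate, assuming the ~$21 per TB per month
# cloud rate cited above (actual pricing varies by vendor, tier, and region).
PRICE_PER_TB_MONTH = 21  # USD

AVERAGE_DATA_TB = {
    "Enterprise": 350,
    "Mid-level company": 160,
    "Small business": 50,
}

for company, terabytes in AVERAGE_DATA_TB.items():
    annual_cost = terabytes * PRICE_PER_TB_MONTH * 12
    print(f"{company}: ${annual_cost:,} per year")

# Enterprise: $88,200 per year
# Mid-level company: $40,320 per year
# Small business: $12,600 per year
```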
As shown, data storage cost is not negligible, regardless of the size of the company. Many companies also back up their data so that, if data is lost or corrupted, they can restore it immediately and continue business operations. That means paying for backup storage, which adds another 20% to 40% on top of the primary storage cost. Depending on the company, there may also be additional overhead for data management.
Related: Backup and Disaster Recovery Software Secure Business Operations →
Eventually, many companies realize the true cost of data storage and look for ways to reduce it. There are many options, such as compressing files or switching to cheaper vendors, but one of the best is data deduplication: technology that lets storage software delete duplicate data and reclaim the space.
In this article, we will explore what deduplication is and how it works.
What is deduplication?
Deduplication is the process of removing redundant data so extra copies of data will not take up space.
There are many deduplication methodologies, but in general, deduplication breaks data down into blocks and assigns a hash value to every block. Each time a new block of data comes in, the software checks whether its hash value matches that of a block already stored. If it does, the new block is replaced with an identifier that points to the existing block, so the same data is never saved twice in the same storage environment.
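To make the mechanics concrete, below is a minimal sketch of block-level deduplication. It is purely illustrative: the 4 KB block size, the SHA-256 hash, and the in-memory index are assumptions for the example, not how any particular product works.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes; real systems use fixed- or variable-size blocks

def deduplicate(data: bytes):
    """Split data into blocks and keep only one copy of each unique block."""
    store = {}       # hash -> the single stored copy of a block
    references = []  # ordered hashes that reconstruct the original data
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block    # new content: store the block once
        references.append(digest)    # always keep just a pointer to it
    return store, references

def restore(store, references) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[digest] for digest in references)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # several identical 4 KB blocks
store, refs = deduplicate(data)
print(len(data), "bytes in,", sum(len(b) for b in store.values()), "bytes stored")
assert restore(store, refs) == data              # nothing is lost
```

In this toy example, 16 KB of input is stored as 8 KB of unique blocks plus a small list of pointers; real-world savings depend entirely on how much duplicate data exists.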
Deduplication methods: what are they and how are they different?
- Post-processing deduplication is deduplication after storage.
For this method to work, the data must first be transferred across the network before deduplication. This requires high-capacity storage hardware and bandwidth because the data is transferred at its raw size. After the transfer, the software initiates the deduplication process and compresses the data afterward.
Post-processing deduplication helps when the client device has limited performance, since it doesn’t require much computing capacity on the client side; the data is deduped on the storage side instead.
- Inline processing deduplication is deduplication before storage.
The software completes the deduplication process before the data travels across the network to storage. This requires high computational power since deduplication happens on the client side. However, the reduced-size data consumes less storage and bandwidth, which usually outweighs the cost of the extra computation.
When there is limited disk capacity on the target device, inline processing is recommended because it dedupes and compresses data before sending it to the target storage.
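The practical difference between the two methods comes down to where the reduction work happens relative to the network transfer. The following toy comparison is a sketch under stated assumptions (the block size, the use of zlib for compression, and the sample payload are all illustrative); it only shows that inline deduplication sends far less data across the network, at the cost of client-side CPU, while post-processing ships everything raw and reduces it on the storage side.

```python
import zlib

def dedupe_and_compress(data: bytes, block_size: int = 4096) -> bytes:
    """Toy reduction step: drop repeated blocks, then compress what remains."""
    seen, unique = set(), []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        if block not in seen:
            seen.add(block)
            unique.append(block)
    return zlib.compress(b"".join(unique))

payload = b"backup payload " * 100_000   # highly redundant sample data

# Post-processing: the full raw payload crosses the network; the storage
# target dedupes and compresses after it arrives.
sent_post_process = len(payload)

# Inline: the client dedupes and compresses first, so much less data crosses
# the network and lands on disk, in exchange for client-side CPU time.
sent_inline = len(dedupe_and_compress(payload))

print(f"post-processing sends {sent_post_process:,} bytes")
print(f"inline sends          {sent_inline:,} bytes")
```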
How effective is data deduplication?
The effectiveness of deduplication depends on the ratio between the original size of data and its size after the redundancy is removed. Let’s look at two deduplication ratios:
- 100:1 - 100 GB of data require 1 GB of storage capacity, resulting in 99% space savings
- 2:1 - 2 GB of data require 1 GB of storage space, resulting in 50% space savings
The higher the ratio, the more redundant copies of the original data exist. In the first case, deduplication would be highly effective because it can remove a lot of redundant data. In the second case, it is less effective because there is less redundant data.
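The relationship between ratio and savings is simple arithmetic: the saved fraction is 1 minus 1 divided by the ratio. A quick sketch (the extra ratios are just illustrative values):

```python
def space_savings(ratio: float) -> float:
    """Fraction of space saved for a given deduplication ratio (original:stored)."""
    return 1 - 1 / ratio

for ratio in (100, 20, 10, 2):
    print(f"{ratio}:1 ratio -> {space_savings(ratio):.0%} space savings")

# 100:1 ratio -> 99% space savings
# 20:1 ratio -> 95% space savings
# 10:1 ratio -> 90% space savings
# 2:1 ratio -> 50% space savings
```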
A quick note on data compression
Compression is another popular storage optimization technique. It is an algorithmic process that shrinks the volume of data, for example by replacing a repeated sequence with the sequence and a count of how many times it appears in a row. While it saves space, the data must be decompressed before it can be used again.
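What the paragraph above describes is essentially run-length encoding, one of the simplest compression schemes. Here is a minimal sketch; real compressors (the RAR format mentioned below, for instance) use far more sophisticated algorithms.

```python
def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Replace each run of identical characters with a (character, count) pair."""
    encoded: list[tuple[str, int]] = []
    for char in text:
        if encoded and encoded[-1][0] == char:
            encoded[-1] = (char, encoded[-1][1] + 1)   # extend the current run
        else:
            encoded.append((char, 1))                  # start a new run
    return encoded

def run_length_decode(encoded: list[tuple[str, int]]) -> str:
    """Decompression: expand each (character, count) pair back into its run."""
    return "".join(char * count for char, count in encoded)

sample = "AAAAABBBCCCCCCCCAA"
encoded = run_length_encode(sample)
print(encoded)   # [('A', 5), ('B', 3), ('C', 8), ('A', 2)]
assert run_length_decode(encoded) == sample
```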
Both deduplication methods use compression, but the inline processing method benefits more since compressed data requires less network bandwidth to transfer than raw data. For example, a large application is usually compressed into a RAR file because a reduced-size file takes less time to download. Note that compression is a CPU-intensive activity, so an old or slow client device may stall or even crash during the process.
Data deduplication is the way forward
Deduplication technology can reduce storage and network costs by removing redundant data. Companies don’t have to invest in dedicated deduplication hardware, since many deduplication processes can run in the cloud or on a workstation. Software that includes deduplication usually comes with compression features as well, so users can save even more space.

Tian Lin
Tian is a research analyst at G2 for Cloud Infrastructure and IT Management software. He comes from a traditional market research background at other tech companies. Combining industry knowledge and G2 data, Tian guides customers through volatile technology markets based on their needs and goals.