Maybe God Does Play Dice - Hash Based Data Deduplication

Albert Einstein said of quantum mechanics: “I, at any rate, am convinced he does not throw dice” (he meaning god) often paraphrased as “God does not play dice.”This is probably a vision of one of the most profound scientific discontinuities of all time. Just as Newtonian physics and geometry was broken by Einstein's theory of relativity. Einstein's view of the predictable universe was shattered by quantum mechanics.

This is pretty heavy stuff for a computer technology essay, but there is a point. One of the more interesting things to happen in the data storage technology in the last few years is the technique of data block deduplication. This is a process by which computer files are analyzed for blocks, (fixed length quantities of data) that are the same as other blocks within the file or in other files. Then, as blocks are found to be duplicative, only a single block is stored with the other identical blocks referencing the first one. It seems pretty obvious, right?

In old fashioned data theory, block level deduplication was not practical. Why? Well, if your data block contained 16K (16384 bytes), then you really couldn't represent that much data in anything less than 16384 bytes. That block could be ANYTHING within the size. The amount of additional data required to represent duplicate blocks would be bigger than any sort of savings you could get. The problem is that you can't 100% guarantee all the possible contents of a block with any less data than the block can contain. Thus, any attempt to try to remove duplicate blocks invariably would fail or be so specialized having a priori knowledge of the data as not to be practical.

This is where, I believe, the quantum mechanical view of things is shaking up the old data theory. What if you could reasonably represent 16K of data with 32 bytes of data? Obviously, from a 1:1 numerical point of view, it can't be done, but it can be done on a probabilistic basis. If you could create a pseudo random number with a “high probability” of uniqueness that represented a block of data, then maybe it could be done.

Enter the SHA-2 series of hashing algorithms. Secure Hash Algorithm, or SHA, is a cryptographic “hash.” A hash is a smaller numerical representation of a much larger piece of data. This essay won't go into to detail about hashing, because if you know it, it is redundant, if you don't know what hasing is you could find a better description on Wikipedia. SHA is “cryptographic” because it is difficult or impossible to reverse the hash and generate source data and it is unlikely that any two data sources will produce the same hash value. These very qualities lend it well do the problem of deduplication.

To fully understand this, lets discuss some numbers. A SHA-2 256 hash is a 32 byte pseudo-random number. A 32 byte number has 256 bits. That means that there are 2^256 possible hash values. (115792089237316195423570985008687907853269984665640564039457584007913129639936) With that many possible hash values, there is a 1 in 2^256 chance of duplicate.

So, if we assume the SHA-2 hash is as random as it claims to be, and there is good proof it is fairly good. If we assume a finite and limited amount of source data, then we can reasonably conclude that it is very unlikely that any two differing blocks encountered on a system will produce the same hash.

There is a lot of math to verify or disprove this, of course. One of the more interesting reads is the “birthday paradox.” (Look it up on google) The “birthday paradox” is a statistical treatment of the likelihood that in any random set of people that two would have the same birthday. It is paradoxical because the chance of a set of 23 people, with no twins, has a 50% chance of having two people with the same birthday seems absurd. The model is directly relatable to the statistical block level deduplication. Again, there are better treatments of this elsewhere and you are encouraged to look.

This leads us to the quote “God does not play dice.” Well, here we need to make a few mental leaps. The old school physicists believed that the universe would eventually prove to be 100% predictable. The quantum mechanical physicists came to realize that uncertainty is part of the universe. (google: schrodinger's cat)

Like the old school physicists, the old school data theory is just as wrong. Yes, it is conceded that block deduplication using a hash algorithm has a chance, no matter how small, that there will be an error in the data, but it “probably” won't. Here's where we get into the philosophical aspect of this. Conceptually, old school data guys (like myself, admittedly), don't think in probability. We think in predictability. When we design data systems, we design for reliability and we design for predictability. We would never intentionally design a system that “probably” works. Yet, as it turns out, that is all anyone can do.

Everything is a probability. No matter how reliable and predictable we design a system to be, there is always a probability of failure. Hardware fails, so we have redundancy. Two pieces of hardware can fail simultaneously or close to simultaneously, the risk is low, but it is still a risk. So, triple redundancy? What about quadruple redundancy? What if a meteor hits the building? What about earthquakes? We make backups. We have off site storage for backups.

We are so smug and so sure we'll never lose a byte of data. Yet, it still happens. Machines crash. Off site backups are corrupted. The probability seems astronomical that these things will happen at the same time, but people win the lottery on a fairly regular basis. It happens no matter how well designed things seem to be. Just look at what happened at Amazon towards the end of April 2011, inconceivably service was disrupted and data was lost despite all their precautions.

Normally, we isolate the extremely unlikely events and ignore them. We think they don't really impact our designs. That if they happen, that really is a different class of problem, and not “our” problem. Should, for instance, the web site design include provisions for meteors or earthquakes? Regardless of how it happens, the result is the same. The service experiences an outage and/or data loss.

If we accept that every data storage design has some probability of failure, how much probability of failure are we willing to risk? Back to bock level deduplication, when you run the numbers, the probability of failure (2 or more different blocks having the same hash value) is extremely low. Lower than hardware failure, probably lower than meteor strike.

The odds of getting killed by lightning: 2,320,000 to 1
The odds of a meteor landing on your house: 182,138,880,000,000 to 1