The funny thing is that data deduplication storage systems work on sha-1 hashes ...

xorcist · on Oct 18, 2016

The risk that two blocks would hash identically by accident is astronomically small, but there is indeed some possibility that a skilled person might construct two blocks that do.

So as long as your data on disk hasn't evolved intelligence and actively tries to create two blocks with the same hash, your storage system is absolutely fine.

AgentME · on Oct 19, 2016

Well, and as long as you're not storing data from users and one uploads some data specifically constructed to collide.

tptacek · on Oct 18, 2016

Security systems must retain collision-resistance even in the presence of adversaries. If your design doesn't have adversaries, you can use cheaper, less secure hashes. Nobody would still recommend SHA-1, though.

khc · on Oct 19, 2016

That's not always true: Riverbed Whitewater (now Netapp AltaVault) uses sha256.

Source: I wrote that hash-based dedup code

sllabres · on Oct 19, 2016

Very interesting.

Can you tell why SHA256 was chosen? Was it because of the larger hash, or was it a requirement to have a cryptographic secure hash?

Do you know how the hash is used in the Netapp in regard to the remark MichaelMoser123 made: "(two blocks are considered equal if the hash is equal - most systems don't care to check for collisions" [...]

Is it (overoversimplifed ;):

  if sha256(a) == sha256(b)  {
    performDedup()
  [...]

Or is a compare added if the hashes found to be identical:

  if ( sha256(a) == sha256(b) and 
       byteCompare(a) == byteCompare(b) 
     ) {
    performDedup()
  [...]

khc · on Oct 26, 2016

sha256 was chosen over sha1 because of the larger hash space making collision much less likely. These systems can store 100s of TBs to PBs and at that number of hash blocks you have to start worrying about the birthday paradox.

radlad · on Oct 18, 2016

I think many systems use two hashing methods to verify this claim (e.g. SHA-1 and MD5.) Are there any known collisions that affect both algorithms?

yoha · on Oct 19, 2016

You are talking about making a combined hash function from MD5 and SHA-1. Keep in mind that it is not trivial. For instance, if you just output both SHA-1 and MD5, you are actually only as strong as the weakest hash function (trivially for preimage, but also true for collisions, see [1]).

If you are interested in hash function combiners, have a look at Robust Multi-Property Combiners for Hash Functions [2], a recent paper on the topic. It aims at getting several properties at the same time (preimage resistance, collision resistance). For simpler schemes with a single property, just explore the bibliography at the end of the paper.

[1] https://cryptopals.com/sets/7/challenges/52

[2] https://eprint.iacr.org/2016/723

xeeeeeeeeeeenu · on Oct 19, 2016

What you are saying is completely irrelevant, because we are talking about checksumming data, not hashing passwords.

AgentME · on Oct 19, 2016

Using two algorithms together is basically just creating a new hash algorithm (with a longer hash), which could also have collisions.

If you know a technique for making SHA-1 collisions and a technique for making MD5 collisions, it's probably possible to combine the techniques and make a collision for SHA-1-MD5.

geon · on Oct 19, 2016

AFAIK, no sha1 collissions have ever been observed.

mcpherrinm · on Oct 19, 2016

That probably won't be true in a year, though. That's why SHA1 deprecation is in full force.