Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The funny thing is that data deduplication storage systems work on sha-1 hashes to identify identical data blocks (two blocks are considered equal if the hash is equal - most systems don't care to check for collisions); they kept telling that the likelyhood of collisions is so astronomically small [1] ; i wonder if this statement is still considered to be true.

[1] https://pibytes.wordpress.com/2013/02/09/deduplication-inter...



The risk that two blocks would hash identically by accident is astronomically small, but there is indeed some possibility that a skilled person might construct two blocks that do.

So as long as your data on disk hasn't evolved intelligence and actively tries to create two blocks with the same hash, your storage system is absolutely fine.


Well, and as long as you're not storing data from users and one uploads some data specifically constructed to collide.


Security systems must retain collision-resistance even in the presence of adversaries. If your design doesn't have adversaries, you can use cheaper, less secure hashes. Nobody would still recommend SHA-1, though.


That's not always true: Riverbed Whitewater (now Netapp AltaVault) uses sha256.

Source: I wrote that hash-based dedup code


Very interesting.

Can you tell why SHA256 was chosen? Was it because of the larger hash, or was it a requirement to have a cryptographic secure hash?

Do you know how the hash is used in the Netapp in regard to the remark MichaelMoser123 made: "(two blocks are considered equal if the hash is equal - most systems don't care to check for collisions" [...]

Is it (overoversimplifed ;):

  if sha256(a) == sha256(b)  {
    performDedup()
  [...]
Or is a compare added if the hashes found to be identical:

  if ( sha256(a) == sha256(b) and 
       byteCompare(a) == byteCompare(b) 
     ) {
    performDedup()
  [...]


sha256 was chosen over sha1 because of the larger hash space making collision much less likely. These systems can store 100s of TBs to PBs and at that number of hash blocks you have to start worrying about the birthday paradox.


I think many systems use two hashing methods to verify this claim (e.g. SHA-1 and MD5.) Are there any known collisions that affect both algorithms?


You are talking about making a combined hash function from MD5 and SHA-1. Keep in mind that it is not trivial. For instance, if you just output both SHA-1 and MD5, you are actually only as strong as the weakest hash function (trivially for preimage, but also true for collisions, see [1]).

If you are interested in hash function combiners, have a look at Robust Multi-Property Combiners for Hash Functions [2], a recent paper on the topic. It aims at getting several properties at the same time (preimage resistance, collision resistance). For simpler schemes with a single property, just explore the bibliography at the end of the paper.

[1] https://cryptopals.com/sets/7/challenges/52

[2] https://eprint.iacr.org/2016/723


What you are saying is completely irrelevant, because we are talking about checksumming data, not hashing passwords.


Using two algorithms together is basically just creating a new hash algorithm (with a longer hash), which could also have collisions.

If you know a technique for making SHA-1 collisions and a technique for making MD5 collisions, it's probably possible to combine the techniques and make a collision for SHA-1-MD5.


AFAIK, no sha1 collissions have ever been observed.


That probably won't be true in a year, though. That's why SHA1 deprecation is in full force.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: