regarding hashing

Forum

Forum
General C++ Programming
regarding hashing

Dear all,

I have a big set of strings, let us say S={str_1, str_2, ..., str_n}. I give this set to a number of network nodes, let's say N={N_1, N_2,..., N_k}. Each node has a hash function. I need the output of the hash functions of all the nodes for each string to be unique. For example, if I give str_1 as the input to the hash functions of N_1 and N_2, I want both of these hash functions to return the same value. How to guarantee this in C++?

Thanks and regards,
A

helios (17504)

I need the output of the hash functions of all the nodes for each string to be unique.

For example, if I give str_1 as the input to the hash functions of N_1 and N_2, I want both of these hash functions to return the same value.

These two statements are not equivalent.
Your first statement is requiring that given x and y, if x != y then hash(x) != hash(y).
Your second statement is requiring that if x == y then hash(x) == hash(y).

Your second statement is easy to satisfy. Basically any hash function will at least have that property.

Your first statement is impossible to satisfy. A hash function maps an infinite set (the set of all strings or arbitrary length) to a finite set (the set of all n-bit integers). By the pidgeonhole principle, necessarily some strings will hash to the same value.
The question is how unlikely you want hash collisions to be.

Allan89 (2)

Thanks a lot for your reply. Indeed, the second statement is important to me. Consider that I use std::hash for all the nodes. If at node_1, hash(str_1)=M, then at node_2 also hash(str_1) will be definitely equal to M?

Thanks.

helios (17504)

Only if you're distributing the exact same build to each machine, or you're using the same compiler on each machine. The behavior of std::hash is implementation-defined, so for example while GCC might use one hash algorithm to implement std::hash(), Clang might use a different algorithm, and MSVC might use yet another one. std::hash() is not really intended to shared the hash value across processes meaningfully. It's intended to implement hash lookup data structures.
It would probably be a good idea to use a different implementation.

Browni3141 (578)

Writing your own hashing function shouldn't be terribly hard. For strings of arbitrary length, the first thing that comes to mind is to generate a random 64-bit integer for each character in each position. The hash of the string is the result of xoring the generated hashes of each individual letter at each position.As long as you always use the same seed, random number engine and generation process you should get the same hashes every time.

This is just an example of one possible way to do it. Collisions are nearly impossible unless the set of keys is large, and unless the input strings are very long or you are performing a very large number of hashing operations it should be efficient enough. It may or may not suit your purposes and it may or may not be a reasonable way to make a hashing function. I haven't studied them but I'm sure there are other explanations of how to do it on the web.

If collisions are disastrous, there are ways to handle them. Oftentimes they are acceptable if they are very rare, but it depends on your circumstances.

Last edited on

jonnin (11329)

billions of ways to do it. my first cut at a hash function is to take the 8 most likely to be changed bytes in the input and drop them into a 64 bit int and use that to seed <random>. The generated random number (in your hash table range) is the hash value.

If that is troublesome (8 bytes insufficient being the usual suspect) I will use sha-1 on the input and do the same thing, seed random and pull off the answer.

These are easy to code but a bit inefficient for gigantic data. If you find it too slow, feel free to write a faster one; same ideas... find some bytes that you can key from and turn them into a number in range of the table. The above is just a lazy way to get an approximate uniform distribution with minimal brain cells used :P

I don't usually deal with collisions. I usually have known data (eg static lookup data) that I can hash pure without dealing with that. But the random thing is great there too, just get the next value in the sequence until you find an empty cell...

Last edited on

dhayden (5795)

Each node has a hash function.

Are you saying that each hash function is different? In other words, are they using different hashing algorithms?

If not different algorithms then what's the point of having several nodes?

If the functions are different and you want the output of each hash function to be the same, then... they aren't different functions! :)

Topic archived. No new replies allowed.