How to combine hash values using std::hash

I have a type with 2 int data members. I want to calculate a GOOD hash value and since std::hash exists I thought I would use it - after all it must be GOOD to be in the std.

However there is no hash_combine and no clues as to how to combine the results to get a hash value that is GOOD for my type.

So here is my code:

1
2
3
4
5
6
7
8
9
	template<> struct hash<Point>
	{
		std::size_t operator()(Point const& obj) const noexcept
		{
			std::size_t h1 = std::hash<int>{}(obj.x);
			std::size_t h2 = std::hash<int>{}(obj.y);
			return h1 | ~h2;
		}
	};


Does this produce a GOOD hash_value for my type Point?

Regards,
Juan
Rather than (x | ~y), do (x ^ y).
helios wrote:
Rather than (x | ~y), do (x ^ y).

... Because for two uniform random bits a and b, the value of a | b is true 75% of the time.

std::hash<int> is probably the identity function.
I tried it and it did not work - got many collisions (was not a GOOD hash value)!! I had many

hash(a) == hash(b) with a != b.

I got better results with | ~ but still want to use what is the best "joiner" in the industry!!

(cannot afford collisions)

the value of a | b is true 75%


I find that a | b is a bad hash function because it has many collisions!
however how do you calculate 75%?
for 2 bit values, a | b == 11 for the following inputs of a and b in binary can be:

a b
10 01, 11
01 10, 11
11 00,01,10,11
00 11


thats 9 combinations of (a,b) giving us the same value 11!

Terrible hash functions
How do you calculate 75%?

By examining the truth table:
| a | b | a|b |
| 0 | 0 |   0 |
| 0 | 1 |   1 |
| 1 | 0 |   1 |
| 1 | 1 |   1 |

The hash joiner function a → a ^ b is bijective. This is desirable because it preserves the information in both a and b.

Do you really need a more generic hash joiner or are you just hashing 2D points?

std::hash<int>::operator() returns std::size_t. If you can assume 4-byte int and 8-byte std::size_t just use the perfect hash
1
2
static_assert(sizeof(int) == 4 && sizeof(std::size_t) == 8);
std::size_t hash = (static_cast<std::size_t>(obj.x) << 32) | obj.y;
Last edited on
I need a perfect hash - if it isn't perfect then it won't work.
I require that hash(a) == hash(b) only when a == b

I am using a cache (a map) where the key is the hash value. Every time I get a point I get its hash and test for presence in the cache. If I have a new point with a hash value that has already been inserted in the cache, then this new point will not be added to the cache. I cannot have this behavior cause it looses Points!


I think the key would have to be the Point itself, seems thats the only perfect way to ensure all Points are taken into the cache and none is skipped.

Regretably, I cannot use the perfect hash from @mbozzi because sizeof(size_t) == 4 in my platform.

Am I right?

I need a perfect hash - if it isn't perfect then it won't work.
I require that hash(a) == hash(b) only when a == b
Perfect hashes have this property, but the hash function is pre-computed by seeding a PRNG such that for all i, j where 0 <= i < n and 0 <= j < n, (input[i] = input[j] <=> hash(input[i]) = hash(input[j])) and 0 <= hash(x) < n. The trade-off is that for values not contained in input, the hash function necessarily produces a collision, so a perfect hash function can't be used as an std::unordered_set replacement, because insertion takes O(n).

I think the key would have to be the Point itself, seems thats the only perfect way to ensure all Points are taken into the cache and none is skipped.
Then you don't want a hash table. You want an std::map. Otherwise you have to deal with a gigantic array that's mostly empty.
Last edited on
Just checking - you're aware of conventional collision-resolution methods like linear probing, right?

the hash function is pre-computed

e.g., by using a tool like gperf
https://www.gnu.org/software/gperf/
But only if you know your input ahead-of-time.

If not, since there there are 64 bits of Point, a perfect hash function must produce values at least 64 bits wide.
Last edited on

The trade-off is that for values not contained in input


what is input here? how is that function defined?
Last edited on

the hash function is pre-computed


how is this function created? Any samples?
perfect hashing is usually reserved for a static set of known data such that you can research it and ensure the result.
you can TRY using GUID type algorithms but your table cannot be large enough to use a full blown GUID (these tend to be 128 bits or something quite large, you can't make a 2^128 table on most normal computers).

and then you get into 'discrete math' gibberish that boils down to an obvious fact: the fewer # of slots in your table, the higher the probability of a collision. A good hash algorithm should be 'perfect' when a table is 4-5 times the size of the number of items being put into it, but you are in vegas on this (playing the odds without an assured result) and it won't work every time, eventually the input data will collide and you will have some sort of bug.

Bottom line is this... perfect hashing only works for a very small amount of data in the universe; if your data has a unique 'key' that can be converted to a unique array index in an array that is small enough to be useful on the current hardware, then you can do it. Eg if you have a tiny company and your customer IDS are from 0 to 1000, yes, you can hash that into an array of [1001] perfectly. And given modern computer power, you can even do that up to a few million without sweating too hard. When your table starts getting into the GB range (this is about 50 million slots assuming a 64 bit int key and a 64 bit pointer to data item make up the table) its starting to stretch a pretty good PC as of 'today' -- and a lot of users would be annoyed at one program burning up over a gb for more than a few moments.
Last edited on
Topic archived. No new replies allowed.