Collision free hash

closed account (D4S8vCM9)
Hi,

I have following method to calculate a 16bit hash on a per definition unique string:

1
2
3
uint16_t IReferenceHolder::getHash(const std::string &ref) {
	return std::inner_product(ref.begin(), ref.end(), ref.begin(), (uint16_t)0) % 65437;
}


Actually it calculates an inner product on each character adding it to the last result (see http://www.cplusplus.com/reference/numeric/inner_product/).

65437 is the nearest prime to 2^16. In fact I never need the full range of 16 bit, therefore 65437 different hashes are enough.

Empirically it seems - on unique strings - to be collision free. But if not, can somebody construct me an example where not.

NOTE the strings have an maximum size of 255.
Last edited on
There is no such thing as a collision free hash.

This is the classic problem of trying to fit too many things into a fixed number of slots. You cannot represent every possible string with just a single 16-bit integer.

The absolute best case scenario is 2^16 unique strings before you have a collision.

To reproduce a collision, just hash (2^16)+1 strings. At least 2 of them will collide (though you're likely to have a collision much sooner):

1
2
3
4
5
6
for(int i = 0; i < 0x10001; ++i)
{
   stringstream ss;
   ss << i;
   hash( ss.str() );
}




EDIT:

Also, given the simplicity of your algorithm, I can tell that "ab" and "ba" will collide. So there's a test case for you.
Last edited on
closed account (D4S8vCM9)
Ups, you are true :-[ 2 Byte string already start to collide.
You're also true with the fact "There is no such thing as a collision free hash.", I was just too lax in describing my problem, wanted to say "almost" collision free.

The absolute best case scenario is 2^16 unique strings before you have a collision.

This condition is ensured. But:

In your example the pitfall is the commutativity of the plus and multiply operator. That pitfall I encountered in my first try, but I openly admit, I didn't used that small strings to test.

But minus falls out because it can result in negative numbers and division can result in real numbers.

Any ideas how I can get at least "less" collisions?
closed account (D4S8vCM9)
Took my Segdewick book again and found another simple, but better algorithm:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
unsigned int RSHash(const std::string& str)
{
   unsigned int b    = 378551;
   unsigned int a    = 63689;
   unsigned int hash = 0;

   for(std::size_t i = 0; i < str.length(); i++)
   {
      hash = hash * a + str[i];
      a    = a * b;
   }

   return hash;
}


Run for AB and BA:

#1: hash = 0 * 63689 + 65 = 65
a = 24109534639

#2: hash = 65 * 24109534639 + 66 = 1567119751601
a = 24109534639

#1: hash = 0 * 63689 + 66 = 66
a = 24109534639

#2: hash = 66 * 24109534639 + 65 = 1591229286239
a = 24109534639


Just should take internally 64 bit in order to avoid integer overflows, but else it seems more stable but yet simple.


EDIT: Implemented it as following:

Declaration in ireferenceholder.h:
1
2
3
4
5
6
typedef struct hashCalculator : std::binary_function<uint64_t, uint64_t, uint16_t> {
	inline hashCalculator() : a(63689u), ab(63689u * 378551u) {}
	uint64_t operator()(uint64_t, uint16_t);
	uint64_t a;
	const uint64_t ab;
} HASHFUNC;

Implementation in ireferenceholder.cpp:
1
2
3
4
5
6
7
8
9
uint64_t IReferenceHolder::hashCalculator::operator()(uint64_t hash, uint16_t chr) {
	uint64_t h = hash * a + chr;
	a = ab;
	return h;
}

uint16_t IReferenceHolder::getHash(const std::string &ref) {
	return std::accumulate(ref.begin(), ref.end(), (uint64_t)0, HASHFUNC())  % 65437u;
}
Last edited on
Topic archived. No new replies allowed.