Fastest associative array in C++ (Trie)

Hi guys,
I have developed one of the fastest associative arrays in the world, based on Trie structures. It's quite a serious implementation: more than 8000 lines of C++.

It's much faster than std::map and in many cases even faster than std::unordered_map.
For example, inserting 10M random keys (16 bytes each) takes about 1.3 seconds, while std::map spends about 14 seconds on the same test.

The codebase with benchmarks is available here:
https://github.com/Bazist/HArray

My algorithm also has the following advantages over hash tables:
- Prefix compression
- Ordered keys, with the possibility to create custom orderings
- Scanning ranges and sub-ranges of keys (see the std::map sketch below for what this means)
- Very fast serialization/deserialization to disk
- Proper key deletion with smooth release of memory
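
To illustrate what scanning a range of ordered keys means, here is a sketch using std::map (for the concept only, not HArray's API):

#include <cstdio>
#include <map>
#include <string>

int main()
{
    std::map<std::string, int> m{
        {"apple", 1}, {"banana", 2}, {"cherry", 3}, {"plum", 4}};

    // Visit every key in ["b", "d") in order -- a query an unordered
    // hash table cannot answer without scanning the whole table.
    for (auto it = m.lower_bound("b"); it != m.lower_bound("d"); ++it)
        std::printf("%s -> %d\n", it->first.c_str(), it->second);
}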

Best Regards,
Viacheslav
I find it hard to believe that a trie could be so much faster than a BST during insertion. Could you redo your tests using the following insertion method?
bool comparison_function(const std::string *a, const std::string *b){
    return *a < *b;
}

// The comparator template argument must be a type, so pass the function's
// type and hand the function itself to the map's constructor.
std::map<std::string *, value_t,
         bool (*)(const std::string *, const std::string *)> map(comparison_function);
std::vector<std::string> allocation_vector;
//populate allocation_vector

auto t0 = get_time();
for (auto &s : allocation_vector)
    map[&s] = /*some value*/;
auto t1 = get_time();

auto insertion_time = t1 - t0;
Thanks for the reply.
Hmm, I'm not sure that code snippet changes anything significant compared with my test code.
I have similar code in the benchmark; please have a look:

void testStdMapStr(std::string* keys, uint32 countKeys)
{
	#ifdef STD_MAP_TESTS

	printf("std::map => ");

	std::map<std::string, uint32> mymap;

	clock_t start, finish;

	//INSERT ===========================================

	start = msclock();

	for (uint32 i = 0; i < countKeys; i++)
	{
		const char* str = keys[i].c_str();
		mymap[keys[i]] = str[0];
	}

	finish = msclock();

	printf("Insert: %d msec, ", (finish - start));

	totalMapTime += (finish - start);

	//SEARCH ===========================================
	start = msclock();

	for (uint32 i = 0; i < countKeys; i++)
	{
		const char* str = keys[i].c_str();
		if (mymap[keys[i]] != str[0])
		{
			printf("Error\n");
			break;
		}
	}

	finish = msclock();

	printf("Search: %d msec.\n", (finish - start));

	totalMapTime += (finish - start);

	//ha.print();

	#endif
}


Did you get different results?
I can modify my test code if needed.
Yeah, I saw the benchmarking code. It's just that I'm not sure how this:
 
mymap[keys[i]] = str[0];
compares to this:
 
ha.insert((uint32*)str, STR_KEY_LEN, str[0]);
For one, assuming the string is not duplicated, the std::map version requires at least two allocations (one for the node and one for the string). Tries usually store characters directly in their nodes, so you're performing at least one fewer allocation.
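
To make the difference concrete, here is a sketch of the kind of node layout a trie can use (hypothetical, not HArray's actual layout):

#include <cstdint>

// Hypothetical trie node that stores the key character inline. Inserting a
// key allocates at most one node per new character and never a separate
// heap buffer for the key itself.
struct TrieNode
{
    char      ch;       // character consumed at this node
    uint32_t  value;    // payload for keys that terminate here
    TrieNode *child;    // first child (next character of the key)
    TrieNode *sibling;  // next alternative at the same depth
};

// std::map<std::string, uint32_t>::operator[] instead allocates a tree node
// and, for keys beyond the small-string buffer, a second heap block for the
// string copy: at least two allocations per insert of a 16-byte key.
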
Yes, bit-wise tries are known to be fast.

> Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register dependencies and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black tree for example performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and TLB stalls on modern CPUs which makes that algorithm bound by memory latency rather than CPU speed. In comparison, a bitwise trie rarely accesses memory, and when it does, it does so only to read, thus avoiding SMP cache coherency overhead. Hence, it is increasingly becoming the algorithm of choice for code that performs many rapid insertions and deletions, such as memory allocators (e.g., recent versions of the famous Doug Lea's allocator (dlmalloc) and its descendents).
https://en.wikipedia.org/wiki/Trie#Bitwise_tries
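
For concreteness, here is a minimal bitwise trie over uint32_t keys (my own sketch of the idea in the quote, not HArray's algorithm):

#include <cstdint>
#include <memory>

// Each level consumes one bit of the key, highest bit first; real
// implementations consume several bits per level to shorten the chain.
struct BitNode
{
    std::unique_ptr<BitNode> child[2]; // child[b] follows a key bit of value b
    bool     hasValue = false;
    uint32_t value    = 0;
};

void insert(BitNode& root, uint32_t key, uint32_t value)
{
    BitNode* node = &root;
    for (int bit = 31; bit >= 0; --bit)
    {
        unsigned b = (key >> bit) & 1u;
        if (!node->child[b])
            node->child[b].reset(new BitNode);
        node = node->child[b].get();
    }
    node->hasValue = true;
    node->value = value;
}

const uint32_t* find(const BitNode& root, uint32_t key)
{
    const BitNode* node = &root;
    for (int bit = 31; bit >= 0; --bit)
    {
        node = node->child[(key >> bit) & 1u].get();
        if (!node)
            return nullptr;
    }
    return node->hasValue ? &node->value : nullptr;
}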

However, bit-wise tries impose a crippling limitation on the object types which can be used as keys: they must be TriviallyCopyable types with the additional requirement that all bits of the object representation must participate in the value representation.

Though this is usually true for the standard arithmetic types except bool, the IS guarantees this only for narrow character types.
> For narrow character types, all bits of the object representation participate in the value representation. ... For unsigned narrow character types, each possible bit pattern of the value representation represents a distinct number. These requirements do not hold for other types.


On a cursory perusal, this appears to be a good implementation; perhaps you may want to explicitly specify the requirements on the key in the documentation.
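
For instance, a sketch of how the verifiable part of those requirements could be asserted at compile time (the TypedKeyCheck wrapper here is hypothetical, not part of HArray):

#include <type_traits>

// Hypothetical typed wrapper over the raw-buffer key interface.
template <typename Key>
struct TypedKeyCheck
{
    static_assert(std::is_trivially_copyable<Key>::value,
                  "trie keys must be trivially copyable");
    // That all bits of the object representation participate in the value
    // representation cannot be checked by the compiler; it has to be
    // stated as a documented precondition.
};
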
Which comes down to a general truth: the STL is super generic and awesome for pretty much everything, but if you need speed, hand-rolled code can still beat it if you are willing to take the time to write high-performance code. It usually takes large data to notice this, but when you need something 30 or 60 times a second and the behind-the-scenes memory dithering and babysitting that the STL does takes 5 seconds... sometimes you have to do it yourself.


> Though this is usually true for the standard arithmetic types except bool

Memory with parity support and ECC memory come to mind (although I'm unfamiliar with them). Would a system incorporating this kind of memory not use all the bits of the object representation in the value representation? Specifically, when is this not true in practice?

Why is this restriction applied to only narrow unsigned character types?
ECC is transparent to the software. The parity bits are visible only to the hardware.

An example of a representation where not all bits participate (equally) in the value could be an integer type with negative zero. In such hardware, the following could be true:
1
2
3
4
5
unsigned char a[] = {0, 0, 0, 0};    // all-zero bit pattern: +0
unsigned char b[] = {0, 0, 0, 0x80}; // sign bit set: -0 on such hardware
int *A = (int *)a;
int *B = (int *)b;
assert(*A == *B); // two distinct bit patterns, one value
A bitwise map would not be able to correctly handle assignments if such values are used as keys.
> ECC is transparent to the software. The parity bits are visible only to the hardware.

Good to know, and thanks for the signed-zero example.

hm,

> However, bit-wise tries impose a crippling limitation on the object types which can be used as keys: they must be TriviallyCopyable types with the additional requirement that all bits of the object representation must participate in the value representation.


My algorithm probably doesn't correspond exactly to a bitwise trie, because I wouldn't say that this is a common restriction. A trie is just a trie: it reads and encodes the keys it is given. Mine is designed for use in databases, where a table can be stored on disk. However, if you don't want to read a key sequentially (e.g., a composite key), it's possible to make a modification that skips the insignificant bytes; that is not a problem.

In my opinion, the main beauty of the algorithm is that it inherits the DNA of sequences: it encodes, searches, and sub-scans keys really effectively.

> ECC is transparent to the software. The parity bits are visible only to the hardware.
>
> An example of a representation where not all bits participate (equally) in the value could be an integer type with negative zero. In such hardware, the following could be true:
> unsigned char a[] = {0, 0, 0, 0};
> unsigned char b[] = {0, 0, 0, 0x80};
> int *A = (int *)a;
> int *B = (int *)b;
> assert(*A == *B);
> A bitwise map would not be able to correctly handle assignments if such values are used as keys.


I am not sure this addresses that exact problem, but my code includes an implementation that overrides the comparison logic for signed integers.
It produces the right ordering in the structure, such as -10, -5, 0, 1, 15, and so on.
That is important if you later want to scan keys by range, for example to look up all keys from -7 to 9:

//int comparator =====================================================

	// Map a signed 32-bit key onto uint32 while preserving order: adding
	// 2^31 (i.e. flipping the sign bit) sends INT_MIN to 0, -1 to
	// 0x7FFFFFFF, 0 to 0x80000000 and INT_MAX to 0xFFFFFFFF, so the
	// trie's unsigned ordering matches the signed ordering.
	static uint32 NormalizeInt32(void* key)
	{
		int num = ((int*)key)[0];

		return (uint32)num + 2147483648u;
	}

	static int CompareSegmentInt32(void* keySeg1, void* keySeg2, uint32 index)
	{
		if (((int*)keySeg1)[0] < ((int*)keySeg2)[0])
			return -1;

		if (((int*)keySeg1)[0] > ((int*)keySeg2)[0])
			return 1;

		return 0;
	}

	static int CompareInt32(void* key1, uint32 keyLen1,
						    void* key2, uint32 keyLen2)
	{
		uint32 keyLen = keyLen1 < keyLen2 ? keyLen1 : keyLen2;

		for (uint32 i = 0; i < keyLen; i++)
		{
			if (((int*)key1)[i] < ((int*)key2)[i])
				return -1;

			if (((int*)key1)[i] > ((int*)key2)[i])
				return 1;
		}

		if (keyLen1 < keyLen2)
			return -1;

		if (keyLen1 > keyLen2)
			return 1;

		return 0;
	}
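
A quick sanity check for the normalization (a sketch; it assumes NormalizeInt32 above is visible in the same file):

#include <cassert>
#include <climits>
#include <cstddef>

int main()
{
    int samples[] = { INT_MIN, -10, -5, 0, 1, 15, INT_MAX };

    // Normalized values must sort exactly like the signed originals,
    // otherwise a range scan such as [-7, 9] would return wrong keys.
    for (std::size_t i = 0; i + 1 < sizeof samples / sizeof *samples; ++i)
    {
        int a = samples[i];
        int b = samples[i + 1];
        assert(NormalizeInt32(&a) < NormalizeInt32(&b));
    }
}
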
> Why is this restriction applied to only narrow unsigned character types?

This restriction (all bits of the object representation participate in the value representation) applies to all narrow character types: char, signed char and unsigned char. The reason is that the IS requires that any object can be inspected as a sequence of bytes. For instance, std::ostream::write and a subsequent std::istream::read (write and subsequently read the object representation as a sequence of char) should work correctly for TriviallyCopyable types.

That each possible bit pattern of the value representation represents a distinct number is necessarily true only for unsigned character types (e.g., a conforming implementation may use ones' complement representation for signed integral types).
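
A sketch of that byte-wise round trip (the Pod type here is my own example):

#include <cassert>
#include <fstream>

struct Pod { int x; double y; };   // trivially copyable

int main()
{
    Pod out = { 42, 3.5 }, in = { 0, 0.0 };

    // Write the object representation as a sequence of chars...
    std::ofstream("pod.bin", std::ios::binary)
        .write(reinterpret_cast<const char*>(&out), sizeof out);

    // ...and read it back; for trivially copyable types the value survives.
    std::ifstream("pod.bin", std::ios::binary)
        .read(reinterpret_cast<char*>(&in), sizeof in);

    assert(in.x == out.x && in.y == out.y);
}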


> Specifically, when is this not true in practice?

An example would be the type long double in a GNU / x64 implementation:
#include <iostream>
#include <limits>
#include <bitset>

int main()
{
    std::cout << "size in bytes: " << sizeof(long double) << '\n' 
              << "bits per byte: " << std::numeric_limits<unsigned char>::digits << '\n'
              << "bits in object representation: " << sizeof(long double) * std::numeric_limits<unsigned char>::digits << "\n\n"
              << "bits in mantissa: " << std::numeric_limits<long double>::digits << '\n' 
              << "min exponent: " << std::numeric_limits<long double>::min_exponent << '\n' 
              << "max exponent: " << std::numeric_limits<long double>::max_exponent << '\n' ;
}

size in bytes: 16
bits per byte: 8
bits in object representation: 128

bits in mantissa: 64
min exponent: -16381
max exponent: 16384

http://coliru.stacked-crooked.com/a/a51c5cb95b8ff162

i.e., 128 bits in the object representation, of which only 80 bits (mantissa: 64, biased exponent: 15, sign: 1) participate in the value representation.
Guys, I am still looking for good associative containers to compare against.
If you know of something with interesting performance, let's compare.