General question about data structure usage

Hello,

I'm trying to learn and practice with some of the core data structures, like hash tables, linked lists, etc. So far I've been implementing them myself each time for practice, but I was wondering if people actually do this, or if pre-written data structures are already part of the language. For instance, is there a std::hash_table, and if so, do software engineers use it or implement their own?

Thanks!
Most people don't write their own because that would be a waste of time and it's difficult to get it right. If you use something from the standard library, or some other popular library, you can be pretty sure it has been well tested and contains very few bugs.

The hash table in the standard library is called std::unordered_map.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
Always use the containers in the standard library when appropriate. You will save yourself many, many hours in coding and debugging.
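For example, a minimal sketch of typical usage:

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    // key -> value lookup backed by a hash table
    std::unordered_map<std::string, int> ages;

    ages["alice"] = 30;                // insert or overwrite
    ages.emplace("bob", 25);           // insert if not already present

    auto it = ages.find("alice");      // average O(1) lookup
    if (it != ages.end())
        std::cout << it->first << " is " << it->second << '\n';
}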
While I agree with @Peter87, the answer is that we have to learn about these data structures as engineers (including how to implement them) because each data structure has trade-offs in implementation and usage. Hash tables have different trade-offs from a self-balancing tree. We always make trade-offs between raw performance and memory usage, between real-time performance and amortized performance, between insert and lookup, etc.

Production code should always use stuff that's been implemented in a library (whether std, boost, or some other library) that has been peer reviewed by experts.

But I still frequently create my own toy implementations so that I can understand how they are implemented and what the trade-offs are.

These come up frequently in interviews to make sure the candidate is someone who has a decent grasp of common data structures and algorithms.
Thanks for the answers!

Do you think it is important in an interview to have a mastery of the standard, prewritten data structures, or are interviewers more interested in the applicants' knowledge of the theory behind these structures and their associated algorithms?
Personally, I'd be more concerned about an applicant's knowledge of the theory. An applicant may know the methods and usage of the standard containers by heart, but if they don't know the theory behind them, then how will they know which container to choose for a particular problem? The right choice is important.
hyperfine wrote:
For instance, is there a std::hash_table, and if so, do software engineers use it or implement their own?


There was hash_map in the original STL, before standard C++
There is std::unordered_map in C++ (as of C++11)
There is also tbb::concurrent_unordered_map https://software.intel.com/en-us/node/506171
There is also hashtab.h from GNU libiberty https://gcc.gnu.org/svn/gcc/trunk/include/hashtab.h
There is also dense_hash_map from Google https://github.com/sparsehash/sparsehash
There are many more.

As with any common data structure, as a software engineer, your job will be making an informed decision on which one works best for your case, and if none does, writing and maintaining your own (my current job has its own, and the one before had several).


hyperfine wrote:
are interviewers more interested in the applicants' knowledge of the theory behind these structures and their associated algorithms?

When I interview people, I ask about hash tables. Key questions are "what is a bucket", "how does a hash become a bucket index", and "what would be the hotspot if you look at this in a profiler" (where 'this' is a program inserting/updating a hash table a lot). And some concurrency stuff. For those who do well, "what is the exact worst-case complexity of insert if the table uses chaining and is doubling each time it exceeds the load factor", just because I like to know where people's limits are (some people do answer that). The goal is to see if the person understands what's going on in the computer, rather than just repeating some textbook theory.
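For reference, the "hash to bucket index" step usually boils down to something like this simplified sketch (not how any particular library implements it):

#include <cstddef>
#include <functional>
#include <string>

// Simplified: map a key's hash onto one of bucket_count buckets.
std::size_t bucket_index(const std::string& key, std::size_t bucket_count)
{
    std::size_t h = std::hash<std::string>{}(key);
    return h % bucket_count;   // or h & (bucket_count - 1) when
                               // bucket_count is a power of two
}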
"what is the exact worst-case complexity of insert if the table uses chaining and is doubling each time it exceeds the load factor"

'ya can't just leave that there without the answer! :)

By "chaining" you're meaning you use a linked-list as a bucket instead of double hashing or something, and "doubling" you mean doubling the number of buckets?

Doubling the number of buckets happens in worst-case O(n) but constant amortized time. Looking through a bucket in the worst case also occurs in linear time. Their sum is linear time.
mbozzi wrote:
constant amortized time.

You probably remember that from textbook/lecture/references, which is great (a lot of people don't), but it doesn't say much about your understanding of what is actually happening in the computer. What is that "constant"? Inserting N elements, how many calls will a profiler show made to that which is the CPU hotspot?
how many calls will a profiler show made to that which is the CPU hotspot


It's constant amortized time because, although each time the load factor is exceeded it takes linear time to copy all of the chain handles into the new collection of buckets, those resizes happen geometrically less often as the table grows.

When inserting N items you'll end up requiring lg N reallocations (where lg is the binary logarithm; binary because you're doubling the number of buckets).

We can assume that the reallocation itself takes roughly constant time. But that doesn't take into account copying the buckets over; that gives Σ from i = 1 to (lg N) of 2^i operations.

A little math to evaluate the sum gives 2(N - 1) operations.
Spread over the N elements that's a constant amount per element, therefore each operation is O(2 - 2/N) = O(1) --- which is why insertion is a constant-time operation.
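To make that constant concrete, here is a toy count of the work a doubling table does (just a simulation of the resize policy, not a real hash table; the exact totals depend on what you count as one operation):

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t N = 1'000'000;
    std::size_t buckets = 1;   // current number of buckets
    std::size_t size    = 0;   // elements stored so far
    std::size_t copies  = 0;   // elements rehashed during resizes

    for (std::size_t i = 0; i < N; ++i)
    {
        if (size + 1 > buckets)   // would exceed a max load factor of 1
        {
            copies += size;       // every existing element is rehashed
            buckets *= 2;         // doubling strategy
        }
        ++size;
    }

    // copies comes out close to N, so the total work (N inserts plus the
    // copies) is roughly 2N -- a constant amount per element inserted.
    std::cout << "inserted " << N << ", rehash copies " << copies << '\n';
}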

I think I may be missing the point of your question.
closed account (48T7M4Gy)
What is that "constant"?
http://stackoverflow.com/questions/200384/constant-amortized-time

( My mbPro has a non-constant CPU hotspot, especially when I run GIMP, but I digress. I know at first hand (ouch) why they milled the case out of a lump of aluminum. Should have called them an Apple HeatTabletPro )
@mbozzi

Hi,

You are way more advanced than me, and I would like to take this opportunity to say how impressed I am with all of your knowledgeable answers so far :+)

But I have a sneaking suspicion that the effect of move semantics, possibly together with memory caching and concurrency, might make quite an improvement. This is more of a tentative question rather than a demonstration of any real knowledge on my part :+)

My reason for mentioning this is that if one had M elements in a hash table, then added N elements - with move semantics, wouldn't the new container look as though it was contiguous and not have to copy any of the M elements? With caching, I have no idea how this works - but I have heard it offers an improvement. I don't know about concurrency either, but if it can be done - then I would imagine that would make quite a bit of difference too.

With a Hash Table, if the number of buckets was doubled, wouldn't there be a lot of time spent recalculating the hash function? Maybe the hash table can do incremental resizing, or it might depend on what the resizing strategy is?

With buckets being linked lists, the wiki page mentioned that it was only worth it if there were at most 3 elements in the list per bucket.

I guess the best thing to do would be to actually test it with some timed code (or as Cubbi asked, with a profiler) - this might show a difference between theory and what actually happens. Apologies from a backyard basher if you have already done this before :+)

I put this link here so you can all see what I have been reading:
https://en.wikipedia.org/wiki/Hash_function

mbozzi wrote:
A little math to evaluate the sum gives 2(N - 1) operations.
Spread over the N elements that's a constant amount per element, therefore each operation is O(2 - 2/N) = O(1) --- which is why insertion is a constant-time operation.


If the sum is 2(N - 1) operations, then if 1 operation is O(2 - 2/N), would that be O(2) not O(1)? And N operations would be O(2N)?

But I suspect (speculate?) that might be irrelevant once it is tested, but I could quite easily be wrong :+)

Anyway, all this is very interesting, it could be a great learning experience for lots of people. Especially with expert advice from Cubbi or others of similar standing :+)


And N operations would be O(2N)?

O(2N) equals O(N), just with a different constant. An algorithm taking exactly N/10 steps and an algorithm taking exactly 5*N+10*log(N)+1000 steps are both "O(N)" in computer science. Which is why engineers often care about the actual number of steps rather than asymptotic behavior at infinity.
@Cubbi

Thanks, that of course makes sense :+) It looks like I was tending more towards the engineers' view: I guess I am more interested in the "actual", rather than the broad brush of Big O notation. Cheers :+)

@mbozzi

Apologies from the "back yard basher" :+D
I had a sort of epiphany a few months ago: from a practical standpoint, O(log(N)) ~ O(K).
That's because log(N) is so darn small. log2(1,000,000) is about 20. So if algorithm A runs in 2*N*log2(N) time, and algorithm B runs in 40*N time, then A will outperform B up to about N=1,000,000 (2*N*log2(N) < 40*N exactly when log2(N) < 20, i.e. N < 2^20), even though A is slower in the big-O asymptotic sense.
No worries, and thanks :)

I guess you're asking about how to speed up the hash table?

Optimization in general is something that I struggle with. I don't run into a lot of situations that demand I micro-optimize my code. So take what I post about it with a grain of salt. Anyways:

You mentioned a few things -- caching, concurrency, and move semantics. You also spoke about how the container should be resized -- the rehash operation.

With a Hash Table, if the number of buckets was doubled, wouldn't there be a lot of time spent recalculating the hash function? Maybe the hash table can do incremental resizing, or it might depend on what the resizing strategy is?

Yes -- the hash function is the hotspot that @Cubbi was mentioning.
Also, the cost of a resize is proportional to the number of elements currently in the container. The resize is the O(n) component that makes insertion amortized constant time rather than plain constant time. In terms of total time spent, the fewer resize operations you perform, the less often you need to pay for it. My point is that whether you double, triple, quadruple, whatever, you still have to pay the same amount (assuming the table is the same size) for a resize operation. You just have to do it less often. So from a performance perspective you might as well start with a table that's too big to begin with and never pay for the resize.
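With std::unordered_map that pre-sizing is just a call to reserve(); a minimal sketch (the element count here is made up):

#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<int, std::string> table;

    // Made-up figure: we expect roughly a million entries, so size the
    // bucket array up front and skip every intermediate rehash.
    table.reserve(1'000'000);

    for (int i = 0; i < 1'000'000; ++i)
        table.emplace(i, "value");   // no rehashing along the way
}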

On the other hand, if you grow your table by an order of magnitude each time a resize is required, you'll waste a lot of memory. The same principle is used for std::vector. I recently learned that GCC's libstdc++ doubles the capacity of the vector each time a resize is required, but MSVC++ only multiplies the vector capacity by 3/2. The time complexity requirements are met either way, and the optimal choice is probably application-dependent.

What do you mean by "incremental resizing"?

Maybe move semantics could be applied during a rehash -- especially if we're using chaining; each node could be moved even if the payload couldn't be -- the table just has to change node pointers. I'm unsure of how this (move-only approach) would interact with a concurrent table.
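A toy sketch of that re-linking, with a hand-rolled node type just for illustration (a real table would store and reuse the hash rather than recomputing it from the key, manage node ownership, and so on):

#include <cstddef>
#include <vector>

struct Node { int key; Node* next; };   // toy chained-table node

// Toy rehash for a chained table: every node is spliced into the new
// bucket array by pointer; no element is copied or moved in memory.
void rehash(std::vector<Node*>& buckets, std::size_t new_count)
{
    std::vector<Node*> fresh(new_count, nullptr);
    for (Node* head : buckets)
        while (head)
        {
            Node* next = head->next;
            std::size_t b = static_cast<std::size_t>(head->key) % new_count;
            head->next = fresh[b];   // push the node onto its new chain
            fresh[b]   = head;
            head       = next;
        }
    buckets.swap(fresh);
}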

Cache behaviour is important for a hash table; poor locality slows it down.
In general, linked structures and random accesses are excellent at reducing the benefits of caching -- indirection goes a long way towards maximizing the number of cache misses and page faults. This is because the processor picks up on predictable memory access patterns and loads that memory asynchronously to program execution -- it tries to predict what memory will be accessed next, and pointer chasing gives it nothing to predict.
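A rough way to see that effect for yourself (not a rigorous benchmark; build with optimizations and expect the numbers to vary from machine to machine):

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Traverse a container and report how long the traversal took.
template <typename Container>
void timed_sum(const char* label, const Container& c)
{
    auto t0 = std::chrono::steady_clock::now();
    long long sum = std::accumulate(c.begin(), c.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << label << ": "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us (sum " << sum << ")\n";
}

int main()
{
    std::vector<int> v(10'000'000, 1);
    std::list<int>   l(v.begin(), v.end());
    timed_sum("vector", v);   // contiguous: prefetch-friendly
    timed_sum("list  ", l);   // pointer chasing defeats the prefetcher
}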

If you're interested in this, Ulrich Drepper at Red Hat has written an excellent paper on it. It's slightly out-of-date (from 2007) but still relevant. It's targeted mostly towards optimizing for cache usage.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

There's also this FAQ by Bjarne Stroustrup discussing why linked structures (or random accesses in general) are often slower than expected (the answer is caching).
https://isocpp.org/blog/2014/06/stroustrup-lists

There has been some work done designing algorithms whose performance doesn't depend on knowing the cache parameters (the term is "cache-oblivious"). Of course there are also variations on algorithms that know about the cache and take advantage of it. I expect that a quick hash table would need to know specifics.
http://www.cc.gatech.edu/~bader/COURSES/GATECH/CSE-Algs-Fall2013/papers/FLP99.pdf

@Cubbi was kind enough to help me determine the answer to his question (I was close, but he gave me the answer :)). He mentioned that after the question he posed he would usually discuss caching lookup data and then parallelization and the design of concurrent hash tables. So you're certainly onto something.
@mbozzi

Thanks for your detailed and informative reply :+)

Yes -- the hash function is the hotspot that @Cubbi was mentioning.


A while ago, I had a discussion with JLBorges about std::vector vs std::unordered_map. IIRC we had 1 million random ints. It turned out to be quicker to insert them into the vector and sort it, as opposed to just inserting them into the unordered map. So I learnt that the hashing function takes a relatively long time.
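Something along these lines, for anyone who wants to try it (a rough sketch of that kind of comparison; the original test may have differed in its details, and the timings will vary by compiler and machine):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <unordered_map>
#include <vector>

int main()
{
    std::mt19937 gen(12345);
    std::uniform_int_distribution<int> dist;
    std::vector<int> data(1'000'000);
    for (int& x : data) x = dist(gen);

    auto t0 = std::chrono::steady_clock::now();
    std::vector<int> v(data);             // copy, then sort
    std::sort(v.begin(), v.end());
    auto t1 = std::chrono::steady_clock::now();

    std::unordered_map<int, int> m;
    for (int x : data) m.emplace(x, 0);   // every key gets hashed on insert
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "vector+sort:   " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
              << "unordered_map: " << std::chrono::duration_cast<ms>(t2 - t1).count() << " ms\n";
}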

What do you mean by "incremental resizing"?


I missed a link for what I was reading:

https://en.wikipedia.org/wiki/Hash_table

wiki wrote:
Incremental resizing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:

During the resize, allocate the new hash table, but keep the old table unchanged.
In each lookup or delete operation, check both tables.
Perform insertion operations only in the new table.
At each insertion also move r elements from the old table to the new table.
When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Disk-based hash tables almost always use some scheme of incremental resizing, since the cost of rebuilding the entire table on disk would be too high.
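An idea-level sketch of that scheme, using two standard maps to stand in for the old and new tables (a real implementation would manage the bucket arrays itself, and the step where a full table is retired into the "old" slot is left out; r is just a made-up batch size):

#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Idea-level sketch of incremental resizing: inserts go to the new table
// and migrate a few stragglers from the old one; lookups check both.
class IncrementalTable
{
    std::unordered_map<int, std::string> old_table, new_table;
    static constexpr std::size_t r = 4;   // made-up migration batch size

public:
    void insert(int key, std::string value)
    {
        new_table[key] = std::move(value);
        // Move up to r elements from the old table on every insert.
        for (std::size_t i = 0; i < r && !old_table.empty(); ++i)
        {
            auto it = old_table.begin();
            new_table[it->first] = std::move(it->second);
            old_table.erase(it);
        }
    }

    std::optional<std::string> find(int key) const
    {
        if (auto it = new_table.find(key); it != new_table.end()) return it->second;
        if (auto it = old_table.find(key); it != old_table.end()) return it->second;
        return std::nullopt;
    }
};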


Having read that, I had an idea of my own ... But a huge disclaimer: there are a lot of very smart and experienced people out there - I am not one of them :+D

Say we had 1 million items in a hash table, then on say 5 separate occasions we inserted a further 1 million each time - giving a total of 6 million at the end. Also, we had no way to predict the final size.

So my proposal is to chain the tables together - have a vector of hash tables: each time we have a bulk insertion (or the size of a table is exceeded), create and append a new hash table and append the relevant hash function to a vector. The downside is that in order to find an item, one would possibly need to apply up to 6 hash functions on as many tables. But this might be better than having to rehash 6 + 5 + 4 + 3 + 2 = 20 million items, versus 6 million with the chained tables.

This approach would have its limitations, and like everything would have to be tested and compared. Just wondering whether that idea has any merit - or would it be a lead balloon compared to the really smart methods already implemented?



Thanks for the articles - I will have to ruminate on those for awhile :+)

So you're certainly onto something.


Well, not from any particular knowledge or inventiveness on my part :+)

JLBorges recently posted a quote from Stroustrup about how only the real experts know what happens in detail, especially with regard to cache effects.

Cubbi mentioned the concurrency - I guessed it was bound to offer some improvement.

With the move semantics I just remembered that it has a big effect on other containers.


With this whole topic, it seems to me that it is not quite enough to look at the pros and cons of std containers in terms of big O notation. With all these other factors going on, one really has to time and compare various options. I guess this goes back to Cubbi's comment about why engineers care about the actual number of steps involved. I am sure JLBorges has mentioned the importance of timing over theory before.

I guess the clever part is trying to design something that will turn out to be better in particular situations.

Anyway, I look forward to any replies and hope all is well at your end - Regards :+)

I have thought more about my idea above, but I think std::vector would still be better.