Is there a methodology for cache-efficient algorithms and data structures?

Dear experts

I am interested in high-performance computing.
Sometimes I wonder whether the deepest performance bottleneck is memory access, which is only partly mitigated by cache memory architectures.
I have heard that many instructions can be executed in the time of a single memory access,
so I suppose that cache-efficient algorithms are more effective than parallel techniques for speeding up code.

Is this thinking correct?
And if so, I would appreciate it if you could point me to documents (textbooks) about how to implement cache- (or memory-) efficient algorithms and data structures.
(Not just simple multi-dimensional arrays, but tree structures and the like.)

Kind regards
> so I suppose that cache-efficient algorithms are more effective than parallel techniques for speeding up code.
One seeks to do more in a given time frame without using more resources and the other seeks to use more resources to do more in a given time frame. They're not really comparable, nor mutually exclusive.

> Not just simple multi-dimensional arrays, but tree structures and the like
Graph-like structures are inherently cache-unfriendly because the hardware is unable to predict the next memory location that will be accessed. There are a few data structures that seek to strike a compromise, like unrolled linked lists and B-trees, but if you want optimal use of the cache you have to go with a contiguous data structure.
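
To make that compromise concrete, here is a minimal sketch of an unrolled list node (my illustration, not from any library; the names and the block size are arbitrary). Each node packs several elements into one contiguous block, so most accesses scan within a node and only the occasional hop between nodes risks a cache miss:

#include <cstddef>

// Sketch of an unrolled linked list node: K elements per node.
// Traversal touches K contiguous elements before following one pointer,
// instead of one pointer dereference per element as in std::list.
template <typename T, std::size_t K = 16>
struct UnrolledNode {
    T data[K];               // contiguous payload; cache-friendly to scan
    std::size_t count = 0;   // how many slots of data[] are in use
    UnrolledNode* next = nullptr;
};

// Visit every element; only one potentially cache-missing jump per K elements.
template <typename T, std::size_t K, typename F>
void for_each(UnrolledNode<T, K>* head, F f) {
    for (auto* n = head; n != nullptr; n = n->next)
        for (std::size_t i = 0; i < n->count; ++i)
            f(n->data[i]);
}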
Dear helios

Thank you for your rapid & useful reply.

> They're not really comparable, nor mutually exclusive.

I understand that they can coexist for higher performance.

> but if you want optimal use of the cache you have to go with a contiguous data structure.


Yes, contiguous data structures; I see that they are a minimum prerequisite for cache efficiency.
I know that some linked data structures can be represented as contiguous data structures (cf. binary trees laid out in arrays).
Is there general knowledge (or a catalog?) of best practices for such usage?
> I know that some linked data structures can be represented as contiguous data structures

We could use pool allocators for most linked data structures;
while not being contiguous, they can still give excellent memory locality.
https://www.boost.org/doc/libs/1_75_0/libs/pool/doc/html/boost_pool/pool/pooling.html#boost_pool.pool.pooling.simple
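
As a minimal sketch of the idea (my illustration, not code from the linked page): boost::object_pool hands out nodes from large contiguous chunks, so successively allocated nodes tend to sit near each other in memory:

#include <boost/pool/object_pool.hpp>

struct Node {
    int value;
    Node* next;
    Node(int v, Node* n) : value(v), next(n) {}
};

int main() {
    boost::object_pool<Node> pool; // allocates Nodes in contiguous chunks

    // Build a small linked list; nodes come from the pool, so consecutive
    // allocations are likely to be adjacent in memory.
    Node* head = nullptr;
    for (int i = 0; i < 100; ++i)
        head = pool.construct(i, head);

    // No per-node delete needed: the pool releases everything when it
    // goes out of scope (destroy() is available for early release).
}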

This (long) paper (written some years ago, in 2007) may be of interest:
'What Every Programmer Should Know About Memory' https://akkadia.org/drepper/cpumemory.pdf

(Section 6 talks about 'What Programmers Can Do')

A talk about writing cache-friendly C++, from Meeting C++ 2018:
https://www.youtube.com/watch?v=Nz9SiF0QVKY
Dear JLBorges and thmm

Thank you very much for your helpful information.

I am willing to study these techniques and their foundations.
(As I study the HPC domain, I have noticed that low-level (hardware & OS) knowledge is necessary to design high-performance code. It is tough work...)

Kind regards
> I know that some linked data structures can be represented as contiguous data structures
Yes, a binary (or, well, an n-ary) tree can be laid out in memory such that its nodes are all in a single block of memory, but the access pattern is still not too different from a pointer-based structure, from the point of view of the cache. Related nodes can still be arbitrarily far from each other and require a lot of jumping around to access.
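
For illustration (my sketch, not from the thread): the classic implicit layout stores a complete binary tree in one array, with the children of the node at index i at indices 2i+1 and 2i+2. The nodes are contiguous, but a root-to-leaf walk still jumps across the array:

#include <vector>
#include <cstddef>

// Implicit binary tree in a single contiguous buffer (heap-style layout).
// No pointers: parent/child relations are index arithmetic.
struct ImplicitTree {
    std::vector<int> nodes;

    static std::size_t left(std::size_t i)  { return 2 * i + 1; }
    static std::size_t right(std::size_t i) { return 2 * i + 2; }

    // A root-to-leaf descent: the index roughly doubles each step, so the
    // deep levels of a large tree land on different cache lines anyway.
    int descend_leftmost() const {
        std::size_t i = 0;
        while (left(i) < nodes.size()) i = left(i);
        return nodes[i];
    }
};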
My ramble :)
Your own lightweight memory manager is a big help if you are making your own data structures, especially pointer-based ones like trees. It doesn't take that much work: a vector allocates the memory for you, and you dole it out. Deletes leave the space behind, and you reclaim the empty slots the next time you dole one out. A few methods around a vector and it's done; template it and it's done for everything.
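
Something like this minimal sketch (my illustration of the idea; the names are made up): nodes live in one vector, "pointers" are indices, and deleted slots go on a free list for reuse:

#include <vector>
#include <cstddef>

// A tiny vector-backed pool: nodes stay in contiguous storage, links are
// indices, and freed slots are recycled via a free list.
// Node must be default-constructible for this sketch.
template <typename Node>
class NodePool {
    std::vector<Node> storage;
    std::vector<std::size_t> free_slots; // indices of deleted nodes

public:
    std::size_t allocate() {
        if (!free_slots.empty()) {        // reuse a hole first
            std::size_t i = free_slots.back();
            free_slots.pop_back();
            return i;
        }
        storage.emplace_back();           // otherwise grow the vector
        return storage.size() - 1;
    }

    void deallocate(std::size_t i) { free_slots.push_back(i); }

    Node&       operator[](std::size_t i)       { return storage[i]; }
    const Node& operator[](std::size_t i) const { return storage[i]; }
};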

Multi-threading gives a near-linear speedup (e.g. 4 threads tends to net you around 3.5 to 3.75 times the speed of 1 thread) when you don't have race conditions. But splitting the data that way often requires an afterwards 'stitch together' loop too. For example, to sort you can split a vector into 4 chunks and sort them in parallel, but you have to merge the 4 results back together afterward. But if each chunk fits in a page of memory? Maybe worth it? You can certainly have both in some designs, where each thread works on its own cache page, if your CPU has a cache for each core, that is. As I understand it, most have a small cache per core, and the higher-level caches are shared across cores.
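
A minimal sketch of that split-sort-merge idea (my illustration; the 4-way split is hardcoded for brevity):

#include <algorithm>
#include <thread>
#include <vector>

// Sort a vector in 4 parallel chunks, then stitch the results together
// with three in-place merges.
void parallel_sort4(std::vector<int>& v) {
    const std::size_t n = v.size();
    auto b = v.begin();
    auto q1 = b + n / 4, q2 = b + n / 2, q3 = b + 3 * n / 4, e = v.end();

    std::thread t1([&] { std::sort(b,  q1); });
    std::thread t2([&] { std::sort(q1, q2); });
    std::thread t3([&] { std::sort(q2, q3); });
    std::sort(q3, e);                 // sort the last chunk on this thread
    t1.join(); t2.join(); t3.join();

    // The 'stitch together' pass: pairwise merges of the sorted runs.
    std::inplace_merge(b,  q1, q2);
    std::inplace_merge(q2, q3, e);
    std::inplace_merge(b,  q2, e);
}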

Consider this: a BST, say you cram the tree into a vector, and say the data constantly has inserts and deletes. Your search path is random-ish, so it potentially touches a different cache page for each node it looks at. A binary search over a plain sorted vector, though, looks at the first half (touches 2 pages), then the back half (touches 1 more page; you needed a page for the highest, lowest, and middle values each iteration), so every iteration touches 3 pages at first, but after a short time it only touches one page over and over to finish up, or worst case 2 pages, which will happen frequently. Which is better cache-wise?

The idea isn't hard to see: however you access the data, you want it clustered so you access one page a lot, swap pages, access the new page a lot, and so on, rather than access one, swap pages, access another, swap, ... So you design your code around that as best you can. It's not always possible: random access is going to have page faults. But with any kind of ordered access you have good odds of setting it up to minimize the page faults. Avoid random access if you can.
Dear helios and jonnin

Thank you for your experienced advice.

At least, I now see the limitations of cache-efficient methods, particularly for linked data structures (graphs, trees, etc.).
Some customization is needed to achieve cache efficiency there.

By the way, I think that understanding memory layouts in relation to the cache is a good occasion to rethink fundamental knowledge of basic data structures like the STL containers (std::vector, std::unordered_map, std::list, etc.).
There is the theory of computation time based on Big-O notation (e.g. O(1), O(log N), O(N)), but in my opinion, wherever possible a contiguous data structure (std::vector, std::array, etc.) should be selected to enhance performance via better memory access.

What do you think about this point? I hope you will share your ideas and opinions.

Kind regards

It depends on what you are doing. I think you are focused on one thing, and not the big picture. But the big picture is where you get burned on performance -- modern C++ hides a lot of slowness from the casual coder, by nature of its powerful high-level tools. Any tool can be misused.

The reason we have a dozen containers is that each one has weaknesses that the others are better at, in a rock/paper/scissors sort of fashion. Each one has its place, though vector is the clear winner as the default go-to container for most code. When vector isn't good enough, you figure out which one is best for that piece of code.

It's difficult to maintain a sorted vector of rapidly changing data (in terms of work done each time you change the data).
It's cheap to maintain a sorted list.
It's expensive to search a list (one element at a time).
It's cheap to search a vector (binary search).
And so on. (See the sketch just below for the vector side of this.)
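
To illustrate that tradeoff on the vector side (a minimal sketch, not from the post): keeping a vector sorted means every insert pays for a shift, while lookups are a cheap binary search over contiguous memory:

#include <algorithm>
#include <vector>

// Insert while keeping v sorted: O(log n) to find the spot,
// O(n) to shift the tail over (the expensive part on rapid changes).
void sorted_insert(std::vector<int>& v, int x) {
    v.insert(std::lower_bound(v.begin(), v.end(), x), x);
}

// Search is the cheap part: binary search over contiguous memory.
bool contains(const std::vector<int>& v, int x) {
    return std::binary_search(v.begin(), v.end(), x);
}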

Big-O has issues if it's your only idea of performance, for sure. Something that takes O(1) but runs for 1 minute is less efficient than something that is O(n*n) and spends 1 nanosecond per iteration, for a long, long, long time. Eventually the O(1) version wins, but in practice it may be useless (the crossover here is when n*n nanoseconds exceeds 60 seconds, i.e. n around 245,000). You have to weigh big-O, some sort of wall-clock time (CPU cycles is a good pick), page faults, bad habits, and more when doing a performance pass. Bad habits... stuff like a function foo that creates a fat object and destroys it again every time it runs in a high-iteration-count loop, instead of passing a temp variable in by reference, for example.

But that is drifting far afield. Page faults are important. But they are just one piece of the big picture -- that is the take-away point here.
> What do you think about this point?
You can find a video online of Stroustrup arguing, from a cache-centric perspective as opposed to an algorithmic-complexity perspective, that if you have a sequence that needs to be traversed pretty much every time it is appended to, it is better to use a std::vector than a std::list. This is because the cost of iterating the vector twice (once to process it and once to expand it, when necessary) is dwarfed by the cost of iterating the list once, because of the constant cache misses caused by the list's non-locality.

This might be the one: https://www.youtube.com/watch?v=YQs6IC-vgmo
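
For the flavor of that experiment, here is a minimal sketch (my own illustration, not Stroustrup's code): insert N values at their sorted positions in a vector and in a list, and time both. The list "wins" on paper (no element shifting), yet typically loses on the clock because every traversal step is a potential cache miss:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <list>
#include <random>
#include <vector>

// Time inserting n random ints at their sorted position in a container.
template <typename Container>
long long timed_sorted_inserts(Container& c, int n) {
    std::mt19937 rng(42);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        int x = static_cast<int>(rng());
        // Linear scan for the spot in both containers; the vector shifts
        // elements on insert, the list does not, yet the vector is
        // usually faster in practice thanks to its locality.
        auto it = std::find_if(c.begin(), c.end(),
                               [x](int y) { return y >= x; });
        c.insert(it, x);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}

int main() {
    std::vector<int> v;
    std::list<int> l;
    std::printf("vector: %lld ms\n", timed_sorted_inserts(v, 20000));
    std::printf("list:   %lld ms\n", timed_sorted_inserts(l, 20000));
}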
Stroustrup has also written about this; there is a small extract (and a link to the full article) in this post:
https://www.cplusplus.com/forum/beginner/206320/#msg976016
Dear all,

Thank you for your profound opinions.

In addition to the cache-centric perspective, a contiguous data structure is also suited to SIMD operations (e.g. Streaming SIMD Extensions).
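
For instance (a minimal sketch assuming an x86 target; purely illustrative), SSE can add four floats per instruction precisely because the data sits contiguously in memory:

#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// c[i] = a[i] + b[i], four floats at a time.
// Works because a, b, c are contiguous arrays; n assumed divisible by 4.
void add_sse(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // load 4 contiguous floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}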

I will design and code my data structures with the above discussion in mind.

I deeply appreciate your help.

Kind regards