Pre-allocated memory for parallel threads while keeping locality

Dear experts

I know this question may be ambiguous and very case-by-case, but if you have relevant know-how, I would appreciate your help.


Probably many people have experience with allocating memory in advance, before it is used in a loop (e.g. std::vector<Obj> dat(buffer_size);). Avoiding repeated allocation of such a memory region is a basic approach to speeding up programs.
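For example, in a serial run I keep a single buffer alive across the whole loop instead of constructing it in every iteration (Obj, buffer_size, and LOOP_NUM are just placeholders here):

// Allocated once, before the loop; reused as scratch space in every iteration.
std::vector<Obj> dat(buffer_size);

for (std::size_t i = 0; i < LOOP_NUM; ++i) {
    // overwrite and process dat here; no per-iteration allocation
}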

I tried this for parallel use. To summarize the result: in my case I succeeded in speeding up the serial run, but failed to speed up the parallel run (compared with the version that re-allocates the memory region in every loop iteration). It seems that the higher the thread count, the slower the program runs.
(To avoid data races when threads access dat, a separate dat is prepared for each thread. I tried 1 to 32 threads.)

I suspect this is due to Non-Uniform Memory Access (NUMA), because the pre-allocations are all done in a single thread, without consideration of memory locality.

If you know how to make each thread access a pre-allocated memory region near its own CPU, without data races, I would appreciate your advice.

Pseudocode is below (I use TBB):

#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

// Thread count (assumed to also cap TBB's parallelism, e.g. via tbb::global_control).
std::size_t max_thread_num = 16;

// Pre-allocated memory region for each thread.
std::vector<std::vector<Obj>> dats(max_thread_num);
for (auto& d : dats)
    d = std::vector<Obj>(buffer_size);

tbb::parallel_for(tbb::blocked_range<std::size_t>(0, LOOP_NUM),
    [&](const tbb::blocked_range<std::size_t>& r) {
        // Index of the calling worker thread within the current arena.
        int thread_id = tbb::task_arena::current_thread_index();
        auto& dat = dats[thread_id];
        // std::vector<Obj> dat(buffer_size); // (repeated-allocation version)

        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            // process something using dat
        }
    });

It seems that the higher the thread count, the slower the program runs.
You will only get a performance advantage by increasing the thread count up to std::thread::hardware_concurrency(). Going any higher will at best give you the same performance. (Generally. There are algorithms for which this doesn't hold.)
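For reference, you can query that limit at run time; a minimal sketch:

#include <iostream>
#include <thread>

int main() {
    // Number of concurrent threads the hardware supports (may return 0 if unknown).
    std::cout << std::thread::hardware_concurrency() << '\n';
}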

I suspect this is due to Non-Uniform Memory Access (NUMA), because the pre-allocations are all done in a single thread, without consideration of memory locality.
Well, let's start with the obvious question: Is your computer a NUMA architecture? Do you have a dual socket motherboard or a Threadripper/EPYC CPU?
Dear helios

Thank you for your rapid reply.

Going any higher will at best give you the same performance. (Generally. There are algorithms for which this doesn't hold.)

Yes, I am keeping that in mind.
But with up to 2 threads the pre-allocated version is faster than the repeated-allocation one; beyond that, it becomes slower.
(My algorithm has no critical sections between threads.)

Well, let's start with the obvious question: Is your computer a NUMA architecture? Do you have a dual socket motherboard or a Threadripper/EPYC CPU?

My CPUs are Xeon processors; the lscpu information is:

model name: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node 0 CPU: 0-17,36-53
NUMA node 1 CPU: 18-35,54-71

Cores per socket: 18
Sockets: 2
NUMA nodes: 2


Is there a way (or a framework/library) to do the following?
(1) Pre-allocate memory regions distributed appropriately for the architecture in use.
(2) During a parallel run, have each thread use a nearby memory region that no other thread touches, so there are no data races.

Kind regards
Fair enough. NUMA is indeed very likely the problem, then. You will need an allocator that's NUMA-aware, and you will need TBB to inform you on which NUMA node a thread is running. I'm not sure if TBB is designed to do this.
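If you want to go outside TBB for the allocation part, libnuma can place memory on a specific node. A rough sketch (assuming libnuma is installed and you link with -lnuma; error handling and the numa_available() check omitted):

#include <cstddef>
#include <numa.h>     // libnuma; link with -lnuma
#include <sched.h>    // sched_getcpu (GNU extension)

// Allocate a buffer on the NUMA node of the CPU the calling thread runs on.
// Release it later with numa_free(p, bytes).
void* alloc_on_local_node(std::size_t bytes) {
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(bytes, node);
}

One caveat: this only helps if the thread stays on (or near) that node, so it is usually combined with pinning threads to cores.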

In the mean time, you might want to try this:
// within the thread lambda
auto& dat = dats[thread_id];
if (dat.empty())
    dat.resize(buffer_size);
Dear helios

Thank you for your kind reply.

In the mean time, you might want to try this:


I understand: what you mean is to allocate the memory inside each thread, which makes it likely that it ends up in memory near that thread. I would like to try that as a test.

BTW, TBB is aware of the importance of NUMA, but at the moment it seems that TBB does not provide a high-level API to handle it.

https://link.springer.com/chapter/10.1007/978-1-4842-4398-5_20
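(Correction: looking further, newer oneTBB releases do seem to add some high-level support via tbb::info::numa_nodes() and NUMA-constrained task_arenas, although it requires the hwloc-based TBBBind component and I have not tried it yet. A rough sketch based on that documentation:)

#include <cstddef>
#include <vector>
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>

// One arena per NUMA node; work submitted to an arena runs on threads
// bound to that node, so memory first-touched there should stay local.
std::vector<tbb::numa_node_id> nodes = tbb::info::numa_nodes();
std::vector<tbb::task_arena> arenas(nodes.size());
std::vector<tbb::task_group> groups(nodes.size());

for (std::size_t n = 0; n < nodes.size(); ++n)
    arenas[n].initialize(tbb::task_arena::constraints(nodes[n]));

for (std::size_t n = 0; n < nodes.size(); ++n)
    arenas[n].execute([&, n] {
        groups[n].run([&, n] {
            // Allocate this node's buffers here (first touch) and process
            // its share of the work, e.g. with a nested tbb::parallel_for.
        });
    });

for (std::size_t n = 0; n < nodes.size(); ++n)
    arenas[n].execute([&, n] { groups[n].wait(); });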

I realize that performance analysis related to memory placement (not only NUMA, but caches etc.) is demanding for programmers, so I am becoming more careful about how to measure and improve my code.


If anyone has run into similar problems and found a way to solve them, I hope you can share your knowledge.

Kind regards
Today I ran a performance check on a NUMA machine (Xeon Gold) and on a non-NUMA machine (Core i7).
The non-NUMA machine scaled as the thread count increased, while the NUMA machine showed the degradation.

I have confirmed that this performance problem is due to NUMA, i.e. the memory-access latency between different NUMA nodes.

I am now considering how to cope with this.
If you know a best practice, I would appreciate your help.
(If possible, a TBB-based method is welcome.)

Kind regards