Threading performance

I have a pipeline of sources, sinks, and filters that looks like
source -> filter -> ... -> filter -> sink
The sources and sinks are usually files, but they might be in-memory streams, null devices, or something else. The filters are generally CPU-intensive, so I want them to run in parallel.

I spent the last couple of days working on an M:N round-robin fiber scheduler, so that M filters can run on N threads, where N is the number of CPU cores. When a filter's input queue is exhausted or its output queue is full, its fiber can yield so the thread can be used to perform some other task.
The problem is that now I'm second-guessing myself. Is there any real advantage to redoing work the kernel already does? Would it make more sense to just run each filter in its own thread, even if this means running more than one thread per core, and let the kernel figure it out?
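For concreteness, here is a minimal sketch of the cooperative scheme described above, with plain function calls standing in for fiber switches. The names (Filter, step, run_until_stall) are invented for illustration; each step() moves at most one item from the filter's input queue to its output queue and returns false when it cannot make progress, which is exactly the point where a real fiber would yield its thread.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical sketch: a filter connected by bounded queues.
struct Filter {
    std::deque<int>* in;
    std::deque<int>* out;
    std::size_t out_cap;            // bounded output queue
    std::function<int(int)> fn;

    // Move at most one item through; return false on no progress
    // (input exhausted or output full) -- a fiber would yield here.
    bool step() {
        if (in->empty() || out->size() >= out_cap)
            return false;
        out->push_back(fn(in->front()));
        in->pop_front();
        return true;
    }
};

// Round-robin driver: keep stepping the filters until the whole
// pipeline stalls. An M:N scheduler would run this on N threads,
// switching fibers instead of calling step() directly.
void run_until_stall(std::vector<Filter>& filters) {
    bool progress = true;
    while (progress) {
        progress = false;
        for (auto& f : filters)
            progress = f.step() || progress;
    }
}
```

In this single-threaded form the driver returns when every filter is blocked, which corresponds to the "pipeline stall" that pool.sync() would detect.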

Currently, the only advantage I can think of is that the coordinator thread can call pool.sync() to wait for the pipeline to stall (which would happen when some copy operation reaches EOF, for example). At that point, all the fibers that are still running will be in a well-defined suspended state, which allows any thread to modify their data structures more or less arbitrarily. A thread, on the other hand, would never really be suspended; it would just be spinning in a

std::unique_lock<std::mutex> lock(this->mtx);
while (!this->ready())
    this->cv.wait(lock);

loop.
If you have some reasonable number of those (say, 300 or fewer), they may as well each have a dedicated thread; that's how many real-time systems are designed (although without an RT scheduler, it may be hard to reason about progress).

One reason a custom scheduler could help in this setup is to keep each workload on the same core (and therefore, at least partially, in that core's caches) while different filters take a stab at it. That typically means a per-core local job queue and a policy where the next job is picked from the local queue; if that is empty, the global queue is checked; and if both the local and global queues are empty, work is stolen from another core.
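The pick-next-job policy just described can be sketched as a single function. This is a hedged illustration, not a real scheduler: the queues would need locks or lock-free deques in practice, and Job, pick_job, and the queue layout are invented names.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <vector>

using Job = int;  // placeholder for a real job descriptor

// Policy: local queue first, then the global queue, then steal
// from another core's queue. Stealing takes from the victim's
// tail, which is the common convention (the victim pops from
// the head, so contention is minimized).
std::optional<Job> pick_job(std::deque<Job>& local,
                            std::deque<Job>& global,
                            std::vector<std::deque<Job>>& others) {
    if (!local.empty()) {
        Job j = local.front(); local.pop_front();
        return j;
    }
    if (!global.empty()) {
        Job j = global.front(); global.pop_front();
        return j;
    }
    for (auto& victim : others) {
        if (!victim.empty()) {
            Job j = victim.back(); victim.pop_back();  // steal from the tail
            return j;
        }
    }
    return std::nullopt;  // nothing to do; park the thread
}
```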

"coordinator thread" is a red flag though, what does it do that needs its own thread? Why can't any user thread call pool.sync()?
Yeah, I thought as much.
I don't pin the threads to a CPU nor the fibers to a thread, so no cache advantage from that. Cache optimization is beyond the scope of improvements at this point in the project, anyway.

although without RT scheduler, it may be hard to reason about the progress
What do you mean by this?

"coordinator thread" is a red flag though, what does it do that needs its own thread? Why can't any user thread call pool.sync()?
The main thread will do the coordination. I plan to present an interface that looks like streams wrapping one another, while internally the objects construct the pipeline and perform the synchronization. For example:
InputFileStream input("foo.bin");
// current pipeline: input
DecryptionStream crypto(input);
// current pipeline: input -> crypto
DecompressionStream decompress(crypto);
// current pipeline: input -> crypto -> decompress
OutputFileStream output("bar.txt");
// pipeline A: input -> crypto -> decompress
// pipeline B: output
input.copy_to(output);
// current pipeline: input -> crypto -> decompress -> output
// copy_to() calls pool.sync(), which waits until "foo.bin" has been fully decrypted
// and decompressed into "bar.txt" 
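A minimal mock of how that wrapping interface could track the pipeline internally. Everything here is invented for illustration (Pool, Stream, the shared chain member), and sync() is stubbed out; the point is only that wrapped streams share one chain, while copy_to() merges pipeline B into pipeline A and then waits.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Stub: the real pool would block in sync() until the pipeline stalls.
struct Pool {
    bool synced = false;
    void sync() { synced = true; }
};

struct Stream {
    std::shared_ptr<std::vector<std::string>> chain;  // pipeline so far
    Pool* pool;

    // A source/sink starts its own pipeline.
    Stream(std::string name, Pool* p)
        : chain(std::make_shared<std::vector<std::string>>()), pool(p) {
        chain->push_back(std::move(name));
    }
    // A wrapping stream appends itself to the upstream pipeline.
    Stream(std::string name, Stream& upstream)
        : chain(upstream.chain), pool(upstream.pool) {
        chain->push_back(std::move(name));
    }
    // Merge pipeline B (the sink's chain) into pipeline A, then wait.
    void copy_to(Stream& sink) {
        chain->insert(chain->end(), sink.chain->begin(), sink.chain->end());
        sink.chain = chain;
        pool->sync();
    }
};
```

With this mock, constructing input, crypto, and decompress builds one chain, output starts another, and input.copy_to(output) joins them before calling pool.sync(), matching the pipeline comments above.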