So, after trying all sorts of fancy tricks (threads, SSE math, object recycling) to lower my CPU usage to an acceptable level, it finally dawned on me what the problem was, and a minor tweak to a data structure finally gave me the performance boost I was looking for.
I have a linked list of particles like this:
[p0, p1, p2, p3, p4, ..., pn]
The positions of particles in this list are updated frequently and new particles are constantly being added. When a particle goes outside of a predetermined boundary, it's removed from the list:
[p0, p2, p3, p4, ..., pn]
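Concretely, the setup looks roughly like this (the particle fields and the box-shaped boundary are illustrative, not my exact code):

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical particle: 2D position and velocity.
struct Particle {
    float x, y;
    float vx, vy;
};

// Advance every particle one step, then remove the ones that have
// left the boundary (here a simple axis-aligned box, as an assumption).
void update(std::list<Particle>& particles, float bound, float dt) {
    for (auto it = particles.begin(); it != particles.end(); ) {
        it->x += it->vx * dt;
        it->y += it->vy * dt;
        bool outside = it->x < -bound || it->x > bound ||
                       it->y < -bound || it->y > bound;
        // erase() returns the iterator to the next element
        it = outside ? particles.erase(it) : std::next(it);
    }
}
```

With `std::list`, every `erase` frees a node and every new particle allocates one, which is exactly where the trouble starts.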
If I simply leak the memory, all is well performance-wise. If I free that memory, performance starts to deteriorate as the next batches of new particles arrive, and it hits its worst point once the list has gone through one complete cycle of particles.
At first I thought I was getting memory fragmentation, since the particles were small, dynamically allocated, and pointed to by small dynamically allocated nodes on a list, so I tried adding removed nodes to a second list and recycling them. That didn't help at all.
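For reference, the recycling scheme was roughly this (names are illustrative): a free list that hands back removed nodes before ever calling `new` again.

```cpp
#include <cassert>

// Hypothetical list node, as in the original scheme: a small node
// pointing at a separately allocated particle.
struct Node {
    void* particle = nullptr;
    Node* next = nullptr;
};

// Free list: instead of deleting removed nodes, push them onto a
// recycle stack, and pop from that stack before allocating fresh.
struct NodePool {
    Node* free_head = nullptr;

    Node* acquire() {
        if (free_head) {            // reuse a recycled node
            Node* n = free_head;
            free_head = n->next;
            n->next = nullptr;
            return n;
        }
        return new Node();          // pool empty: allocate fresh
    }

    void release(Node* n) {         // recycle instead of delete
        n->next = free_head;
        free_head = n;
    }
};
```

This reuses memory, but the recycled addresses are just as scattered as before, which in hindsight is why it didn't help.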
The particular shape of the region the particles move through means that the list of particles doesn't behave quite like a queue. Suppose the list goes from p0 to p1000; then p100 through p200 are the most likely to be removed next. If we assume that the allocator does behave approximately like a queue and hands back incrementally increasing memory addresses, this means that the first cycle through the list produces a nicely ordered collection. The second cycle has some randomness to it, the third is even more random, and so on. Eventually the CPU needs to fetch data basically at random to continue iterating over the list. So who's responsible for the nice performance at the start of the program? The cache.
And that's where unrolled lists come in: http://en.wikipedia.org/wiki/Unrolled_linked_list
By simply making the nodes in the list hold a short array of objects (not pointers) and a bitmap to track "null" positions, I managed to nearly double my performance. It only took me like ten minutes. Adding the conditional compilation to use DirectXMath, plus all the code that went with it, took me like twelve hours and barely produced any difference compared to a straightforward implementation of vector and matrix arithmetic.
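The node ends up looking something like this (the capacity and particle fields are illustrative; the point is that particles live inline in the node, so iteration walks contiguous memory, and removal is just clearing a bit instead of a `free`):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical particle; the real one carries more state.
struct Particle {
    float x, y;
};

// Unrolled-list node: a short inline array of particles plus a
// bitmap marking which slots are occupied.
struct Node {
    static const int kCapacity = 32;  // one cache-friendly chunk
    Particle items[kCapacity];
    uint32_t occupied = 0;            // bit i set => items[i] is live
    Node*    next = nullptr;

    // Place p in the first free slot; false means the node is full
    // and the caller should link a new node.
    bool insert(const Particle& p) {
        for (int i = 0; i < kCapacity; ++i) {
            if (!(occupied & (1u << i))) {
                items[i] = p;
                occupied |= (1u << i);
                return true;
            }
        }
        return false;
    }

    // Removal is O(1) and touches no allocator at all.
    void remove(int slot) { occupied &= ~(1u << slot); }

    int count() const {
        int n = 0;
        for (int i = 0; i < kCapacity; ++i)
            if (occupied & (1u << i)) ++n;
        return n;
    }
};
```

Iteration skips unset bits, and a node whose bitmap hits zero can be unlinked; either way, the particles a node does hold sit next to each other in memory.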
It's amazing how such a simple modification can have such a dramatic impact on performance. Evidently, modern architectures really
don't like linked lists.