Oh yes, I did look at the wrong branch. I got the right copy and tested it against my implementation of cbuffers. Yours is nearly 4x as fast. Strangely, though, my (updated) vectors version is another 10% or so faster. I updated the copy on mediafire if you're interested:
You'll see that we made a few changes to the program, but I don't think it's anything that would change how one would implement the algorithm in question. (I also modified your version so that all passed arrays are the same size.)