Interfacing C++ with assembler

I'm optimising some code by dropping into assembler. I wrote a C++ function that does what the assembler will do, got the compiler to output the assembler for it, then trimmed the result down to just what is needed, assembled it, and substituted the resulting object file for the one produced by the C++ compiler. That all works nicely.

Now my concern is that my hand-coded version will be full of magic numbers: offsets into structs, classes and vtables.

It's easy enough to pick these out of the compiler-produced assembler and reuse them; but my concern is that doing so makes the assembler code very vulnerable to even minor changes in the C++ code.

I'm considering writing a C wrapper around the assembler function that would use the compiler's view of the classes and structs and some pointer math, something like:

#include <cstddef>   // ptrdiff_t, offsetof

extern "C" void asmFunc( const ptrdiff_t *d );   // C linkage, so the symbol isn't mangled

void wrapperForAsmFunc( ... ) {
    ptrdiff_t offsets[ ] = {
        offsetof( X, member ),   // the compiler's own pointer math for class X
        offsetof( Y, member ),   // likewise for struct Y
        ...
    };

    asmFunc( offsets );
}


to get the compiler to derive an array of the magic numbers the asm routine needs and pass them in, thus isolating the asm from minor changes to the classes and structs involved, from packing boundaries, compiler versions, etc.
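By way of illustration, something like the following compiles and shows the compiler deriving the offsets itself. Everything here is made up: Row stands in for the real classes/structs, and asmFunc is just a stub for the hand-written routine.

#include <cstddef>   // ptrdiff_t, offsetof
#include <cstdio>

// Hypothetical layout, standing in for the real classes/structs.
struct Row {
    double weight;
    double value;
    int    index;
};

// Stub standing in for the hand-written assembler routine.
void asmFunc( const ptrdiff_t *offsets )
{
    std::printf( "weight at %ld, index at %ld\n",
                 (long)offsets[ 0 ], (long)offsets[ 1 ] );
}

void wrapperForAsmFunc()
{
    // The compiler derives these from its own view of the layout, so they
    // track packing, padding and member reordering automatically.
    const ptrdiff_t offsets[ ] = {
        offsetof( Row, weight ),
        offsetof( Row, index )
    };

    asmFunc( offsets );
}

int main() { wrapperForAsmFunc(); return 0; }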

1. Anyone done or seen anything similar that is accessible that I might crib from?

2. Beyond "it's not OO" and "it's not proper C++", can you see any show stoppers with the approach?

Cheers, Buk.
If the function contains virtual calls and member accesses, it's probably a poor target for hand-optimizations. Are you sure you profiled correctly? Could the hotspot be just part of the function?
> Are you sure you profiled correctly?

Helios,

Yes. This one function accounts for over 50% of the program's runtime, for runs that total 90+ hours.

It is essentially 4 loops -- 2 of which contain small nested loops -- running over a very large dataset: effectively 7 parallel vectors that are applied (multiply/dot product) in various combinations to an eighth, 2D vector, which is represented as a sparse array. All doubles.

[huge chunk of detail and rant elided :)]

Basically, I'm working, by choice and for good reasons, with MSVC 9. Maybe C++ compiler optimisers have got a lot more clever since, and maybe with the rationalisations to the STL that come with C++11 a modern compiler would do a better job; but what I'm seeing is 30 lines of C++, consisting of 4 relatively simple loops, expanding to 300 lines of assembler with 20+ embedded method calls before inlining; with inlining and optimisation that then becomes 450 lines of interleaved and unintelligible mess that is far from optimal.

A simple example: there are 16 SSE registers (32 if you split them), yet only 4 are ever used. The routine contains two constants -- one static, one computed up front -- that are reused in all four loops. They ought to be loaded into registers once; instead they are reloaded over and over, even in the optimised code. (And they are both marked const!)
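To give a (completely invented) picture of the shape involved: both constants are loop-invariant across all four loops, so in principle each needs loading into a register exactly once.

// Illustrative only -- names and expressions are made up.
double computeScale();                          // derives the second constant up front

void applyAll( double *y, const double *x, int n )
{
    static const double kFixed = 0.5;           // the static constant
    const double kDerived = computeScale();     // computed once, before the loops

    for ( int i = 0; i < n; ++i )               // loop 1 of 4
        y[ i ] = x[ i ] * kFixed;

    for ( int i = 0; i < n; ++i )               // loop 2 of 4
        y[ i ] += x[ i ] * kDerived;

    // ... two more loops, both reusing kFixed and kDerived ...
}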

Another: the body of one of the loops is Y[ i ] = X[ i ] * c; The result of expanding the iterators, followed by the optimiser's attempts, has that loop as (pseudo-code):
move    reg1, i
move    reg2, &X
move    reg3, &Y
move    reg4, n
move    xmm3, c

loop:

move    xmm0, xmm3                          ; xmm0 <- c
mult    xmm0, qword ptr [ reg2 + reg1 ]     ; c * X[ i ]
move    qword ptr [ reg3 + reg1 ], xmm0     ; Y[ i ] <- X[ i ] * c
incr    reg1
cmp     reg1, reg4
jle     loop


It looks fine, but the problem is that the loop body has a cheap SSE reg-to-reg transfer, followed by an expensive memory-to-register multiply and then immediately an expensive register-to-memory store.

The two consecutive memory operations stall the pipeline. A simple reordering of the SSE operations makes the loop 25% more efficient.

...

loop:

move    xmm0, qword ptr [ reg2 + reg1 ]     ; xmm0 <- X[ i ]
mult    xmm0, xmm3                          ; X[ i ] * c
move    qword ptr [ reg3 + reg1 ], xmm0     ; Y[ i ] <- X[ i ] * c
incr    reg1
cmp     reg1, reg4
jl      loop


(Please note: the above is a gross simplification of the actual loop, which is unrolled by a factor of 4; the point is that there is a lot of fat in there to be excised.)

Combine that with the desire to use SSE vectorisation to do two operations at a time, and the need to interlock (lock prefix) some instructions, and I hope it becomes obvious that this is not "premature optimisation" :)
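For the record, the two-doubles-at-a-time idea can also be sketched from C++ with SSE2 intrinsics (available in MSVC 9); the hand-written assembler would do the equivalent with movapd/mulpd. This sketch assumes 16-byte-aligned arrays and n a multiple of 2, and the names are mine:

#include <emmintrin.h>   // SSE2 intrinsics

// Y[ i ] = X[ i ] * c, two doubles per iteration.
void scaleSSE2( double *y, const double *x, double c, int n )
{
    const __m128d vc = _mm_set1_pd( c );                // c in both lanes
    for ( int i = 0; i < n; i += 2 )
    {
        __m128d vx = _mm_load_pd( x + i );              // X[ i ], X[ i+1 ]
        _mm_store_pd( y + i, _mm_mul_pd( vx, vc ) );    // store both products
    }
}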

Anyway, if there are any x64 assembler programmers with experience of interfacing with C++ -- if anyone actually read this far, I guess they might have -- then I'd love to make contact.

Cheers, Buk.

Would it be sensible to temporarily move the data into simpler data structures, possibly buffers, and pass those to the Assembly function instead? That way you'll also save yourself all those virtual calls, which are quite expensive.
Also, it's possible the compiler is having trouble optimizing very much if there are a lot of implicit member function calls.
> Would it be sensible to temporarily move the data into simpler data structures, possibly buffers, and pass those to the Assembly function instead?

That makes a lot of sense. Though I don't have to actually spend cycles moving the data.

Rather than replacing the whole method with an assembler routine, I've replaced each of the four loops with a C subroutine (marked __declspec(noinline)) that takes pointers (to arrays of doubles, arrays of ints, etc.), const ints (for loop limits) and const doubles (for derived loop-invariant values).

The result is that all of the member accesses, struct offsets etc. are resolved by the compiler in the C++ code, and once you get into the C code you just have pointers to arrays and constants.
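In other words, the pattern is roughly this (names invented; the real kernels take whatever pointers, counts and loop-invariant constants each loop needs):

#include <vector>

// One of the four loop kernels: plain pointers and constants only, so the
// assembler it generates contains no class/struct offsets or method calls.
__declspec(noinline)
static void scaleKernel( double *y, const double *x, const double c, const int n )
{
    for ( int i = 0; i < n; ++i )
        y[ i ] = x[ i ] * c;
}

// The calling C++ resolves all the member accesses and iterator expansion
// before the kernel is entered.
struct Worker {
    std::vector<double> x, y;
    double scale;

    void applyAll()
    {
        scaleKernel( &y[ 0 ], &x[ 0 ], scale, (int)x.size() );
        // ... the other three loops get the same treatment ...
    }
};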

That has two beneficial side effects:

1) The assembler output for those subroutines is much simpler, because the optimiser hasn't had the opportunity to reorder instructions across their boundaries; which makes the code much easier to understand, and thus to modify.

2) All the implicit method calls, class and struct offsets etc. have been resolved before those subroutines are called.

That makes the code within the subroutines impervious to most minor changes in the C++ code by removing all the magic numbers I was scared of.

I can now replace those C subroutines with hand-crafted assembler, secure in the knowledge that changes to the structures external to them are all resolved by the compiler before they are called.

Which is the Holy Grail I was seeking. So thank you for making me look at the problem differently :)

Cheers, Buk