Strange compiler behavior

I'm using MSVC.

I'm writing an emulator, and I have all the opcodes implemented in their own functions, all in the same file. The functions are generated at compile time from a description of the emulated CPU. The emulator reads a byte from RAM and uses a lookup table of function pointers to decide which function to call.
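For readers following along, here is a minimal sketch of the kind of dispatch loop described above. It is not the poster's actual code: the `Cpu` struct, the `op_*` handlers, `make_table`, and `run` are all hypothetical names made up for illustration, and the real handlers are generated from a CPU description. The point it shows is that the table index comes from a byte read at run time, so each call goes through a function pointer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU state; the real emulator's state is not shown in the thread.
struct Cpu {
    std::vector<std::uint8_t> ram;  // emulated memory
    std::size_t pc = 0;             // program counter
    int acc = 0;                    // a single accumulator register
    bool halted = false;
};

using OpcodeFn = void (*)(Cpu&);

// Stand-ins for the generated opcode handlers (assumed, not from the thread).
void op_nop(Cpu&) {}
void op_inc(Cpu& cpu) { ++cpu.acc; }
void op_dec(Cpu& cpu) { --cpu.acc; }
void op_halt(Cpu& cpu) { cpu.halted = true; }

// Build a 256-entry lookup table; unassigned opcodes halt the CPU in this sketch.
std::array<OpcodeFn, 256> make_table() {
    std::array<OpcodeFn, 256> t{};
    t.fill(&op_halt);
    t[0x00] = &op_nop;
    t[0x01] = &op_inc;
    t[0x02] = &op_dec;
    return t;
}

const std::array<OpcodeFn, 256> opcode_table = make_table();

// Fetch a byte from RAM and call the matching handler through the table.
// The opcode value is only known at run time, so this is an indirect call.
int run(Cpu& cpu) {
    while (!cpu.halted && cpu.pc < cpu.ram.size()) {
        std::uint8_t opcode = cpu.ram[cpu.pc++];
        opcode_table[opcode](cpu);
    }
    return cpu.acc;
}
```

In a layout like this, moving one handler between source files changes nothing in the loop itself, only which translation unit the handler is compiled in.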

Now, I was trying to do a little bit of optimization on the most-used opcodes. I wrote a replacement for one of them and placed it in a different file (A.cpp) from all the other opcodes (because the file that contains them is generated). Then I tried moving my hand-written implementation to the generated file (B.cpp), and I made no other changes.
When the function was implemented in B.cpp, the program ran 2.4% faster than when it was implemented in A.cpp.

What could be causing this?
> When the function was implemented in B.cpp, the program ran 2.4% faster than when it was implemented in A.cpp.
> What could be causing this?

Most probably because of automatic inline substitution when the function is in B.cpp:
http://coliru.stacked-crooked.com/a/8032bda14ecf4841
Interesting question!

I don't know for certain, but my hunch is that the difference comes from link-time optimizations that the linker does not perform across compilation units by default.

Turning on whole-program optimization allows the MSVC linker to inline functions across compilation units and to perform call-graph partitioning to improve locality between them.

https://msdn.microsoft.com/en-us/library/0zza0de8.aspx

@JLBorges
What's going on with your example, there? f() is not defined.
Never mind, I see.
closed account (48T7M4Gy)
And for anyone who wants to read up on 'auto inlineable':
http://stackoverflow.com/questions/18726337/inline-functions-automatic-inline
Enabling link time optimisation (-flto, generate GIMPLE) with the GNU tool-chain:
with -flto:
-----------
                  inlineable:      0.002 millisecs.
not inlineable without -flto:      0.002 millisecs.

http://coliru.stacked-crooked.com/a/6e7a55b72a7b1153
I found this resource that explains a bit of what GCC's whole-program optimization can accomplish:
https://gcc.gnu.org/projects/lto/whopr.pdf
But the functions cannot possibly be inlined into their call site. They're called through a lookup table of function pointers.
Also, I'm using LTCG, of course.
> But the functions cannot possibly be inlined into their call site.
> They're called through a lookup table of function pointers.

Calling through a lookup table of function pointers does not disable inline substitution; if an implementation can figure out the details, even calls through a lookup table are inlined.

The extent of optimisation that can be performed by the link time optimiser may be somewhat more limited than the optimisation that can be performed by the full optimiser used at compile-time.

int g( int c ) { return ++c ; }
int h( int c ) { return g(c) + 1 ; }
int i( int c ) { return h(c) + 1 ; }
int j( int c ) { return i(c) + 1 ; }

static int (* const table [] )(int) { &g, &h, &i, &j } ;

int fn( int arg, unsigned int pos )
{
    return table[pos%4](arg) ;
}

int foo() 
{
    return fn( 10, 3 ) ;
    // return 14 ;
    
    /* completely inlined, result computed at compile-time

        movl    $14, %eax
        retq
    */
}

int bar( int a, unsigned int b ) 
{
    return fn( a, b ) ;
    // jump to table[ b&3 ]
    
    /* fn inlined, followed by a jump to the result of the table lookup

        andl    $3, %esi
        jmpq    *_ZL5table(,%rsi,8)    # TAILCALL
    */
}

int baz()
{
    return bar( 25, 19 ) ;
    // return 29 ;
    
    /* completely inlined, result computed at compile-time

        movl    $29, %eax
        retq
    */
}

http://coliru.stacked-crooked.com/a/1183d25ee694a29a
Like I said in the OP, the lookup table is indexed by reading a byte from RAM. The byte that's read cannot be anticipated by the compiler because it depends on I/O values and the emulated program's execution. So like I said, there's no way the compiler is inlining the functions.

Yes, the function that performs the lookup itself may be inlined, but I don't see how this is relevant to the discussion.
Why not tell the compiler to generate the assembly code in the two cases and compare them?
That would answer the what, not the why.
True, but how can you know why it's doing something until you know exactly what it's doing?