Why doesn't Return Value Optimisation (RVO) break the function calling convention?

As stated in various sources, notably the holy standard, a compiler may, under certain conditions, avoid copying an object created and returned by a function. It does so by allocating the needed space in advance on the caller's stack and passing its address to the callee. This feature is named Return Value Optimisation (RVO).

An example might perhaps better clarify this intricacy:
#include <iostream>
#include <vector>

using namespace std;

vector<int> my_example(){
    vector<int> v;
    v.push_back(1);
    v.push_back(2);

    return v;
}

int main(){
    auto v = my_example();

    cout << "sizeof v: " << sizeof(v) << endl;
    for(int i = 0; i < v.size(); i++){
        cout << "[" << i << "] " << v[i] << endl;
    }
}


Now follow my steps within the debugger:
(lldb) breakpoint set -b my_example
Breakpoint 1: where = cpp_helloworld`my_example() + 24 at main.cpp:10, address = 0x0000000000000e08
(lldb) r
Process 13612 launched: '/home/god/workspace/eclipse/cpp_helloworld/Debug/cpp_helloworld' (x86_64)
Process 13612 stopped
* thread #1, name = 'cpp_helloworld', stop reason = breakpoint 1.1
    frame #0: 0x0000555555554e08 cpp_helloworld`my_example() at main.cpp:10
   7   	using namespace std;
   8   	
   9   	vector<int> my_example(){
-> 10  	    vector<int> v;
   11  	    v.push_back(1);
   12  	    v.push_back(2);
   13  	
(lldb) expr -- &v
(std::vector<int, std::allocator<int> > *) $0 = 0x00007fffffffe4f0
(lldb) bt
* thread #1, name = 'cpp_helloworld', stop reason = breakpoint 1.1
  * frame #0: 0x0000555555554e08 cpp_helloworld`my_example() at main.cpp:10
    frame #1: 0x0000555555554ed8 cpp_helloworld`main at main.cpp:19
    frame #2: 0x00007ffff7157f4a libc.so.6`__libc_start_main + 234
    frame #3: 0x0000555555554d0a cpp_helloworld`_start + 42
(lldb) frame select 1
frame #1: 0x0000555555554ed8 cpp_helloworld`main at main.cpp:19
   16  	
   17  	
   18  	int main(){
-> 19  	    auto v = my_example();
   20  	    cout << "sizeof v: " << sizeof(v) << endl;
   21  	    for(int i = 0; i < v.size(); i++){
   22  	        cout << "[" << i << "] " << v[i] << endl;
(lldb) expr -- &v
(std::vector<int, std::allocator<int> > *) $1 = 0x00007fffffffe4f0


So apparently ::my_example and ::main share the same location for the vector.
Back to the debugger session:
(lldb) frame select 0
(lldb) register read
General Purpose Registers:
       rax = 0x00007fffffffe4f0
       rbx = 0x0000000000000000
       rcx = 0xfca678bc51299800
       rdx = 0x00007fffffffe608
       rdi = 0x00007fffffffe4f0
       rsi = 0x00007fffffffe5f8
       rbp = 0x00007fffffffe470
       rsp = 0x00007fffffffe430
        r8 = 0x00007ffff7dd5fc0  libstdc++.so.6`(anonymous namespace)::num_get_w
        r9 = 0x00007ffff7dcada0  libstdc++.so.6`typeinfo for std::locale::facet
       r10 = 0x000000000000033f
       r11 = 0x00007ffff716e6d0  libc.so.6`__GI___cxa_atexit
       r12 = 0x0000555555554ce0  cpp_helloworld`_start
       r13 = 0x00007fffffffe5f0
       r14 = 0x0000000000000000
       r15 = 0x0000000000000000
       rip = 0x0000555555554e08  cpp_helloworld`my_example() + 24 at main.cpp:10
    rflags = 0x0000000000000206
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000

Because the stack grows downwards, the frame for ::my_example spans [%rsp, %rbp] = [0x00007fffffffe430, 0x00007fffffffe470]. Since %rbp < 0x00007fffffffe4f0, the vector's address lies above that frame and must belong to its parent.
Furthermore, the same address is in %rdi, so the caller somehow knew in advance that it had to pass the address of the vector in %rdi. *

And here is my first question. Because these are two public functions, they could, though not in this example, belong to two different translation units. So how the heck can a compiler, when producing the code for ::my_example(), assume that the caller is playing the same game, i.e. expecting this optimisation as well?
What if the caller did not provide the vector address in %rdi and expected ::my_example to return a copy of the vector?

The follow-up is analogous. In the named RVO case, the callee might not even be able to apply this optimisation in the end. So how can a compiler, when translating ::main, expect, from the caller's point of view, that the address provided for the vector will be valid and will indeed contain the initialised vector?

Regards,

Dean

* I am not posting the assembly for the sake of brevity; the important bit is that %rdi is not copied from something else in the function before this point.
Since the ABI for C++ functions is not defined, the compiler could define internally that any function that returns a type T that is a candidate for RVO must receive a T * as its first argument. In other words, the compiler silently rewrites your function signature to
void my_example(std::vector<int> &);
This is the reason why you can't mix compiler versions when linking C++ libraries.
The whole reason RVO (and NRVO) became so popular back in the 90s is that it does not require any awareness on the caller side.

In every C and C++ ABI spec (they are very well defined, btw, just not by the C or C++ standards), the caller provides storage for the result of a function call. In the SysV x64 ABI, which you seem to be using, if the return type won't fit in two CPU registers, space for it is allocated by the caller and its address is passed to the function in %rdi (and the function must return the same address in %rax, for whatever reason).

Without RVO, the function would have created the vector on its local stack, and before returning, would have copied/moved it to (%rdi). With RVO, the function uses (%rdi) directly.
Thanks for your replies. Oops, I was indeed quietly assuming the SysV amd64 ABI; specifically, I am using Linux x86-64 / gcc 7.2.1. Cubbi's explanation convinced me :-)

Just a minor clarification:
@helios I fear the behaviour is more subtle. My question was: assume that the functions are compiled in two different translation units, though this is not shown in the first post. Also, assume that the same compiler is being used for both translation units. At the caller side, the compiler has no inherent knowledge of the callee function's body, nor whether NRVO is going to occur. The only thing it sees is a declaration such as:
std::vector<int> my_example();
If you are not convinced by the above argument, reading the symbols from the compiled object (nm -nC object.o) shows the signature "my_example()", rather than "my_example(std::vector<int>&)".
Intriguing, isn't it?
Just rephrasing Cubbi's point: this is just a side effect of the SysV amd64 ABI, in which returned objects need to be allocated by the caller anyway and their address passed as the first argument of the function call, as long as sizeof(object) > 16 (2 registers) or the type defines its own copy ctor or dtor [1]. I guess other platform ABIs have similar specs.

Dean

[1] https://www.uclibc.org/docs/psABI-x86_64.pdf pp. 19 and 22

At the caller side, the compiler has no inherent knowledge of the callee function's body, nor whether NRVO is going to occur.
It doesn't need to know that. The compiler compiling the call can rewrite the signature as long as the return type is known at the call site (it must be known for the program to be well-formed), and if this type matches the definition seen at the definition site, both compilers can deterministically deduce the same signature rewrite.

If you are not convinced by the above argument, reading the symbols from the compiled object (nm -nC object.o) shows the signature "my_example()", rather than "my_example(std::vector<int>&)".
That's just the signature that's decoded from the mangled name. The mangled name matches the signature in the code, not the effective signature in the ABI.

Just rephrasing Cubbi's point: this is just a side effect of the SysV amd64 ABI, in which returned objects need to be allocated by the caller anyway and their address passed as the first argument of the function call, as long as sizeof(object) > 16 (2 registers) or the type defines its own copy ctor or dtor [1].
I think in this case it's as Cubbi says, but the hypothetical behavior I describe is possible and could work regardless of the particular calling convention.
I think I now understand your point. That would be like a compiler implementing an ABI of its own.

Thanks,
Dean