Call reordering?

Compiler: MSVC 2015

I was writing a function to measure CPU speed when I noticed something extremely odd:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
double measure_ratio(int divisions, int iterations);

int main(){
	auto t0 = clock();
	auto result = measure_ratio(10000, 100);
	//auto result = measure_ratio(14142, 100);
	auto t1 = clock();
	std::cout << t0 << std::endl
		<< t1 << std::endl;
	std::cout <<
		"Density: " << result << "\n"
		"Time: " << (t1 - t0) / (double)CLOCKS_PER_SEC << " s\n";
	return 0;
}
Output:
8
8
Density: 0.096656
Time: 0 s

This program takes ~4 seconds to finish (measured with a stopwatch). If I increase the divisions to 14142 the output is basically the same (time is still 0 s), but the program then takes ~8 seconds to finish.

Looking at the disassembly, I see
1
2
3
4
5
00007FF7A4E41164 FF 15 86 20 00 00    call        qword ptr [__imp_clock (07FF7A4E431F0h)]  
00007FF7A4E4116A 8B F8                mov         edi,eax  
00007FF7A4E4116C FF 15 7E 20 00 00    call        qword ptr [__imp_clock (07FF7A4E431F0h)]  
/*...*/
00007FF7A4E411C0 E8 3B FE FF FF       call        measure_ratio (07FF7A4E41000h)  
What the hell? Why did it do that??
Last edited on
Given say this,
r = f1() + f2() * f3();
it is unspecified which order the functions are called in.
The only dependency is in how the results are combined.
It's perfectly reasonable to call f1() first and store the result in a register, even though it will be involved last in the full expression.

But in this case, the compiler can also see that measure_ratio() doesn't depend on t0, and the second call to clock() doesn't depend on result.
1
2
3
auto t0 = clock();
auto result = measure_ratio(10000, 100);
auto t1 = clock();

As with the single statement expression, the dependencies are only resolved at the cout statement, at which point t0,t1,result must have been evaluated (which they have been).

The compiler certainly doesn't know about the special significance of the temporal nature of the two calls to clock().

Perhaps make t0 and t1 volatile.
https://en.cppreference.com/w/cpp/language/cv
Yes, I ended up solving it with volatile.
It just surprised me that I had never encountered this behavior. It also seemed odd to me that the compiler would choose reorder a call between two calls with outside-observable behavior.

Question: What counts as behavior observable from outside for the purposes of this discussion? Obviously I/O, but what about mutex locking, atomic operations, etc.?
The least requirements on a conforming implementation are:

. Accesses through volatile glvalues are evaluated strictly according to the rules of the abstract machine.

. At program termination, all data written into files shall be identical to one of the possible results that execution of the program according to the abstract semantics would have produced.

. The input and output dynamics of interactive devices shall take place in such a fashion that prompting output is actually delivered before a program waits for input. What constitutes an interactive device is implementation-defined.

These collectively are referred to as the observable behavior of the program. [ Note: More stringent correspondences between abstract and actual semantics may be defined by each implementation. — end note ]
http://eel.is/c++draft/intro.abstract#6
closed account (E0p9LyTq)
VS 2017 doesn't seem to have this reordering issue, I tested the code from here:

https://stackoverflow.com/questions/26189166/can-you-reproduce-or-explain-this-visual-c-bug-with-ctime

Of course one test doesn't really show the issue doesn't happen, only that with one test case it doesn't.

@helios, not knowing how your measure_ratio() function is coded I can't test your code. :)
Here you go:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
class C{
public:
	double r;
	double i;

	C(double r = 0, double i = 0): r(r), i(i){}
	C(const C &) = default;
	C(C &&) = default;
	C &operator=(const C &) = default;
	C &operator=(C &&) = default;
	double abs_sq() const{
		return r * r + i * i;
	}
	double abs() const{
		return sqrt(abs_sq());
	}
	const C &operator+=(const C &other){
		r += other.r;
		i += other.i;
		return *this;
	}
	C operator+(const C &other) const{
		auto ret = *this;
		ret += other;
		return ret;
	}
	const C &operator-=(const C &other){
		r -= other.r;
		i -= other.i;
		return *this;
	}
	C operator-(const C &other) const{
		auto ret = *this;
		ret -= other;
		return ret;
	}
	C operator-() const{
		return {-r, -i};
	}
	C operator*(const C &other) const{
		return {r * other.r - i * other.i, r * other.i + i * other.r};
	}
	const C &operator*=(const C &other){
		*this = *this * other;
		return *this;
	}
};

bool is_in_set(const C &c, int iterations){
	C z = c;
	for (int i = 1; i < iterations; i++){
		if (z.abs_sq() > 2 * 2)
			return false;
		z = z * z + c;
	}
	return z.abs_sq() <= 2*2;
}

double measure_ratio(int divisions, int iterations){
	std::uint64_t in = 0;
	std::uint64_t total = divisions;
	total *= total;
	double increment = 4.0 / divisions;
	for (int iy = 0; iy < divisions; iy++){
		double y = iy * increment - 2;
		for (int ix = 0; ix < divisions; ix++){
			double x = ix * increment - 2;
			if (is_in_set({x, y}, iterations))
				in++;
		}
	}
	return (double)in / (double)total;
}
closed account (E0p9LyTq)
Compiling your code now with the latest VS 2017, here is one run:
0
5271
Density: 0.0966761
Time: 5.271 s

And another:
0
5240
Density: 0.0966761
Time: 5.24 s

So based on this, helios, I would have to conclude that the VS 2017 compiler doesn't reorder unnecessarily.

Thanks for the code, are the test results I had close to what you were expecting?
closed account (E0p9LyTq)
Incidentally using measure_ratio(14142, 100) instead of measure_ratio(10000, 100) has these results:

0
10485
Density: 0.0966602
Time: 10.485 s
0
10500
Density: 0.0966602
Time: 10.5 s


Windows 10 Pro 64-bit, Intel Core i5 2400 @ 3.10GHz, 12.0GB Dual-Channel DDR3 @ 532MHz, compiled as 32-bit. Compiling for 64-bit the results are marginally slower.
Last edited on
Oh! I thought you were saying that 14142 and 10000 were taking the same amount of time! Now that would have been weird.

It's strange that you say that x86-64 runs slower than x86. I tested on a Ryzen and on a Haswell and on both the x86-64 build was substantially faster. 50% faster on the Ryzen and 37% faster on the Haswell. I also have a Nehalem, but it's a Linux box and I don't have any x86 VMs set up on it, so I could only test an x86-64 version.

Here are some numbers if you're interested:
1
2
3
4
5
6
7
8
9
Unit: iterations per microsecond

                           x86-64       x86
Ryzen 1700                    316       208
Ryzen 1700 (Hyper-V)          291       192
Haswell i5                    259   
Haswell i5 (VirtualBox)       258       188
Nehalem                       236   
K6-2 (455 MHz, c. 1999)                   5.94
Last edited on
closed account (E0p9LyTq)
I was a bit surprised that 64-bit code ran a bit slower than 32-bit on 64-bit Windows 10, no matter how many divisions were calculated.

For instance, 64-bit execution speed for 14142 divisions was about 11.6 seconds vs. 10.5 for 32-bit. 10000 divisions was similarly marginally slower when run as 64-bit: 5.7/5.8 vs. 5.25.

But then the machine I ran the code on is a bit of a Frankenstein, having been upgraded from 64-bit Win 7 to Win 10, with lots of hardware changes over the years. I don't know what effect that could have, and frankly don't really care,
Last edited on
It also seemed odd to me that the compiler would choose reorder a call between two calls with outside-observable behavior.
What outside observable behavior? Clock() is probably inline so the compiler knows what it does and determines that it has no outside observable behavior other than the return value.

Along those lines, you could also probably get around this by using another function in a different source file:

file1.cpp:
1
2
3
4
clock_t myClock()
{
    return clock();
}

main.cpp:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
double measure_ratio(int divisions, int iterations);

// Compiler has no idea what this does while compiling main.cpp. For all it knows,
// myClock() launches a Mars probe and operates it for 18 years.
extern clock_t myClock();

int main(){
	auto t0 = myClock();
	auto result = measure_ratio(10000, 100);
	auto t1 = myClock();
	std::cout << t0 << std::endl
		<< t1 << std::endl;
	std::cout <<
		"Density: " << result << "\n"
		"Time: " << (t1 - t0) / (double)CLOCKS_PER_SEC << " s\n";
	return 0;
}
What outside observable behavior?
Getting the time is an I/O operation.

Clock() is probably inline
Nope, not in the implementation I'm using. It's imported from a DLL.
Even if if wasn't, extracting the call out to a different translation unit should not help, since the compiler can still tell that t1 doesn't depend on the computation of result.
Getting the time is an I/O operation.
If it's reading from some memory location that hasn't been declared volatile then it might conclude that that there is no observable behavior. Of course all of this is speculation. It would depend on the specifics of the call to clock().
Nope, not in the implementation I'm using. It's imported from a DLL.
Well damn that's strange then! It sounds like a bug to me.
Even if if wasn't, extracting the call out to a different translation unit should not help, since the compiler can still tell that t1 doesn't depend on the computation of result.
But it doesn't know what side effects the function has so it must call the function in the order given.

Hmm. Maybe the issue is actually in measure_ratio(). If the compiler can determine that measure_ratio() has no side effects then it might decide to move it. Put another way, maybe the compiler moved the call to measure_ratio() down, rather than moving the call to clock() up.
Even if if wasn't, extracting the call out to a different translation unit should not help, since the compiler can still tell that t1 doesn't depend on the computation of result.
But it doesn't know what side effects the function has so it must call the function in the order given.
That was my fault when I wrote the OP. measure_ratio() is in the same translation unit, I just didn't want to add unnecessary details.

Put another way, maybe the compiler moved the call to measure_ratio() down, rather than moving the call to clock() up
Yes, exactly. Both clock() calls happened in the correct order (at least AFAICT). It's the measure_ratio() call that was moved to a later point. Basically it rewrote it to
1
2
3
4
5
6
7
8
9
10
int main(){
	auto t0 = clock();
	auto t1 = clock();
	std::cout << t0 << std::endl
		<< t1 << std::endl;
	std::cout <<
		"Density: " << measure_ratio(10000, 100) << "\n"
		"Time: " << (t1 - t0) / (double)CLOCKS_PER_SEC << " s\n";
	return 0;
}
This is a historied issue with C++, and remains for valid C++ reasons. For both some context and a solution, check out Chandler Carruth’s answer at SO:

    https://stackoverflow.com/a/38025837/2706707
closed account (E0p9LyTq)
That was my fault when I wrote the OP. measure_ratio() is in the same translation unit, I just didn't want to add unnecessary details.

Just so you know when I ran your complete code it was contained within the same source file/translation unit.

VS2017 didn't optimize away the order, using default settings and an empty, need to add files, project.

OK, I'll think I'll just wander off now since the religious wars look like they are starting over this issue.

*waves*
Duthomhas: Yes, I saw that thread before. That's where I got the idea of using volatile, from this sentence:
DoNotOptimize(<expr>) forces the result of <expr> to be stored in either memory or a register.

But what I don't understand is what that function might be doing. What magic could they do to force that expression to execute at that point?
 
benchmark::DoNotOptimize(x += i);
It is using compiler magic to make the event “observable”.

I admit I don’t really understand it myself (nor do I care to), but Word of God says “that’s how to do it.”

Good enough for me.
Heh. Fair enough.
Topic archived. No new replies allowed.