Google Benchmark, clang++ different results?

I was trying the Google Benchmark example from Chandler Carruth's CppCon talk and I get completely different results (for vector::reserve) on OS X and on my PC (running Linux), and I am not sure why.
https://github.com/Horki/benchtest

static void bench_reserve(benchmark::State &state) {
  while (state.KeepRunning()) {
    vector<int> v;
    v.reserve(1);
  }
}
BENCHMARK(bench_reserve);


OS X

Run on (8 X 2200 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 6291K (x1)
-------------------------------------------------------
Benchmark            Time           CPU     Iterations
-------------------------------------------------------
bench_create         0 ns          0 ns     1000000000
bench_reserve        0 ns          0 ns     1000000000
bench_push_back      2 ns          2 ns      355207340

Edit: after updating Google Benchmark on OS X, new results:
Run on (8 X 2200 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 6291K (x1)
Load Average: 1.38, 1.73, 1.70
----------------------------------------------------------
Benchmark               Time             CPU    Iterations
----------------------------------------------------------
bench_create        0.308 ns        0.308 ns    1000000000
bench_reserve       0.302 ns        0.302 ns    1000000000
bench_push_back      2.05 ns         2.05 ns     343612248

Arch Linux

Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 8192K (x1)
Load Average: 0.05, 0.12, 0.15
----------------------------------------------------------
Benchmark               Time             CPU    Iterations
----------------------------------------------------------
bench_create        0.266 ns        0.266 ns    1000000000
bench_reserve        9.59 ns         9.59 ns      69819729
bench_push_back      9.58 ns         9.58 ns      73438758
My guess is that the default capacity under OS X is >= 1 and the default capacity under Arch Linux is 0. Thus v.reserve(1) is a NOP under OS X and allocates under Linux.
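A quick way to test that guess (a minimal standalone sketch, not part of the benchmark itself) is to print the capacity before and after the reserve(1) call on each platform:

#include <iostream>
#include <vector>

int main() {
  std::vector<int> v;
  // If the default-constructed capacity were already >= 1, reserve(1) would be a no-op.
  std::cout << "default capacity: " << v.capacity() << '\n';
  v.reserve(1);
  std::cout << "after reserve(1): " << v.capacity() << '\n';
}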
@dhayden, oddly, I ran it with *strace* and didn't see it allocate under Linux (it doesn't call mmap).

Are there some super hidden internals that compile *that part* differently under clang on OS X and Linux?
Are you sure the compiler is not just optimizing away the call to reserve? I don't see that happening on x86-64 with Compiler Explorer, but if I comment out the call to reserve it doesn't generate any code related to the creation of the vector (https://godbolt.org/z/Di5YgV). Based on the timings, it seems like that is what's happening on OS X even with the call to reserve.
I can see lots of opportunity for aggressive optimisation here: the compiler can spot that vector<int> v; is never actually used for anything and reduce the whole function to a NOP.
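In other words (a rough sketch of the transformation the optimizer is allowed to make, not actual compiler output), the body can legally collapse to an empty timed loop because v is never observed:

static void bench_reserve(benchmark::State &state) {
  while (state.KeepRunning()) {
    // vector<int> v;  // never read, so it can be removed
    // v.reserve(1);   // the allocation may be elided under C++14 rules, so this goes too
  }
}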

Do you have the same version of the compiler on both machines?

Can you compare the assembler (compile with the -S flag) to see if both versions have some kind of vector code, or whether it has been optimised out.

What is the underlying precision of the clock being used to measure things?
For example, the struct timeval of gettimeofday() may have a microsecond field, but that doesn't imply it increments by +1 every single microsecond.

Perhaps it's using the processor TSC.
https://stackoverflow.com/questions/3388134/rdtsc-accuracy-across-cpu-cores
But make sure it's actually configured properly to enable it to give consistent answers across all cores.
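As a sanity check on the clock-granularity point (a minimal sketch, assuming a POSIX system where clock_getres is available; it says nothing about which clock the benchmark library actually uses), you can ask the OS what resolution the monotonic clock advertises:

#include <cstdio>
#include <time.h>

int main() {
  timespec res{};
  // Prints the advertised granularity of the clock; a coarse value here
  // could explain timings that round down to 0 ns.
  if (clock_getres(CLOCK_MONOTONIC, &res) == 0) {
    std::printf("CLOCK_MONOTONIC resolution: %ld.%09ld s\n",
                static_cast<long>(res.tv_sec), static_cast<long>(res.tv_nsec));
  }
}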

As in the GitHub repo https://github.com/Horki/benchtest,
on both machines it is built and run as

clang++ -O3 -std=c++17 -lc++abi -Wall -Werror -pedantic -fno-exceptions -fno-rtti -pthreads -o bench bench.cpp -lbenchmark

Assembly generated with clang++ -O3 -std=c++17 -S bench.cpp:

Linux
https://github.com/Horki/benchtest/blob/master/clang_linux_bench.s#L16
# %bb.0:
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset %rbx, -16
	movq	%rdi, %rbx
	movq	(%rbx), %rax
	testq	%rax, %rax
	je	.LBB0_2
	.p2align	4, 0x90
.LBB0_5:                                # =>This Inner Loop Header: Depth=1
	addq	$-1, %rax
	movq	%rax, (%rbx)
	movl	$4, %edi                # 4 bytes requested (reserve(1) for an int)
	callq	_Znwm@PLT               # operator new(unsigned long)
	movq	%rax, %rdi
	callq	_ZdlPv@PLT              # operator delete(void*)
	movq	(%rbx), %rax
	testq	%rax, %rax
	jne	.LBB0_5


OS X
https://github.com/Horki/benchtest/blob/master/clang_osx_bench.s#L20
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%rbx
	pushq	%rax
	.cfi_offset %rbx, -24
	movq	%rdi, %rbx
	movq	(%rdi), %rax
	testq	%rax, %rax
	je	LBB0_2
	.p2align	4, 0x90
LBB0_5:                                 ## =>This Inner Loop Header: Depth=1
	decq	%rax                    ## just decrements the iteration counter;
	movq	%rax, (%rbx)            ## the new/delete pair has been optimized away
	testq	%rax, %rax
	jne	LBB0_5


EDIT:
On both machines the latest available clang is installed:
Apple LLVM version 10.0.1 (clang-1001.0.46.3)
Target: x86_64-apple-darwin18.5.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin


clang version 8.0.0 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
OK, so version 10 of the compiler managed to remove these instructions from the code.
movl	$4, %edi
	callq	_Znwm@PLT
	movq	%rax, %rdi
	callq	_ZdlPv@PLT
	movq	(%rbx), %rax

Do you now understand why fewer = faster?
@"salem c", why linux clang is not doing optimization?
You said "on both machines the latest clang is installed".
Does that mean the latest official release, so that both clang versions are exactly the same (sorry, I can't make sense of the Clang/LLVM versions in the link)?
Or does it just mean you used the latest version that was available in the official package manager on those systems?

If you used different versions of clang I don't think it should be surprising that you might get slightly different instructions.
And what exactly is it that you want to benchmark? If you are not trying to benchmark the compiler's ability to optimize away useless code I think you are supposed to use benchmark::DoNotOptimize.

static void bench_create(benchmark::State &state) {
  while (state.KeepRunning()) {
    vector<int> v;
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(bench_create);

static void bench_reserve(benchmark::State &state) {
  while (state.KeepRunning()) {
    vector<int> v;
    v.reserve(1);
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(bench_reserve);

static void bench_push_back(benchmark::State &state) {
  while (state.KeepRunning()) {
    vector<int> v;
    v.reserve(1);
    v.push_back(42);
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(bench_push_back);


My understanding is that this will still allow optimizations but v will not be optimized away.
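For the push_back case, Chandler's talk also pairs the "escape" with a write barrier, which in Google Benchmark terms is benchmark::ClobberMemory(). A possible variant along those lines (the name bench_push_back_clobber is made up here for illustration):

static void bench_push_back_clobber(benchmark::State &state) {
  while (state.KeepRunning()) {
    std::vector<int> v;
    v.reserve(1);
    benchmark::DoNotOptimize(v.data());  // make the buffer pointer observable
    v.push_back(42);
    benchmark::ClobberMemory();          // force the store of 42 to be treated as visible
  }
}
BENCHMARK(bench_push_back_clobber);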
@Peter87

Trying to figure out whether I am missing some *flags* when compiling; maybe there are additional compile flags that would make "clang" produce what "apple-clang" is producing?

I thought "clang" and "apple-clang" were pretty similar.

On both machines, the latest available clang:
[PC]
https://en.wikipedia.org/wiki/Clang#Status_history
20 March 2019: Clang 8.0.0 released --> this one is installed on the PC

[Apple]
https://en.wikipedia.org/wiki/Xcode#Latest_versions
Xcode 10.2: Apple clang 10.0.1 (clang-1001.0.46.3) -> this one is installed on the Mac

I thought "clang" and "apple-clang" were pretty similar
https://en.wikipedia.org/wiki/Xcode#Toolchain_versions says Xcode 10.1 actually had clang 6.0.1, and doesn't yet have an entry for 10.2, but it can't be far off; it's probably in the 6.x series.

Actually, looking at https://github.com/apple/swift-llvm/blob/swift-5.0-RELEASE/CMakeLists.txt it's clang 7.0; I updated that wiki page.