Please look at this test sample:
Very simple code: for (volatile long i = 0; i < 1000000000; i++); // a loop of one billion iterations
It took about 5 seconds to complete.
Let's divide: 3.2 x 10^9 / (1 x 10^9) / 5.0 ≈ 0.64, and 0.64 x 100% ≈ 64%
(My computer's clock speed is 3.2 GHz.)
What do you think about this result and C++ performance?
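One way to sanity-check figures like these is to time the loop in code and convert the result into clock cycles per iteration. The sketch below is only an illustration: the function names time_volatile_loop and cycles_per_iteration are made up for this example, and it assumes the CPU actually runs at its nominal clock rate, which is often not true under load.

```cpp
#include <chrono>

// Times an empty loop whose counter is volatile, so every read and write
// of the counter must actually be performed. Returns elapsed seconds.
// (Hypothetical helper, not part of the original post.)
double time_volatile_loop(long iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (volatile long i = 0; i < iterations; i = i + 1) {
        // intentionally empty
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Converts a measured time into average clock cycles per loop iteration.
// clock_hz is the nominal clock rate; this is an assumption, since real
// CPUs change frequency, so treat the result as an estimate.
double cycles_per_iteration(double clock_hz, double seconds, double iterations) {
    return clock_hz * seconds / iterations;
}
```

With the numbers quoted above, cycles_per_iteration(3.2e9, 5.0, 1e9) evaluates to 16, i.e. roughly 16 cycles per iteration if the clock really was at 3.2 GHz for the whole run.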
With volatile, let's compare computers and compilers:
3.55 GHz Power 750:
  IBM XL 11.1: 4.37 s
  GNU g++ 4.7.2: 3.45 s (good job this time)
2.67 GHz Intel Core i7:
  GNU g++ 4.7.2: 2.44 s
  LLVM clang++ 3.1: 2.44 s
  Intel icpc 13.0: 2.43 s
I still think you need to turn up those optimization settings.
(Either way, this kind of loop is not exactly what CPUs are optimized for.)
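To illustrate why the volatile qualifier matters so much here, below is a rough sketch with two hypothetical functions, count_plain and count_volatile, that compute the same value. With optimization turned on (for example -O2 in g++ or clang++), a compiler is generally free to reduce the plain version to its result, while the volatile version has to perform every load and store. The exact code generated depends on the compiler and flags, so take this as a demonstration of the idea rather than a benchmark.

```cpp
// Counts to n with an ordinary counter. With optimization enabled,
// the compiler is allowed to replace the whole loop with its result.
long count_plain(long n) {
    long count = 0;
    for (long i = 0; i < n; ++i) {
        count = count + 1;
    }
    return count;
}

// Same loop, but the counter is volatile: each iteration must perform
// a real load and a real store, so the loop cannot be collapsed.
long count_volatile(long n) {
    volatile long count = 0;
    for (long i = 0; i < n; ++i) {
        count = count + 1;
    }
    return count;
}
```

Both functions return n for non-negative n. Timing them side by side at different optimization levels is one way to see how much of the measured time comes from the forced memory accesses rather than from the arithmetic.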