| That is actually perfectly valid for containers that don't invalidate iterators to the other elements when erase is used. The iterator is advanced before the map element is erased; erase operates on a (temporary) copy of its old value. |
|
|
| However, assuming the iterator is a complex type and not a typedef of a built-in type |
| I'm not sure why it wouldn't also be safe if that assumption is false. |
|
|
(b++ vs ++b) They aren't doing slightly different things. They are doing different things. |
|
|
|
|
|
|
| BHXSpecter wrote: |
|---|
| they could be negative (which would make both true) |
|
|
|
|
|
|
              foo()   bar()
gcc/core7     2.49s   2.49s
clang/same    2.49s   2.49s
intel/same    2.18s   2.18s
gcc/opteron   2.66s   2.66s
clang/same    2.65s   2.65s
xlc/p750      3.17s   3.17s
gcc/same      3.25s   3.29s
sun/t5000     5.69s   5.68s
a++ < some_const was inline, so LTO is not involved; but the register would still have to be released. push edx would stall waiting for push eax to finish, even before that section of code was entered. Unless the program is embarrassingly tiny and is running on a machine with oodles of registers, the code requiring the use of an extra register would be slower. That is the theory; in reality, since int is a basic type, every self-respecting compiler knows how to generate code that runs with the same efficiency for both versions.

| rapidcoder wrote: |
|---|
| Which one is faster, excluding array initialization. I'm asking only about the last for loop. Both are *identical* and yield *identical machine code*. |
|
|
bar:
sethi %hi(a),%o4
ld [%o4+%lo(a)],%o5
sra %o5,0,%o3
add %o5,1,%o5
st %o5,[%o4+%lo(a)]
sub %o3,10,%o2
retl ! Result = %o0
srlx %o2,63,%o0
bar:
lwz r3,T.25.a(RTOC)
lwz r4,0(r3)
addi r0,r4,-10
addi r5,r4,1
or r0,r4,r0
stw r5,0(r3)
rlwinm r3,r0,1,31,31
bclr BO_ALWAYS,CR0_LT
bar:
lwz 9,LC..1(2)
lwz 10,0(9)
cmpwi 7,10,9
addi 10,10,1
crnot 30,29
stw 10,0(9)
mfcr 3
rlwinm 3,3,31,1
blr
bar:
movl a(%rip), %eax
leal 1(%rax), %edx
cmpl $9, %eax
movl %edx, a(%rip)
setle %al
ret
bar:
movl $1, %esi
movl a(%rip), %ecx
xorl %eax, %eax
cmpl $10, %ecx
cmovl %esi, %eax
lea 1(%rcx), %edx
movl %edx, a(%rip)
ret
bool bar()
{
const int some_const = 10 ;
return a++ < some_const ;
} |
| Update: GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and the unsorted data - both are fast. VC++ 2010 is unable to generate conditional moves for this branch even under /Ox. Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test loop to defeat the benchmark... If you give the Intel Compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange). |