why ++i is better than i++ (when applicable)

Pages: 123
That is actually perfectly valid for containers that don't invalidate iterators to other elements when erase is used: the iterator is advanced before the map element is erased, and erase works on a (temporary) copy.


I was going to argue this and claim it was a classic example of undefined behavior (modifying the same variable twice without an intervening sequence point). However, assuming the iterator is a complex type and not a typedef of a built-in type, you're right. This is legal and "safe".

Though I still cringe.
So this is undefined?
void f(int){}
//...
int i = 0;
f(i++);
I'm having trouble understanding what part is undefined and why...
No, that is perfectly fine.

As cire said... it's probably fine with map (I think -- at least I can't see a reason why it wouldn't be, now that I've thought about it more). But it might not work on all containers, so I still don't recommend doing it.
However, assuming the iterator is a complex type and not a typedef of a built in type


I'm not sure why it wouldn't also be safe if the assumption is false. Side effects for arguments are guaranteed to be sequenced before the function body is entered, and in any case the function would still be operating on a copy; erase takes iterators by value.
I'm not sure why it wouldn't also be safe if the assumption is false.


Yeah you're right. I can see it failing to work as expected for a container like deque or vector, but not because of the undefined behavior I was thinking of.

I stand corrected.
Still, I'll get out of that habit because it is inconsistent among containers that use iterators.
It's undefined because an erase() on a vector causes it to move everything that comes after the erased element.
Erasing a map's element doesn't have this effect because a map uses a binary tree as storage instead of an underlying array.

So (Disch, you'll probably hate me) something like this should still be "valid":
 
myVector.erase(it--);

(b++ vs ++b)
They aren't doing slightly different things. They are doing different things.


Yes, but it is easy to reorder code in such a way that you change prefix to postfix without changing semantics. This can often be achieved by, e.g., adding 1 somewhere or changing a condition from > to >=, etc. But compilers aren't *that* smart yet.

e.g often you see code like this:

 
if (++a <= some_const) ....


It can be simply changed into postfix:
 
if (a++ < some_const)



which avoids pipeline stall and *is* faster.


@rapidcoder shouldn't your < and <= be swapped?

int x = 0, y = 0;
if(++x   <  1){} //false
if(  y++ <= 1){} //true 
L B, yeah, it seems that way, but you are assuming they are initializing x and y to 0, while they could be doing a mathematical figure or test where the values could be negative (which would make both true). Guess it should be noted that his example is case-based.
It is not case-based at all. He used 'a' in both cases, saying they were equivalent result-wise. The fact that x and y are 0 does not matter; I just made them the same value because otherwise the example would make no sense.
BHXSpecter wrote:
they could be negative (which would make both true)
The point of the code was to show that it is not exactly the same and that there are cases where they are not the same.
> which avoids pipeline stall and *is* faster.

Don't spout rubbish about things that you don't understand.
Pipeline stall, indeed.

With the logical error corrected (not that that makes a difference as far as performance is concerned):
extern int a ;

bool foo()
{
    constexpr int some_const = 10 ;
    return ++a <= some_const ;

    /* g++ -std=c++11 -Wall -O3 -fomit-frame-pointer -c -S
        __Z3foov:
        LFB0:
            movl	_a, %eax
            addl	$1, %eax
            cmpl	$10, %eax
            movl	%eax, _a
            setle	%al
            ret
    */
}

int bar()
{
    constexpr int some_const = 10 ;
    return a++ < some_const ;

    /* g++ -std=c++11 -Wall -O3 -fomit-frame-pointer -c -S
        __Z3barv:
        LFB1:
            movl	_a, %eax
            cmpl	$9, %eax
            leal	1(%eax), %edx
            setle	%al
            movl	%edx, _a
            movzbl	%al, %eax
            ret
    */
}


In the second code, cmpl and leal can be executed in parallel by the CPU on different pipelines. In the first code, you've got a read (cmpl) immediately after a write (addl), and cmpl has to wait for the addl to finish, which is... a pipeline stall. Which means that just those two instructions, addl and cmpl, in the first example can take 10-20 CPU cycles.

Judging performance by counting opcodes is, indeed, rubbish on your side.

BTW: A little off-topic. Which is faster, and why:

int array[10000];
for (int i = 0; i < 10000; ++i)  
  array[i] = random();

// ---- code to measure time starts here:
int count = 0;
for (int i = 0; i < 10000; ++i)
  if (array[i] < RAND_MAX / 2)
    ++count;     


int array[10000];
for (int i = 0; i < 10000; ++i)  
  array[i] = random();

std::sort(array, array + 10000);

// ---- code to measure time starts here:
int count = 0;
for (int i = 0; i < 10000; ++i)
  if (array[i] < RAND_MAX / 2)
    ++count;     


Which one is faster, excluding array initialization? I'm asking only about the last for loop. Both are *identical* and yield *identical machine code*.
Just for fun, I compiled a couple of programs: one calls foo() (that is, return ++a <= some_const ;), the other calls bar() (that is, return a++ < some_const ;). One billion times each.

After dumbing down the compiler options I usually use so that they don't do whole-program optimizations, the results were:

average of 5 runs
             foo()    bar()
gcc/core7    2.49s  2.49s
clang/same   2.49s  2.49s
intel/same   2.18s  2.18s
gcc/opteron  2.66s  2.66s
clang/same   2.65s  2.65s
xlc/p750     3.17s  3.17s
gcc/same     3.25s  3.29s
sun/t5000    5.69s  5.68s


Assembly on gcc/x86 looked just like in the post above (Intel used more CPU registers).

I'd say practice shows that there is no difference.
> In the second code cmpl and leal can be executed in parallel by the CPU on different pipelines.

Right. Because the second code requires the use of two registers instead of one.

Even assuming that these functions are not inlined, with Link time optimization (which among other things perform interprocedural register allocation), this would need two extra instructions to save and restore the register around the call.

If a++ < some_const were inlined, LTO would not be involved; but the register would still have to be released. push edx would stall waiting for push eax to finish, even before that section of code was entered. Unless the program is embarrassingly tiny, and one is running on a machine with oodles of registers, the code requiring the use of an extra register would be slower. That is theory; in reality, since int is a basic type, every self-respecting compiler would know how to generate code that runs with the same efficiency for both versions.


EDIT: Thanks, Cubbi.

Hadn't seen your post when I wrote the above, but your post proves the last point. Though, I suppose that it was something that a C++ programmer would already know; in fact almost every C++ programmer who had posted in this thread mentioned something to that effect.


EDIT2: Hadn't seen this addendum either.

> Judging performance by counting opcodes is indeed, rubbish, on your side

Judging performance taking into account register usage; no count of opcodes was ever mentioned. If my memory serves me right, with an option like -march=corei7, the count of opcodes would be the same for both versions; but the code generated for the postfix increment would still use that crucial extra register.
rapidcoder wrote:
Which one is faster, excluding array initialization. I'm asking only about the last for loop. Both are *identical* and yield *identical machine code*.
It is faster with the sorted data due to branch prediction.
http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array
PS: assembly, because I was curious, others might be as well

Sun Studio:
foo:
    sethi   %hi(a),%o4
    ld      [%o4+%lo(a)],%o5
    add     %o5,1,%o5
    st      %o5,[%o4+%lo(a)]
    sra     %o5,0,%o3
    sub     %o3,11,%o2
    retl    ! Result =  %o0
    srlx    %o2,63,%o0
IBM XLC
foo:
    lwz        r3,T.25.a(RTOC)
    lwz        r4,0(r3)
    addi       r0,r4,1
    addi       r4,r4,-10
    stw        r0,0(r3)
    or         r0,r0,r4
    rlwinm     r3,r0,1,31,31
    bclr       BO_ALWAYS,CR0_LT
GCC on ibm
foo:
    lwz 9,LC..0(2)
    lwz 10,0(9)
    addi 10,10,1
    cmpwi 7,10,10
    stw 10,0(9)
    crnot 30,29
    mfcr 3
    rlwinm 3,3,31,1
    blr
GCC on intel
foo:
        movl    a(%rip), %eax
        incl    %eax
        cmpl    $10, %eax
        movl    %eax, a(%rip)
        setle   %al
        ret
Intel:
foo:
        movl      $1, %ecx
        movl      a(%rip), %edx
        xorl      %eax, %eax
        incl      %edx 
        cmpl      $10, %edx
        movl      %edx, a(%rip)
        cmovle    %ecx, %eax
        ret
original:
bool foo()
{
    const int some_const = 10 ;
    return ++a <= some_const ;
}

bar:
    sethi   %hi(a),%o4
    ld      [%o4+%lo(a)],%o5
    sra     %o5,0,%o3
    add     %o5,1,%o5
    st      %o5,[%o4+%lo(a)]
    sub     %o3,10,%o2
    retl    ! Result =  %o0
    srlx    %o2,63,%o0

bar:
    lwz        r3,T.25.a(RTOC)
    lwz        r4,0(r3)
    addi       r0,r4,-10
    addi       r5,r4,1
    or         r0,r4,r0
    stw        r5,0(r3)
    rlwinm     r3,r0,1,31,31
    bclr       BO_ALWAYS,CR0_LT

bar:
    lwz 9,LC..1(2)
    lwz 10,0(9)
    cmpwi 7,10,9
    addi 10,10,1
    crnot 30,29
    stw 10,0(9)
    mfcr 3
    rlwinm 3,3,31,1
    blr

bar:
        movl    a(%rip), %eax
        leal    1(%rax), %edx
        cmpl    $9, %eax
        movl    %edx, a(%rip)
        setle   %al
        ret

bar:
       movl      $1, %esi
       movl      a(%rip), %ecx 
       xorl      %eax, %eax
       cmpl      $10, %ecx
       cmovl     %esi, %eax
       lea       1(%rcx), %edx 
       movl      %edx, a(%rip)
       ret 

bool bar()
{
    const int some_const = 10 ;
    return a++ < some_const ;
}
@JLBorges

Judging performance by the number of registers used is also wrong. As you can see, both codes execute equally fast. If the machine code uses only one register, it doesn't mean the CPU will use only one register. CPUs do internal register renaming in order to avoid pipeline stalls. Internally, x86 has 76 or more registers, even in 32-bit mode. You have no control over it at all, even when coding in assembly (funny times, when even assembly coders do not have full control over the hardware). So for simple code I'm not very surprised they executed equally, because CPUs are getting smarter and smarter. There are also bypasses, so the preceding instruction need not execute fully in order to start executing the next dependent instruction.

I recall on a different forum they did an experiment with complex expressions where lots of post/pre increments were used and preincrements turned out to be slower. But that was a few years ago on a different hardware.

The example with branch prediction shows that the times when you could safely guess "instruction A is always slower than instruction B" are over. Now, everything depends on the context.
> http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array

Thanks, LB, for that link. This was illuminating, particularly the bit about ICC performing a loop-interchange.

Update :

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

If you give the Intel Compiler the branchless code, it just out-right vectorizes it... and is just as fast as with the branch (with the loop interchange).


> Judging performance by number of used registers is also wrong. As you can see, both codes execute equally fast. ...

Yes, you are absolutely right on that count.

That for native types, both would execute equally fast was the commonly held view among the C++ programmers who posted in this thread.

And then you came up with the categorical assertion that postfix increment 'avoids pipeline stall and *is* faster.'
Which is what started this debate.


> The example with branch prediction shows that the times when
> you could safely guess "instruction A is always slower than instruction B"
> are over. Now, everything depends on the context.

AFAIK, those days were over a very long time ago, ever since pipelining, CPU caches, translation lookaside buffers, etc. made their appearance. For example, John Lakos in his 1996 book talks about it at some length while discussing the performance implications of inline functions.


