Performance question (vector.size() in loop condition)

Hello People,

In performance terms, ¿What is the correct/best way of doing loop control?

for ( int i = 0; i < v.size(); i++ ) {}

or

1
2
limit = v.size();
for ( int i = 0; i < limit; i++ ) {}


Thanks
If elements are not inserted into or erased from the vector in the body of the loop, there shouldn't be any measurable difference in run-time performance.

Using a range-based loop would perhaps be the most programmer efficient.

https://godbolt.org/g/h98ZRq
Thanks JLBorges!

Summarizing: The compiler really realizes if the loop condition is modified in the loop. Just in that case, it evaluates the size in every iteration.

Yet, the best way of doing this is using rang-based loop.
++I is supposedly faster than I++
and I have no clue if an iterator performs any better than an integer.

A debug or other not fully optimized build might show the function call for size taking longer. It won't on a fully optimized build. Just in case you ever checked it and saw a difference, look to your settings before looking to tweak the code.

The second version is most likely faster because in the first case an indirection and a function call is involved. I don't think that all could be otpimized away.

When these differences become signifianct i.e. measurable is another question.

The compiler really realizes if the loop condition is modified in the loop.
I doubt that since it is not clear where (in what function) the modification is done. Thus the compiler cannot [easily] assume that v.size() doesn't change.
> if an iterator performs any better than an integer.

In general, it would be on par with array indexing.


> in the first case an indirection and a function call is involved. I don't think that all could be otpimized away.

Compilers have been optimising that all away for about two decades or so.
For proof, study the assembly listing.


> I doubt that since it is not clear where (in what function) the modification is done.
> Thus the compiler cannot [easily] assume that v.size() doesn't change.

Yes, code that is not const-correct tends to be not only more error-prone and brittle, but also less efficient.


There is no measurable difference in performance between any of these
(run the test a few times and look at the results):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#include <vector>

void unknown_function( const std::vector<int>& ) ;

std::size_t count0_index_1( std::vector<int>& vec )
{
    std::size_t n = 0 ;
    for( std::size_t i = 0 ; i < vec.size() ; ++i )
    {
        unknown_function(vec) ;
        if( vec[i] == 0 ) ++n ;
    }

    return n ;
}

std::size_t count0_index_2( std::vector<int>& vec )
{
    std::size_t n = 0 ;
    std::size_t sz = vec.size() ;
    for( std::size_t i = 0 ; i < sz ; ++i )
    {
        unknown_function(vec) ;
        if( vec[i] == 0 ) ++n ;
    }
    return n ;
}

std::size_t count0_iterator_1( std::vector<int>& vec )
{
    std::size_t n = 0 ;
    for( const auto& v : vec )
    {
        unknown_function(vec) ;
        if( v == 0 ) ++n ;
    }
    return n ;
}

std::size_t count0_iterator_2( std::vector<int>& vec )
{
    std::size_t n = 0 ;
    for( auto iter = std::begin(vec) ; iter != std::end(vec) ; ++iter )
    {
        unknown_function(vec) ;
        if( *iter == 0 ) ++n ;
    }
    return n ;
}


std::size_t count0_iterator_3( std::vector<int>& vec )
{
    std::size_t n = 0 ;
    const auto end = std::end(vec) ;
    for( auto iter = std::begin(vec) ; iter != end ; ++iter )
    {
        unknown_function(vec) ;
        if( *iter == 0 ) ++n ;
    }
    return n ;
}

http://coliru.stacked-crooked.com/a/40a3ace72979b475

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
#include <iostream>
#include <vector>
#include <algorithm>
#include <ctime>

namespace { int call_cnt = 0 ; int r = 0 ; }
void unknown_function( const std::vector<int>& ) { ++call_cnt ; }

template < typename FN > void time_it( std::vector<int>& vec, FN fn )
{
    call_cnt = r = 0 ;
    const auto start = std::clock() ;
    r += fn(vec) ;
    const auto end = std::clock() ;
    std::cout << "call_cnt: " << call_cnt << "  r: " << r << "  milliseconds: "
              << (end-start) * 1000.0 / CLOCKS_PER_SEC << '\n' ;
}

int main()
{
    const std::size_t N = 128'000'000 ;
    std::vector<int> vec(N) ;
    for( std::size_t i = 0 ; i < N ; ++i ) vec[i] = i%5 ;

    std::size_t count0_index_1( std::vector<int>& vec ) ;
    std::cout << "\ncount0_index_1\n" ;
    time_it( vec, count0_index_1 ) ;

    std::size_t count0_index_2( std::vector<int>& vec ) ;
    std::cout << "\ncount0_index_2\n" ;
    time_it( vec, count0_index_2 ) ;

    std::size_t count0_iterator_1( std::vector<int>& vec ) ;
    std::cout << "\ncount0_iterator_1\n" ;
    time_it( vec, count0_iterator_1 ) ;

    std::size_t count0_iterator_2( std::vector<int>& vec ) ;
    std::cout << "\ncount0_iterator_2\n" ;
    time_it( vec, count0_iterator_2 ) ;

    std::size_t count0_iterator_3( std::vector<int>& vec ) ;
    std::cout << "\ncount0_iterator_3\n" ;
    time_it( vec, count0_iterator_3 ) ;
}

clang++ -O3
count0_index_1
call_cnt: 128000000  r: 25600000  milliseconds: 310.719

count0_index_2
call_cnt: 128000000  r: 25600000  milliseconds: 302.676

count0_iterator_1
call_cnt: 128000000  r: 25600000  milliseconds: 259.181

count0_iterator_2
call_cnt: 128000000  r: 25600000  milliseconds: 259.855

count0_iterator_3
call_cnt: 128000000  r: 25600000  milliseconds: 290.88

http://coliru.stacked-crooked.com/a/2aab35983f555a53

g++ -O3
count0_index_1
call_cnt: 128000000  r: 25600000  milliseconds: 299.913

count0_index_2
call_cnt: 128000000  r: 25600000  milliseconds: 310.956

count0_iterator_1
call_cnt: 128000000  r: 25600000  milliseconds: 299.72

count0_iterator_2
call_cnt: 128000000  r: 25600000  milliseconds: 294.523

count0_iterator_3
call_cnt: 128000000  r: 25600000  milliseconds: 263.103

http://coliru.stacked-crooked.com/a/ad713f2134676b12
coder777 wrote:
it is not clear where (in what function) the modification is done

It is clear in the example posted here (which is eliminated by compilers entirely since it does nothing) and in the less-trivial example in JLBorges's godbolt link (which emits identical assembly between the two variants, and better code in the range-for variant).

but yes, if inside the loop, the vector is passed by non-const reference to a function defined in another file, it will have to follow that function call with with the memory load of the two (adjacent, so it's cheap) pointers from the vector's impl in case the function resized/reallocated it.

edit: since JLBorges ninja'd this comment; that memory load of the two pointers is visible there in count0_index_1 right after the call to unknown_function. I bet it's a cache hit in the new demo.
Last edited on
Can I conclude now that these two loops have the same performance?

Meaning, this

for ( int i = 0; i < v.size(); i++ ) {}

or this

1
2
limit = v.size();
for ( int i = 0; i < limit; i++ ) {}


For those who manage assembler code, I have written the following code in order to
verify the difference between them. If that is possible.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <vector>
#include <algorithm>

void foo( std::vector<int> v1, std::vector<int> v2 )
{
  for( std::size_t i = 0 ; i < v1.size() ; ++i ) {
  	v1.pop_back();
	v2[i] = 0;
  }
}

void foo1( std::vector<int> v1, std::vector<int> v2 )
{
  for( std::size_t i = 0 ; i < v1.size() ; ++i ) {
    	v1[i] = 0;
  	v2.pop_back();
  }
}



The link is: https://godbolt.org/g/tiSyfs

Please if anybody can tell me "look, this is what make the two codes different in assembler"
I would thank a lot.

On the other hand, and the most important fact;
If I make reference to the tests elaborated by JLBorges, it seems to be that iterators are the best.
Is that the case?

Thanks!
Last edited on
Please if anybody can tell me "look, this is what make the two codes different in assembler"
I would thank a lot.


This differs between compilers in interesting ways. Between clang and gcc, the first loop was handled similarly, but for the second loop, clang left unnecessary memory accesses, while gcc not only avoided that but actually figured out how to eliminate the recalculation of the vector size in the loop!

Clang ( https://godbolt.org/g/Xr31Ba )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
  for( std::size_t i = 0 ; i < v1.size() ; ++i ) {
    v1.pop_back();
    v2[i] = 0;
  }
# on entry,
# v1's base pointer is in r8
# v1's end pointer is in rcx
# v2's base pointer is in rdx
# the loop counter is in rsi
.LBB0_2:
        addq    $-4, %rcx         # shrink v1 by sizeof(int) (v1.pop_back();)
        movl    $0, (%rdx,%rsi,4) # write 0 to rdx[rsi] (v2[i] = 0)
        incq    %rsi              # increment loop counter (++i)
        movq    %rcx, %rax        # put a copy of v1 end pointer to rax
        subq    %r8, %rax         # subtract from base pointer (this is v1.size())
        sarq    $2, %rax          # and divide by 4 - this is the second half of v1.size()
        cmpq    %rax, %rsi        # compare size to loop counter
        jb      .LBB0_2           # loop 


loop #2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
  for( std::size_t i = 0 ; i < v1.size() ; ++i ) {
    v1[i] = 0;
    v2.pop_back();
  }
# on entry,
# v1 sits in (rdi), v2 sits in (rsi)
# v1's base pointer is in r8
# v1's end pointer is in
# v2's end pointer is in rcx
# the loop counter is in rdx
.LBB1_2:
        movl    $0, (%r8,%rdx,4) # write 0 to r8[rdx] (v1[i]=0)
        movq    %rcx, 8(%rsi)    # write current v2's end pointer to memory
        incq    %rdx             # increment loop counter (++i)
        movq    8(%rdi), %rax    # load v1's end pointer from memory to rax
        subq    %r8, %rax        # subtract from v1's start pointer (v1.size())
        sarq    $2, %rax         # divide by 4 (seocnd half of v1.size())
        addq    $-4, %rcx        # shrink v2's size by one (v2.pop_back())
        cmpq    %rax, %rdx       # compare size with loop counter
        jb      .LBB1_2          # loop 

That was really dumb to touch memory and recalculate size here, on clang's part. I wonder if the optimizer was fooled by the presence of two vectors into thinking the two arrays in them might alias (that would be the case if both were passed by reference)

gcc ( https://godbolt.org/g/ccqsVg )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
  for( std::size_t i = 0 ; i < v1.size() ; ++i ) {
    v1.pop_back();
    v2[i] = 0;
  }
# in entry:
# v1's base is r8
# v1's end is rax
# v2's base is r9
# loop counter is rdx
# one pop_back is executed before the loop is entered
.L3:
        movq    %rax, %rcx       # copy v1's end for size() later
        movl    $0, (%r9,%rdx,4) # write zero to v2[i]
        addq    $1, %rdx         # increment loop counter
        subq    %r8, %rcx        # v1's end minus base (first half of v1.size())
        movq    %rax, %rsi       # ?? rsi is only used after the loop ends.
        subq    $4, %rax         # v1.pop_back()
        sarq    $2, %rcx         # second half of v1.size()
        cmpq    %rcx, %rdx       # compare loop counter with size
        jb      .L3              # loop 


loop #2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
  for( std::size_t i = 0 ; i < v1.size() ; ++i ) {
    v1[i] = 0;
    v2.pop_back();
  }
# on entry,
# v1's base is in rcx
# v1's size is precomputed and stored in rdx (!!!)
# the loop counter is in rax
.L9:
        movl    $0, (%rcx,%rax,4) # write 0 to rcx[rax] (v1[i] = 0)
        addq    $1, %rax          # increment loop counter
        cmpq    %rdx, %rax        # compare with size
        jne     .L9               # loop
# the size of v2 is shrunk after the loop ends, using the value of the loop counter 


Good job gcc, though an ideal compiler should have realized the functions do nothing and removed all code - those vectors are passed by value.
Last edited on
if you find that your compiler does actually charge more for the size() call in the loop, and choose to bypass that, do it with a const.

const int ltmp = v.size();
for(i = 0; i < ltmp; ++i)

Ive seen the const help the assembler generation and performance. Compilers are smarter now, though, and more often than not, you get the same thing for a variety of minor changes to the code.



That is a very good point jonnin.

Dear Cubbi, what an analysis!!! Great, excellent!
You confirmed that gcc is smart enough as to pull the size() computation out of the loop!!!
(when it is constant)

Thank you very much for the notes.

Yet I must insist, it seems range-based loops are the best option as JLBorges shown in his tests.

What do you think?
Last edited on
Thanks for the useful links Cubbi
Topic archived. No new replies allowed.