Efficiency

The following code snippets should be a good representation of my situation.
Which of the following would be more efficient:

1
2
3
4
5
int sum=0;
for(int i=0; i<large_num; ++i) {
	sum+=int_from_calculation;
	sum%=1007000000;
}


versus:

1
2
3
4
5
6
7
int sum=0;
for(int i=0; i<large_num; ++i) {
	sum+=int_from_calculation;
	if(sum>=1007000000) {
		sum-=1007000000; //LB's Optimisation 
	}
}

The 1007000000 is usually just some large number.
Last edited on
I don't actually think there's a difference, unless the compiler isn't very good at optimizing. A better solution would be to use the second with -= instead of %= but again I don't know whether this is better than just letting the compiler optimize it. Have you tried some benchmarks?
I think that the first one would be more efficient. Modulus is not an expensive operation. Why not see what the assembly looks like, and do a test using gprof?
Why not see what the assembly looks like

Well I would imagine that the jump would cause a loss of cached instructions, however I think that a modulus is quite an expensive operation (from what I remember from my assembly book, it does a divide and leaves the modulus in edx, with the divide being much slower than an conditional branch).
Seeing the assembler doesn't always help. Why not time them? That's the only way to know.
The 1007000000 is usually just some large number.


Then the two are not necessarily equivalent. For example, assuming 32-bit integers, suppose during one of the iterations when you reach line 4 that sum is 2^31. Then we have line 5 yielding

sum =2280967296 - 1007000000 = 1140483648

which is bigger than 1007000000, whereas the modulus will always be less than 1007000000.
Good point, I didn't think of that. It can be fixed by changing the if to a while loop but at this point you seriously have to consider which is more efficient.
@Alrededor int_from_calculation is always less than the modulo number
If you do time the alternatives, then you could time a ternary version too:
1
2
3
4
5
const Cap = 1007000...
...
add = int_from_calc(); // assert( add < Cap )
room = Cap - sum;
sum = ( add < room ) ? add + sum : add - room;
How would I time the programs?
It depends on the OS.
Windows
http://msdn.microsoft.com/en-us/library/windows/desktop/ms644904%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms644905%28v=vs.85%29.aspx

Can QueryPerformanceCounter at the start and stop timer positions. Divide the difference with the value obtained from QueryPerformanceFrequency.

I've written some detailed stuff with code, but I can't find it.
> Which of the following would be more efficient:

Depends on the values of large_num, the minimum, maximum and expected average of the value of int_from_calculation, and the value of the divisor (1007000000 in the example). This obviously assumes that the time taken to evaluate int_from_calculation does not dominate the time taken by the loop; if it does, it doesn't matter how you write the loop.

Profile your code if you are not sure; don't try guesswork. Performing tail-splitting based on the actual run-time value of the variables (to get the most optimal version executed) may be worthwhile if large_num is very large, the operation is time-critical and the code is executed extremely frequently.

For instance (assuming positive values for the sake of brevity):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
#include <cstdlib>
#include <limits>

// non-const with external linkage, to disable constant folding

extern int large_num ; // invariant: large_num > 0

extern int divisor ; // 1007000000 // invariant: divisor > 0

extern int ubound ; // upper bound on value of int_from_calculation
                     // invariant: ubound > 0
                     // invariant: ubound < ( std::numeric_limits<int>::max() - divisor )

// invariant: lower bound on value of int_from_calculation > 0


// optimal if expected average value of int_from_calculation is
// comparable to std::numeric_limits<int>::max() / 2
int foo()
{
    int sum = 0 ;

    for( int i=0 ; i<large_num ; ++i )
    {
        sum += std::rand() % ubound ;
        sum %= divisor;
    }

    return sum ;

    /*
    GCC 4.9 i386 with -std=c++11 -pedantic-errors -Wall -Wextra -fomit-frame-pointer -c -S
        __Z3foov:
            subl	$28, %esp
            movl	$0, 12(%esp)
            movl	$0, 8(%esp)
            jmp	L4

        ; *** loop starts here ***
        L5:
            call	_rand
            movl	_ubound, %ecx
            cltd
            idivl	%ecx ; *** division each time through the loop
            movl	%edx, %eax
            addl	%eax, 12(%esp)
            movl	_divisor, %ecx
            movl	12(%esp), %eax
            cltd
            idivl	%ecx
            movl	%edx, 12(%esp)
            addl	$1, 8(%esp)
        L4:
            movl	_large_num, %eax
            cmpl	%eax, 8(%esp)
            jl	L5
        ; *** and ends here ***

            movl	12(%esp), %eax
            addl	$28, %esp
            ret
	*/
}

// optimal if expected average value of int_from_calculation is
// much smaller than std::numeric_limits<int>::max() / 2
int bar() // ( a mod c + b mod c ) == (a+b) mod c, verifirs that a+b does not result in UB
{
    int sum = 0 ;
    const int maxv = std::numeric_limits<int>::max() - ubound ;

    for( int i=0 ; i<large_num ; ++i )
    {
        sum += std::rand() % ubound ;
        if( sum > maxv ) sum %= divisor ;
        // or: if( sum > maxv ) sum -= divisor ;
        // %= is more expensive, -= will execute more often
    }

    return sum ;
    /*
    GCC 4.9 i386 with -std=c++11 -pedantic-errors -Wall -Wextra -fomit-frame-pointer -c -S
        __Z3barv:
            subl	$28, %esp
            movl	$0, 12(%esp)
            call	__ZNSt14numeric_limitsIiE3maxEv
            movl	_ubound, %edx
            subl	%edx, %eax
            movl	%eax, 4(%esp)
            movl	$0, 8(%esp)
            jmp	L8

        ; *** loop starts here ***
        L10:
            call	_rand
            movl	_ubound, %ecx
            cltd
            idivl	%ecx
            movl	%edx, %eax
            addl	%eax, 12(%esp)
            movl	12(%esp), %eax
            cmpl	4(%esp), %eax ; *** comparison each time through the loop
            jle	L9

            ; *** division only if unavoidable
                movl	_divisor, %ecx
                movl	12(%esp), %eax
                cltd
                idivl	%ecx
                movl	%edx, 12(%esp)

        L9:
            addl	$1, 8(%esp)
        L8:
            movl	_large_num, %eax
            cmpl	%eax, 8(%esp)
            jl	L10
        ; *** and ends here ***

            movl	12(%esp), %eax
            addl	$28, %esp
            ret
    */
}

// optimal if the invariant:
// std::numeric_limits<int>::max() > (large_num*ubound) holds
int baz() // ( a mod c + b mod c ) == (a+b) mod c, a+b does not result in UB
{
    int sum = 0 ;

    for( int i=0 ; i<large_num ; ++i )
        sum += std::rand() % ubound ;

    return sum % divisor ;
    /*
    GCC 4.9 i386 with -std=c++11 -pedantic-errors -Wall -Wextra -fomit-frame-pointer -c -S
        __Z3bazv:
            subl	$28, %esp
            movl	$0, 12(%esp)
            movl	$0, 8(%esp)
            jmp	L13

        ; *** loop starts here ***
        L14:
            call	_rand
            movl	_ubound, %ecx
            cltd
            idivl	%ecx
            movl	%edx, %eax
            addl	%eax, 12(%esp)
            addl	$1, 8(%esp)
        L13:
            movl	_large_num, %eax
            cmpl	%eax, 8(%esp)
            jl	L14
        ; *** and ends here ***

            movl	_divisor, %ecx
            movl	12(%esp), %eax
            cltd
            idivl	%ecx ; *** division only once, at the end
            movl	%edx, %eax
            addl	$28, %esp
            ret
	*/
}
Last edited on
@kbw
Thank you for the links, and I have found an thread that explains the use of it sufficiently, so if you cannot find the code it is okay.
http://stackoverflow.com/questions/1739259/how-to-use-queryperformancecounter

@JLBorges
This obviously assumes that the time taken to evaluate int_from_calculation does not dominate the time taken by the loop

For argument's sake, assume that the time is negligent.

Profile your code if you are not sure; don't try guesswork

Would that be timing it and looking at the resulting asm code?

tail-splitting

What is this? Google only gave me results of "split-tail" which I highly doubt is what you mean.

assuming positive values for the sake of brevity

values will always be positive.

Also thank you for the code and asm. I will definitely need to brush up on a few asm instructions and then I will spend some time reading through it. Thank you again.

> Profile your code if you are not sure; don't try guesswork
>> Would that be timing it and looking at the resulting asm code?

Timing it.


> tail-splitting
>> What is this?

Split the tail of the function (containing the loop). Something like

1
2
3
4
5
6
7
8
9
10
int foobar()
{
      examine the run-time values of the variables involved

      if( baz() would be correct and optimal ) return baz() ;

      else if( bar() would be optimal ) return bar() ;

      else return foo() ;
}



> I will definitely need to brush up on a few asm instructions
> and then I will spend some time reading through it.

It is a good idea to gain some understanding of assembly code. But you don't really need to do that in this particular case. The assembly code was there just to illustrate that at compile-time, the compiler does not have all the information that is required to completely optimize certain kinds of code. Of course, profile-guided optimization would help, and you should use it (along with representative profiler data) when appropriate.

For the above code, it is adequate if you understand that:

a. You have to ensure that sum += int_from_calculation ; does not result in integer overflow
b. In foo(), the modulo operation is performed large_num times.
c. In bar(), a check (comparison of integers) is performed large_num times, but the expensive modulo operation is performed much less often.
c. In baz(), no check is made in the loop and the modulo operation is performed just once, at the end.
Subtraction is almost always faster than modulus. Modulus needs division, division *is* slow on most processors, much slower than addition/subtraction. Bah, I guess it's even slower than multiplication.

The interesting part of the equation is that if condition. The branch is easy to predict and if int_from_calculation is small, most of the time it won't be taken. Therefore most of its invocations are almost free (within one cycle). Even in case of misprediction, the speculative execution would still hide most of latency... Comparisons and conditional jumps are very cheap on modern processors, which can prefetch and execute *both* branches simultaneously on two parallel pipelines and just cut one off after the result of condition is known.

Therefore I guess the version with subtraction would be faster in almost all situations.
Last edited on
> Subtraction is almost always faster than modulus. Modulus needs division,
> division *is* slow on most processors, much slower than addition/subtraction.

Repeat: it depends on the run-time values of the variables involved.

For instance, if divisor is 1000000, sum is std::numeric_limits<int>::max() - 2000000 and the upper bound of int_from_calculation is std::numeric_limits<int>::max() - 1000000, modulo divion would be much faster than brute repeated subtraction in a loop.

Repeat: Don't rely on guesswork. Get an estimate the typical run-time values, profile to determine the value of n at which n comparisons and subtractions are more expensive than a modulus operation, and write the code accordingly.
Topic archived. No new replies allowed.