Why is `float` so much faster than `int` in this program?

This simple program copies data from the array `xy` to `arr` and prints the time it takes:
    #include <stdio.h>
    #include <time.h>
    
    class Point
    {
    public:
    	float x, y;
    
    	Point() :x(0), y(0) {}
    	Point(float x, float y) : x(x), y(y) {}
    };
    
    int main()
    {
    	const size_t n = 99999999;
    	Point *arr = new Point[n];
    
    	float *xy = new float[2 * n];
    	for (size_t i = 0; i < 2 * n; i++)
    		xy[i] = (float)i;
    
    	clock_t start = clock();
    
    	for (size_t i = 0, j = 0; i < n; i++)
    	{
    		float x = xy[j++];
    		float y = xy[j++];
    		arr[i] = Point(x, y);
    	}
    
    	clock_t end = clock();
    
    	printf("time: %ld\n", (long)(end - start)); // clock_t is not int; cast to match %ld
    
    	delete[] arr;
    	delete[] xy;
    
    	return 0;
    }


It takes about 250 ms. If I change everything to `int`:
    #include <stdio.h>
    #include <time.h>
    
    class Point
    {
    public:
    	int x, y;
    
    	Point() :x(0), y(0) {}
    	Point(int x, int y) : x(x), y(y) {}
    };
    
    int main()
    {
    	const size_t n = 99999999;
    	Point *arr = new Point[n];
    
    	int *xy = new int[2 * n];
    	for (size_t i = 0; i < 2 * n; i++)
    		xy[i] = (int)i;
    
    	clock_t start = clock();
    
    	for (size_t i = 0, j = 0; i < n; i++)
    	{
    		int x = xy[j++];
    		int y = xy[j++];
    		arr[i] = Point(x, y);
    	}
    
    	clock_t end = clock();
    
    	printf("time: %ld\n", (long)(end - start)); // clock_t is not int; cast to match %ld
    
    	delete[] arr;
    	delete[] xy;
    
    	return 0;
    }


It takes almost three times as long, around 700 ms. Why is there a difference? I expected the times to be the same, because both types are 32 bits wide and no arithmetic operations are involved.
I am using Visual Studio 2015, Configuration: Release x64. Processor: Intel i7-4510U.
If you take your two sets of code and put them into godbolt.org, you can see the differences in the generated assembly (although with a different compiler).

The first difference that jumped out at me was the use of different move instructions for the float (`movss`) and the int (`mov`).

http://x86.renejeschke.de/ gives some information about different instructions.

Be sure to try different optimisation settings (-O0 to -O3 in the compiler options box).

For bonus points, you could identify the difference in assembly being generated on your own machine; godbolt makes it easier to see the assembly and match it up to the source code, but it's not magic and you can do the same yourself.

That might give you some insights, although there is a bit of research to do on your part, but don't get discouraged; understanding some of the assembly generated is a really handy skill. I spent a few hours just yesterday debugging a crash by reading the assembly leading up to the crash.
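
One way to make the comparison easier, offered as a sketch rather than the only approach: move the timed loop into a function of its own, so that its assembly is not mixed up with the setup and timing code in `main`. The name `copy_points` and the templated `Point` are assumptions made for this illustration; the loop body is the one from the question.

```cpp
#include <cstddef>

// Same members as the Point class in the question, with the element type
// turned into a template parameter so both variants can be compared.
template <typename T>
struct Point
{
    T x, y;

    Point() : x(0), y(0) {}
    Point(T x, T y) : x(x), y(y) {}
};

// The timed loop from the question, moved into its own function.
// copy_points is a hypothetical name, not part of the original program.
template <typename T>
void copy_points(const T* xy, Point<T>* arr, std::size_t n)
{
    for (std::size_t i = 0, j = 0; i < n; i++)
    {
        const T x = xy[j++];
        const T y = xy[j++];
        arr[i] = Point<T>(x, y);
    }
}

// Explicit instantiations make the compiler emit both versions,
// so their assembly can be read side by side.
template void copy_points<float>(const float*, Point<float>*, std::size_t);
template void copy_points<int>(const int*, Point<int>*, std::size_t);
```

Pasting this into godbolt.org, or compiling it locally with optimisation on and an assembly listing requested (for example `cl /O2 /FA` with the Microsoft compiler, or `g++ -O2 -S`), should give two short functions whose inner loops can be compared directly.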
It would appear that the Microsoft compiler vectorises the floating point moves, but does not vectorise the integer moves.

Note: With a single instruction, we can move four adjacent 32-bit values into a 128-bit register (say, xmm0), and then move those 128 bits to four adjacent 32-bit locations with another single instruction. Both int and float are TriviallyCopyable 32-bit types on the platform in question, and the same set of machine instructions can be used for both.
See: https://godbolt.org/g/4npW0V
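
To make the note above concrete, here is a minimal sketch of such a 128-bit move written by hand with SSE2 intrinsics. It is only an illustration of the idea, not what any particular compiler emits. The names `copy16`, `copy_four` and `HAVE_SSE2` are hypothetical, and a `memcpy` fallback is included on the assumption that the code might be built for a target without SSE2.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // SSE2 intrinsics
#define HAVE_SSE2 1      // hypothetical macro, used only in this sketch
#endif

// Copies 16 bytes (four 32-bit values) from src to dst.
// copy16 is a hypothetical helper name used only for this illustration.
inline void copy16(void* dst, const void* src)
{
#ifdef HAVE_SSE2
    // One unaligned 128-bit load into an xmm register, then one 128-bit store.
    const __m128i v = _mm_loadu_si128(static_cast<const __m128i*>(src));
    _mm_storeu_si128(static_cast<__m128i*>(dst), v);
#else
    // Portable fallback when SSE2 is not available.
    std::memcpy(dst, src, 16);
#endif
}

// The same routine serves float and int: only the bits are moved,
// no arithmetic is done on the values.
template <typename T>
void copy_four(T* dst, const T* src)
{
    static_assert(sizeof(T) == 4, "expects a 32-bit element type");
    copy16(dst, src);
}
```

Calling `copy_four` on an array of four `float` and on an array of four `int` goes through exactly the same load and store, which is why the element type should not matter once the copy is vectorised.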

When the copy is vectorised, both int and float would take the same amount of time; TriviallyCopyable 64-bit types would take roughly twice the time and so on.
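
Put differently, the loop only moves bytes, so its cost should depend on how many bytes there are and not on what they mean. The sketch below expresses the whole loop as one `memcpy`, under the assumption (checked by the `static_assert`s) that `Point<T>` is trivially copyable and holds `x` and `y` back to back without padding. `bulk_copy` is a hypothetical name, and this is not a claim about what any compiler actually generates for the original loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

template <typename T>
struct Point
{
    T x, y;

    Point() : x(0), y(0) {}
    Point(T x, T y) : x(x), y(y) {}
};

// Copies pairs of values from xy into arr as raw bytes.
// bulk_copy is a hypothetical helper, used only for this illustration.
template <typename T>
void bulk_copy(const std::vector<T>& xy, std::vector<Point<T>>& arr)
{
    static_assert(std::is_trivially_copyable<Point<T>>::value,
                  "a byte-wise copy is only valid for trivially copyable types");
    static_assert(sizeof(Point<T>) == 2 * sizeof(T),
                  "assumes x and y are adjacent, with no padding");

    // Number of complete points that fit in both containers.
    const std::size_t n = std::min(xy.size() / 2, arr.size());
    if (n == 0) return;

    // The number of bytes moved is n * sizeof(Point<T>): the same for
    // float and int, and twice as many for 64-bit element types.
    std::memcpy(arr.data(), xy.data(), n * sizeof(Point<T>));
}
```

For example, with `n` points the call moves `8 * n` bytes for `float` or `int` and `16 * n` bytes for `double` or `long long`, which matches the roughly 2x ratio described above.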

#include <iostream>
#include <vector>
#include <ctime>

template < typename T > class Point
{
public:
    T x, y;

    Point() :x(0), y(0) {}
    Point( T x, T y ) : x(x), y(y) {}
};

template < typename T > void time_it( std::size_t n )
{
    std::vector< Point<T> > arr(n) ;

    std::vector<T> xy(2*n) ;
    for( std::size_t i = 0 ; i < 2*n ; ++i ) xy[i] = (T)i ;

    const auto start = std::clock() ;

    for (size_t i = 0, j = 0; i < n; i++)
    {
        const T x = xy[j++];
        const T y = xy[j++];
        arr[i] = Point<T>(x, y);
    }

    const auto end = std::clock() ;

    std::cout << (end-start) * 1000.0 / CLOCKS_PER_SEC << " millisecs.\n" ;
}

int main()
{
    const size_t n = 9'999'999;

    std::cout << "       float: " ;
    time_it<float>(n) ;

    std::cout << "         int: " ;
    time_it<int>(n) ;

    std::cout << "unsigned int: " ;
    time_it<unsigned int>(n) ;

    std::cout << "\n      double: " ;
    time_it<double>(n) ;

    std::cout << " std::size_t: " ;
    time_it<std::size_t>(n) ;

    std::cout << "   long long: " ;
    time_it<long long>(n) ;
}

clang++
       float: 25.469 millisecs.
         int: 24.453 millisecs.
unsigned int: 24.224 millisecs.

      double: 44.985 millisecs.
 std::size_t: 44.815 millisecs.
   long long: 44.88 millisecs.

g++
       float: 22.427 millisecs.
         int: 22.549 millisecs.
unsigned int: 22.619 millisecs.

      double: 45.922 millisecs.
 std::size_t: 50.013 millisecs.
   long long: 50.109 millisecs

http://coliru.stacked-crooked.com/a/0038cfc5cfbc024c

The non-vectorised copy (32 bits at a time for 32-bit types) takes longer
than the vectorised copy (128 bits, that is, four values, in one go).
Microsoft:
       float: 25 millisecs.
         int: 97 millisecs.
unsigned int: 116 millisecs.

      double: 48 millisecs.
 std::size_t: 121 millisecs.
   long long: 121 millisecs.

http://rextester.com/HIFR26998
Optimisers tend to have a deep understanding of the standard C++ library (one of the reasons why well-written C++ code tends to be faster than equivalent C code). Use the standard library (std::pair) to implement Point, and the Microsoft compiler vectorises everything that is vectorisable.

/*
template < typename T > class Point
{
public:
    T x, y;

    Point() :x(0), y(0) {}
    Point( T x, T y ) : x(x), y(y) {}
};
*/

#include <utility> // std::pair

template < typename T > struct Point : std::pair<T,T>
{
    using base = std::pair<T,T> ;
    using base::base ;

    T& x() { return base::first ; }
    T& y() { return base::second ; }

    const T& x() const { return base::first ; }
    const T& y() const { return base::second ; }
};

Microsoft (R) C/C++ Optimizing Compiler Version 19.00.23506 for x64
       float: 29 millisecs.
         int: 30 millisecs.
unsigned int: 29 millisecs.

      double: 51 millisecs.
 std::size_t: 51 millisecs.
   long long: 52 millisecs.

http://rextester.com/BMMV79404