memcpy

Hello,

While trying to do some micro-optimization of my code, I came across the memcpy function. Since part of my code involves copying "large" arrays of ints, and none of the copies could be combined into a single loop, I thought memcpy would provide some speed gain. Oddly, when comparing memcpy against a for loop over 10 million ints, the for loop came out faster.

This is the test code (a and b are int[icount]; a is filled with random integers):
1
2
3
4
	for (int j = 0; j < 50; ++j) {
		memcpy(b, a, icount*sizeof(int));
	//	for (int i = 0; i < icount; ++i) b[i] = a[i];
	}

Results: memcpy 0.67 sec vs. for loop 0.59 sec.

I was expecting them to be very close to each other [in release mode], at the most being equal due to the compiler realizing that it could replace the for copy by the same statements as memcpy. I definitely did not expect the for to be faster.

So why is this? Am I using memcpy wrong, or is there something else I'm missing?
It's possible that the compiler optimizes the for() by unrolling it.

http://en.wikipedia.org/wiki/Loop_unwinding

I'd suggest you stick to memcpy() for clarity, but that's my personal take.
Even then, why would it be faster than memcpy?
Unrolling removes or reduces the overhead of the for loop, reducing the work to "take 4 bytes from point A and put them at point B" repeated n times, which should still be slower than "take 4*n bytes from point A and put them at point B" done once.
closed account (1vRz3TCk)
It could just be that memcpy is not implemented in a way that is conducive to speed. Try some other compilers and see if you get different results.
I would prefer std::copy for this. Also, I'm not sure I understand why there is a loop on line 1. Isn't it a question of line 2 vs. line 3?
For regular int arrays, the std::copy method seems a bit too generic/fancy. I'll test it as well just to see.

The loop on line 1 is just to repeat the test 50 times for less randomness and more precision.

[edit]

std::copy has the same runtime as memcpy (0.67sec).

Another oddity: I heard a "common" optimization trick for loops was reversing the iteration order (comparing against 0 being faster than comparing against a constant). Reorganizing the for loop to this:
for (int i = icount; i-- > 0;) b[i] = a[i];
leads to a runtime of... (drum roll): 0.67sec. The increment-for still clocks at 0.56~0.59sec.

(All tests are done using MS Visual Studio 2010.)
C++11 code, GNU G++ 4.6.1, highest optimization level.
Results on my computer: memcpy() always fastest, followed by std::copy().

#include <cstdlib>
#include <cstring>
#include <cstddef>
#include <iostream>
#include <chrono>
#include <algorithm>

#define MY_LITTLE_PONY  int
#define NOW_TIME        std::chrono::high_resolution_clock::now()
#define MILLICAST       std::chrono::duration_cast<std::chrono::milliseconds>

int main()
{
    const size_t sz = 100000000;

    MY_LITTLE_PONY *a = nullptr;
    MY_LITTLE_PONY *b = nullptr;

    try
    {
        a = new MY_LITTLE_PONY[sz];
        b = new MY_LITTLE_PONY[sz];
    }
    catch (std::bad_alloc &e)
    {
        std::cerr << "Bad allocation: " << e.what() << std::endl;
        return EXIT_FAILURE;
    }


    auto memcpy_time_s = NOW_TIME;
    // ---
    memcpy(a, b, sz * sizeof (MY_LITTLE_PONY));
    // ---
    auto memcpy_time_e = NOW_TIME;


    auto for_time_s = NOW_TIME;
    // ---
    for (size_t i=0; i < sz; ++i)
        a[i] = b[i];
    // ---
    auto for_time_e = NOW_TIME;


    auto copy_time_s = NOW_TIME;
    // ---
    std::copy(b, b + sz, a);
    // ---
    auto copy_time_e = NOW_TIME;


    delete[] a;
    delete[] b;

    std::cout << "Durations in milliseconds for " << sz * sizeof (MY_LITTLE_PONY) << " bytes of data:\n\n";
    std::cout << "memcpy()\t"    << MILLICAST(memcpy_time_e - memcpy_time_s).count() << '\n';
    std::cout << "for loop\t"    << MILLICAST(for_time_e - for_time_s).count() << '\n';
    std::cout << "std::copy()\t" << MILLICAST(copy_time_e - copy_time_s).count() << '\n';
    std::cout << std::endl;

    system("PAUSE");
    return EXIT_SUCCESS;
}


Edit: deleted my own output which is irrelevant.
Edit2: fixed some mistakes, sorry.
Any chance you could post exact results?

I ran your code on MS Visual Studio 2010 (after adjusting some things: it doesn't have the <chrono> header, so I changed the timing to clock() from <ctime>).

memcpy: 285
for loop: 135
std::copy: 127
Here you go...
Durations in milliseconds for 230000000 bytes of data:

memcpy()        250
for loop        109
std::copy()     125

Press any key to continue . . .


The values change every time the program is run but they're consistent: for() is fastest.
So it seems std::copy() in MSVC is better optimized?

As for memcpy(), I think I understand why it's slower: it is type-agnostic.
It receives two void pointers (meaning "I don't care what type they are") and the number of bytes it has to copy.

Therefore the only straightforward implementation for it is a loop that copies the data byte by byte.

As for std::copy(), I think there's a small overhead due to it being a template... although not in your case, it seems?
templates don't have overhead

It is hard to make good benchmarks.

This is the output I get when running Catfish's program as-is.
Durations in milliseconds for 400000000 bytes of data:

memcpy()	1734
for loop	1050
std::copy()	986


If I change the order and make the memcpy test last I get a different winner
Durations in milliseconds for 400000000 bytes of data:

memcpy()	523
for loop	1883
std::copy()	1017
I don't like this thread anymore, lol.

So I guess cache is the culprit?
Meaning that the b array must be assigned new random content in between the tests?
My original testing code only used one method per test, so no "in-runtime" things can influence the other test.

I don't see why being type-agnostic would make it slower. As I see it, the instructions boil down to "get memory block [start, end] and copy it". A for loop can make no such assumptions; it can't even be sure the blocks are side by side (e.g. I could tell it to copy only every second item). In terms of overhead, 1*n should be much better than n*1, no?
Garminic, did you disable Checked Iterators? http://msdn.microsoft.com/en-us/library/aa985965.aspx

Catfish, yes, it's a cache issue. Going through the arrays once before the tests should give more reliable results. It looks like the for loop and std::copy are about equally fast. memcpy is faster, but remember that memcpy doesn't give any guarantees if the ranges overlap. This could be the reason why memcpy can be much faster.
closed account (1vRz3TCk)
Would it not depend on the algorithm used in memcpy? A naive version may just copy a byte at a time, irrespective of the size of the memory bus. Maybe it is optimized in a way that is not as optimal as your for loop for that specific task. Maybe there is more code in there to make it more secure? Knowing Microsoft, the actual function is hidden behind several defines and the one you get will depend on your settings.
Yes, but have you ever seen a naive memcpy()? memcpy() is, in any usable implementation, the fastest method to copy non-overlapping blocks of data.

My original testing code only used one method per test, so no "in-runtime" things can influence the other test.
Well, post it, then.
If you are copying lots of data in one go, use memcpy. If you are copying small amounts of data, use for loops. It takes time for the program to set up the memcpy call, whereas for loops are compiled directly into native code.
In my experience, memcpy() is occasionally the slowest solution because it's a precompiled library function, a black box: the C++ compiler has no access to its source code and cannot use context-dependent optimizations. On the other hand, std::copy() and its more cumbersome equivalent, the for loop, are presented to the compiler as source code, and the compiler is free to rearrange it any way it feels like.

In addition, at least on my Linux system, memcpy() is compiled to be backwards compatible with older CPU models, while the C++ compiler compiles std::copy() and for loops using SSE instructions.

When in doubt, look at the assembly language output.
memcpy() is often the slowest solution because it's a precompiled library function, a black box: the C++ compiler has no access to its source code and cannot use context-dependent optimizations.
This argument doesn't really hold up to scrutiny. For example, it doesn't explain why the default copy constructor for PODs often calls memcpy(). If the compiler can really generate faster code than memcpy(), that would be an ideal place to do it.

When in doubt, look at the assembly language output.
My compiler calls memmove() for std::copy() with arrays of chars, and generates a naive loop with a couple MOVS instructions for std::copy() with arrays of this:
struct POD{
	char a[10];
};
The latter is over ten times slower than memcpy(). Turning on SIMD instruction set optimizations doesn't do anything.
Did you change the size of the arrays? If all you did was make the size of each element ten times bigger, it's not strange that it gets ten times slower.
No, of course I took something as obvious as sizes into consideration.
However, I made a mistake in the benchmark that made the compiler remove a whole chunk of code, which is why the array of PODs was so much faster. I'll have to trick it into behaving with I/O.

EDIT: Wait, no. I forgot to multiply by sizeof(T). memcpy() is still nearly twice as fast, though.