Faster memcpy routine

// It can be further enhanced by making it 64-bit.

unsigned long copy_linear(void* read, void* write, unsigned long size);

// Speed up the memcpy process!
unsigned long copy_linear(void* read, void* write, unsigned long size)
{
    int* w4 = (int*)write;
    int* r4 = (int*)read;
    unsigned long scan = 0;

    // Copy one int (assumed to be 4 bytes) per iteration.
    while (size > 4)
    {
        w4[scan] = r4[scan];
        size -= 4;
        scan++;
    }

    unsigned long adjust = scan * 4; // byte offset of the remaining bytes (at most 4)
    char* w1 = (char*)write;
    char* r1 = (char*)read;
    scan = 0;

    // Copy whatever is left one byte at a time.
    while (size > 0)
    {
        w1[scan + adjust] = r1[scan + adjust];
        scan++;
        size--;
    }

    return adjust + scan; // number of bytes copied
}

Cubbi (4774)

Faster than what, specifically?

DeXecipher (458)

I thought int was 32 bits on most machines. Also, the speed of that code will not be portable across machines: you would have to detect the native word size for the system, and int is not always that.
Int is 64-bit on most 64-bit machines, right?

DeXecipher (458)

I have an instinct that strcpy, memcpy, memmove, etc. are all implemented in optimized assembly, and in that case they will all be faster.

Catfish4 (666)

From what I know GCC provides some already optimized builtin functions, and memcpy() is among them:

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
http://stackoverflow.com/questions/11747891/when-builtin-memcpy-is-replaced-with-libcs-memcpy
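As a rough illustration of the builtin mentioned above, the sketch below calls `__builtin_memcpy` directly when compiled with GCC or a GCC-compatible compiler and falls back to `std::memcpy` elsewhere. The wrapper name `copy_bytes` is made up for this example and is not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper, for illustration only.
inline void* copy_bytes(void* dst, const void* src, std::size_t n)
{
    assert(dst != nullptr && src != nullptr);
#if defined(__GNUC__)
    // GCC-style builtin; the compiler may expand it inline or call libc.
    return __builtin_memcpy(dst, src, n);
#else
    return std::memcpy(dst, src, n);
#endif
}
```

In ordinary code a plain call to `memcpy` is normally enough, since the compiler is free to treat it as the builtin on its own.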

"Int is 64-bit on most 64-bit machines, right?"

No, from what I know long int is the word size.

This means sizeof (long int) should equal 4 on a 32-bit machine, and 8 on a 64-bit machine (assuming a byte is an octet). I can't test right now.
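A quick way to check this on a given machine is to print the widths directly. The helper below is a minimal sketch (the name `type_widths` is invented here). On the usual 64-bit Linux and macOS toolchains, long is 64 bits, while on 64-bit Windows it stays at 32 bits; int is 32 bits on both, so neither type is a reliable stand-in for the word size.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Hypothetical helper: reports the width in bits of a few types.
inline std::string type_widths()
{
    // The standard only promises that int has at least 16 bits.
    assert(sizeof(int) * CHAR_BIT >= 16);
    return "int=" + std::to_string(sizeof(int) * CHAR_BIT) +
           " long=" + std::to_string(sizeof(long) * CHAR_BIT) +
           " pointer=" + std::to_string(sizeof(void*) * CHAR_BIT);
}
```

Printing the result with `std::cout << type_widths() << '\n';` on each target shows which data model the compiler uses.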

DeXecipher (458)

Catfish4. Thanks

DeXecipher (458)

And nice coding style, OP: simple and straightforward.

Catfish4 (666)

Hmm, I may have been wrong about long int.
http://stackoverflow.com/questions/9988663/c-word-size-and-standard-size

mpauna (21)

The OP's code has a big portability bug. In the int loop, "sizeof(int)" should be used instead of hardcoding 4. As others have hinted, the size of an int isn't fixed and may vary between processors, compilers, and even compiler options. The only guarantee is that an int is at least 16 bits wide.
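To make the sizeof(int) point concrete, here is one possible variant of the OP's routine. It is only a sketch: the name `copy_linear_portable` is invented, and it still assumes that both pointers are suitably aligned for int, that the memory may legitimately be accessed as int, and that the two ranges do not overlap.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of the OP's routine (same argument order: read, write, size).
// Assumes int-aligned, non-overlapping buffers.
inline std::size_t copy_linear_portable(const void* read, void* write, std::size_t size)
{
    assert(read != nullptr && write != nullptr);

    const std::size_t words = size / sizeof(int);  // whole ints to copy
    const int* r = static_cast<const int*>(read);
    int* w = static_cast<int*>(write);
    for (std::size_t i = 0; i < words; ++i)
        w[i] = r[i];

    // Copy the remaining size % sizeof(int) bytes one at a time.
    const unsigned char* rb = static_cast<const unsigned char*>(read);
    unsigned char* wb = static_cast<unsigned char*>(write);
    for (std::size_t i = words * sizeof(int); i < size; ++i)
        wb[i] = rb[i];

    return size;  // number of bytes copied
}
```

Nothing here suggests it would beat memcpy(); it only removes the hardcoded 4.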

Before considering this as a "faster" replacement for memcpy(), you should determine what (if any) constraints you need and what the target goals are. And then you should actually test to see if your performance is actually better.

For instance, how portable does the replacement need to be? What processor/compiler is the target? Do we need to worry about misaligned memory? Do we need to worry about overlapping memory?
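On the overlap question: memcpy() has undefined behaviour when the source and destination ranges overlap, while memmove() is specified to handle that case. A small sketch (the helper `shift_right` is hypothetical) that copies within a single buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies the first (size - shift) bytes of the text
// `shift` positions to the right inside the same buffer, so the source and
// destination ranges overlap.
inline std::string shift_right(std::string text, std::size_t shift)
{
    assert(shift <= text.size());
    const std::size_t n = text.size() - shift;
    // memmove behaves as if the bytes went through a temporary buffer.
    std::memmove(&text[0] + shift, &text[0], n);
    return text;
}
```

For example, `shift_right("abcdef", 2)` yields `"ababcd"`. A hand-written forward loop like the OP's would need a direction check to get the same result.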

The OP's code seems to assume that both "read" and "write" are aligned on integer boundaries, even if the size is not a multiple of an int. For misaligned accesses there may be a minor performance hit (the access costs only a couple of extra cycles while it is split across cache accesses), a major performance hit (every misaligned access generates an exception/interrupt to handle the access), or a fatal error (misaligned accesses are not supported).
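A copy routine that wants to use the wide loop only when it is safe could test the pointers first. The check below is a minimal sketch with an invented name, `is_aligned`. It relies on the pointer-to-integer conversion reflecting the address, which is implementation-defined but holds on mainstream flat-address platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A caller might fall back to the byte loop whenever `is_aligned(read, alignof(int))` or `is_aligned(write, alignof(int))` is false.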

Depending upon the level of compiler optimization, the OP's code may actually perform 1 comparison, 2 multiplications, and 4 additions for each int loop iteration, and 1 comparison and 8 additions for each char loop iteration. Most simple implementations require only 1 comparison and 2 additions (by using pointers instead of array indexing).
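The pointer-based form referred to above might look like the following sketch (the name `copy_bytes_ptr` is invented for this example). Each iteration does one comparison and two pointer increments, with no index arithmetic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical byte-wise copy using pointers instead of array indexing.
// Assumes the ranges do not overlap.
inline void* copy_bytes_ptr(void* write, const void* read, std::size_t size)
{
    assert(size == 0 || (write != nullptr && read != nullptr));
    unsigned char* w = static_cast<unsigned char*>(write);
    const unsigned char* r = static_cast<const unsigned char*>(read);
    const unsigned char* const end = r + size;
    while (r != end)   // one comparison per byte
        *w++ = *r++;   // two pointer increments per byte
    return write;
}
```

With optimization enabled, an optimizing compiler may well generate the same code for the indexed and pointer versions, or even replace the loop with a call to memcpy(), so the difference matters mostly in unoptimized builds.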

Even the idea of trying to optimize memcpy() raises flags. Have you done any testing to determine if memcpy() needs optimization? As others have indicated, memcpy() is usually highly optimized for standard usage while still handling corner cases such as misaligned copies. For instance, several implementations that I have looked at actually copy entire cache lines (or more) when possible instead of going byte by byte or "int" by "int".

kbw (9492)

memcpy is an intrinsic function on many compilers. I'd be surprised if you could write code in C that is faster.
http://en.wikipedia.org/wiki/Intrinsic_function

DeXecipher (458)

why was I reported????

Duthomhas (13310)

OP will not be able to produce code faster than standard strxxx(), memxxx() routines.

Not only is it optimized assembly, it is optimized to handle corner cases where data to move is not aligned in convenient ways.

And, IMO, if your compiler doesn't ship these functions optimized to that degree, your compiler is junk.

BTW, it is hubris to think you can do better than standard routines that are at the core of optimized routines in use for over two decades.

If you want to take issue with something, you must have at minimum the following:
(1) benchmarks
(2) code that proves better performance on the same benchmarks

Good luck!
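For anyone who wants to gather such numbers, a very small timing harness could look like the sketch below. The name `average_ns` and its parameters are invented for this example. It is not a rigorous benchmark: a serious comparison would also need warm-up runs, several buffer sizes and alignments, and care that the optimizer does not remove the work.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical harness: calls copy(dst, src, n) `rounds` times and returns
// the average time per call in nanoseconds.
template <typename CopyFn>
double average_ns(CopyFn copy, std::size_t n, int rounds)
{
    assert(n > 0 && rounds > 0);
    std::vector<unsigned char> src(n, 0xAB), dst(n, 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds; ++i) {
        src[0] = static_cast<unsigned char>(i);  // vary the input between rounds
        copy(dst.data(), src.data(), n);
    }
    const auto stop = std::chrono::steady_clock::now();
    assert(dst == src);  // the copy must also be correct
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / rounds;
}
```

memcpy() can be timed by passing a lambda such as `[](void* d, const void* s, std::size_t n) { std::memcpy(d, s, n); }`. The OP's copy_linear() takes the source first, so it would need a similar lambda that swaps the arguments.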

Topic archived. No new replies allowed.
