memcpy

@helios
This argument doesn't really hold up to scrutiny

I'm simply sharing my programming experience, which may differ from other people's. Your mileage may vary; when in doubt, profile and check the assembly, as you did.

I agree that many C++ implementations do in fact compile std::copy() on vectors and arrays of trivially-copyable types (including your POD type, in my tests) down to a call to memcpy(), memmove(), or some non-standard precompiled equivalent. This is a good optimization in the general case, although it may be suboptimal in the situations I mentioned earlier. Having given it some thought (and run some tests), I'm changing "is often the slowest" to "is occasionally the slowest" in my earlier post.
memcpy() should almost never be slower than any other method, since it's specifically written to be the way to copy raw blocks of data. Using SIMD instructions might be the only exception, though there's no reason why memcpy() can't use said instructions. In fact, I wouldn't be surprised if ICC's runtime was capable of using them.
Yes, ICC's _intel_fast_memcpy is what I meant by "non-standard equivalent".

Besides potentially different CPU support between the C library and the C++ compiler, my only other point was contextual optimization. For a trivial example, not many compilers eliminate redundant calls to memcpy() (I was going to say none, but clang++ just eliminated them for me), while redundant assignment elimination is common.
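For anyone who wants to poke at this, here's the kind of redundancy I have in mind (a minimal sketch; the type and function names are made up):

#include <cstring>

struct Block { int data[64]; };

// Two identical back-to-back copies. As plain assignments, redundant-store
// elimination routinely removes the second one; as memcpy() calls, many
// compilers treat the call as opaque and emit both.
void copy_twice_memcpy(Block* dst, const Block* src) {
    std::memcpy(dst, src, sizeof(Block));
    std::memcpy(dst, src, sizeof(Block));  // redundant, but often kept
}

void copy_twice_assign(Block* dst, const Block* src) {
    *dst = *src;
    *dst = *src;  // redundant; commonly eliminated
}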
Any sensible implementation will make sure that the most optimized variant available for the machine's instruction set ends up being executed when calling memcpy.

So I guess cache is the culprit?
Meaning that the b array must be assigned new random content in between the tests?

It doesn't have anything to do with the cache; it's just that whichever piece of code goes over the array first runs into a lot of page faults, since that's when the memory is accessed for the first time.
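If you want to take those first-touch faults out of the measurement entirely, write to the destination once before starting the clock. A minimal sketch (prefault is a made-up helper):

#include <cstring>

// Write to every page of the destination before the timed region, so the
// first copy variant isn't charged for the OS faulting the pages in.
void prefault(void* dst, size_t n) {
    std::memset(dst, 0, n);  // write-faults (and zero-fills) every page
}

// usage, before the timing loop:
//   prefault(b, msize);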
redundant assignment elimination is common
Do you mean loops are "trimmed" when, say, two consecutive std::copy() calls overlap? Or something different?
So here's what I get after doing an initial warm-up copy (times in ms):

memcpy()     115
for loop     137
std::copy()  137


One thing to observe is that the for loop uses a short loop copying 16 bytes at a time with movdqu, while std::copy generates a call to memmove, which likely does the same. Compiling with -march=native causes the compiler to unroll the for loop and to use prefetch instructions, improving the time to 124 ms.
It would also be worth trying a for loop that just increments a pointer rather than using the subscript operator.
for (i = 0; i < icount; ++i) *to++ = *from++;  // counting up
for (i = icount; i-- > 0;)   *to++ = *from++;  // counting down


Both clock in at 0.55 s, (not necessarily significantly) faster than the regular for loop [which is itself still faster than all the others].

Just for completeness: memcpy() is not required to handle memory areas that overlap! The memmove() function can handle overlapping memory. There was a recent bug in Linux where memcpy was broken (http://sourceware.org/bugzilla/show_bug.cgi?id=12518). Use memmove() instead!
memcpy was not broken. It was just that many people used it incorrectly.
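To make the distinction concrete (a sketch; shift_left is a made-up name):

#include <cstring>

// Shift the tail of a buffer to its beginning. The source and destination
// ranges overlap whenever by < len, so memcpy() here would be undefined
// behavior; memmove() is specified to handle exactly this.
void shift_left(char* buf, size_t len, size_t by) {
    std::memmove(buf, buf + by, len - by);
    // std::memcpy(buf, buf + by, len - by);  // UB: the ranges overlap
}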
Here's a little information about the GNU implementations to use as a reference:

This snippet is from the GNU implementation of the STL (2.90.8). The file is stl/bits/stl_algobase.h.
//--------------------------------------------------
// copy

// All of these auxiliary functions serve two purposes.  (1) Replace
// calls to copy with memmove whenever possible.  (Memmove, not memcpy,
// because the input and output ranges are permitted to overlap.)
// (2) If we're using random access iterators, then write the loop as
// a for loop with an explicit count.

template <class _InputIter, class _OutputIter, class _Distance>
inline _OutputIter __copy(_InputIter __first, _InputIter __last,
                          _OutputIter __result,
                          input_iterator_tag, _Distance*)
{
  for ( ; __first != __last; ++__result, ++__first)
    *__result = *__first;
  return __result;
}

template <class _RandomAccessIter, class _OutputIter, class _Distance>
inline _OutputIter
__copy(_RandomAccessIter __first, _RandomAccessIter __last,
       _OutputIter __result, random_access_iterator_tag, _Distance*)
{
  for (_Distance __n = __last - __first; __n > 0; --__n) {
    *__result = *__first;
    ++__first;
    ++__result;
  }
  return __result;
}

template <class _Tp>
inline _Tp*
__copy_trivial(const _Tp* __first, const _Tp* __last, _Tp* __result) {
  memmove(__result, __first, sizeof(_Tp) * (__last - __first));
  return __result + (__last - __first);
}
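In practice this dispatch means the very same std::copy() call can compile to very different code depending on the element type. A quick illustration (my own sketch, based on the snippet above):

#include <algorithm>
#include <cstddef>
#include <string>

void copy_ints(int* dst, const int* src, std::size_t n) {
    std::copy(src, src + n, dst);  // pointers to a trivial type:
                                   // routed to __copy_trivial -> memmove
}

void copy_strings(std::string* dst, const std::string* src, std::size_t n) {
    std::copy(src, src + n, dst);  // non-trivial assignment operator:
                                   // element-by-element loop
}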


From glibc 2.11.1, in string/memmove.c:
/* Copy memory to memory until the specified number of bytes
   has been copied.  Overlap is handled correctly.
   Copyright (C) 1991, 1995, 1996, 1997, 2003 Free Software Foundation, Inc.
   This file is part of the GNU C Library.
   Contributed by Torbjorn Granlund (tege@sics.se).

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, write to the Free
   Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
   02111-1307 USA.  */

#include <string.h>
#include <memcopy.h>
#include <pagecopy.h>

/* All this is so that bcopy.c can #include
   this file after defining some things.  */
#ifndef	a1
#define	a1	dest	/* First arg is DEST.  */
#define	a1const
#define	a2	src	/* Second arg is SRC.  */
#define	a2const	const
#undef memmove
#endif
#if	!defined(RETURN) || !defined(rettype)
#define	RETURN(s)	return (s)	/* Return DEST.  */
#define	rettype		void *
#endif


rettype
memmove (a1, a2, len)
     a1const void *a1;
     a2const void *a2;
     size_t len;
{
  unsigned long int dstp = (long int) dest;
  unsigned long int srcp = (long int) src;

  /* This test makes the forward copying code be used whenever possible.
     Reduces the working set.  */
  if (dstp - srcp >= len)	/* *Unsigned* compare!  */
    {
      /* Copy from the beginning to the end.  */

      /* If there not too few bytes to copy, use word copy.  */
      if (len >= OP_T_THRES)
	{
	  /* Copy just a few bytes to make DSTP aligned.  */
	  len -= (-dstp) % OPSIZ;
	  BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

	  /* Copy whole pages from SRCP to DSTP by virtual address
	     manipulation, as much as possible.  */

	  PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

	  /* Copy from SRCP to DSTP taking advantage of the known
	     alignment of DSTP.  Number of bytes remaining is put
	     in the third argument, i.e. in LEN.  This number may
	     vary from machine to machine.  */

	  WORD_COPY_FWD (dstp, srcp, len, len);

	  /* Fall out and copy the tail.  */
	}

      /* There are just a few bytes to copy.  Use byte memory operations.  */
      BYTE_COPY_FWD (dstp, srcp, len);
    }
  else
    {
      /* Copy from the end to the beginning.  */
      srcp += len;
      dstp += len;

      /* If there not too few bytes to copy, use word copy.  */
      if (len >= OP_T_THRES)
	{
	  /* Copy just a few bytes to make DSTP aligned.  */
	  len -= dstp % OPSIZ;
	  BYTE_COPY_BWD (dstp, srcp, dstp % OPSIZ);

	  /* Copy from SRCP to DSTP taking advantage of the known
	     alignment of DSTP.  Number of bytes remaining is put
	     in the third argument, i.e. in LEN.  This number may
	     vary from machine to machine.  */

	  WORD_COPY_BWD (dstp, srcp, len, len);

	  /* Fall out and copy the tail.  */
	}

      /* There are just a few bytes to copy.  Use byte memory operations.  */
      BYTE_COPY_BWD (dstp, srcp, len);
    }

  RETURN (dest);
}
#ifndef memmove
libc_hidden_builtin_def (memmove)
#endif 
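One detail worth pulling out of that listing is the overlap test near the top. Because dstp and srcp are unsigned, dstp - srcp wraps around when the destination is below the source, so the single compare is true exactly when a forward copy is safe. Isolated as a standalone sketch:

#include <cassert>
#include <cstddef>
#include <cstdint>

// True exactly when dst does not start inside [src, src + len), i.e.
// whenever copying forward (low to high addresses) is safe.
bool forward_copy_safe(std::uintptr_t dstp, std::uintptr_t srcp, std::size_t len) {
    return dstp - srcp >= len;  // unsigned subtraction wraps when dstp < srcp
}

int main() {
    assert(forward_copy_safe(0, 100, 50));     // dst below src: forward is fine
    assert(forward_copy_safe(200, 100, 50));   // disjoint ranges: forward is fine
    assert(!forward_copy_safe(120, 100, 50));  // dst starts inside src: go backward
    return 0;
}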
memcpy() wasn't broken? I beg to differ:

Linus Torvalds 2010-11-06 12:50:36 EDT
(In reply to comment #29)
> Looking at the changelog for glibc-2.12.90-4 shows:
>
> * Fri Jul 02 2010 Andreas Schwab <schwab@redhat.com> - 2.12.90-4
> [...]
> - Improve 64bit memcpy/memmove for Atom, Core 2 and Core i7
> [...]
> I used chromium to run valgrind on the flash plugin [...]
> ==2100== Thread 9:
> ==2100== Source and destination overlap in memcpy(0x256d7170, 0x256d7570, 1280)
> ==2100== at 0x4A06A3A: memcpy (mc_replace_strmem.c:497)

That does look like a very possible smoking gun.

Normally, a memcpy that copies _downwards_ (like the one above) will work
perfectly well in practice, because the "natural" way to do memcpy() by making
it just copy things upwards will "just work" even for the overlapping case.

So it would be a bug to use memcpy for overlapping areas, but it would be a
bug that normally would never show up.

But if the new improved 64bit memcpy started copying things backwards, it might
cause trouble with such an overlapping memcpy user.

[ That said - why the heck would you ever do memcpy() backwards? Things like
automatic prefetchers usually work better for the "natural" patterns, so a
backwards memcpy is generally a bad thing to do unless you have some active
reason for it, like doing a memmove() ]

> On a related note, is there a glibc environment flag that can be set to force a
> generic memcpy routine to run?

Indeed, that would be very interesting to see.

Andreas?
I still think memcpy() wasn't broken, and that the blame lies with whoever implemented the flash plugin.
Normally, a memcpy that copies _downwards_ (like the one above) will work
perfectly well in practice, because the "natural" way to do memcpy() by making
it just copy things upwards will "just work" even for the overlapping case.
Don't both of them break in different cases? Upwards copy breaks when the start of dst is inside src, and downwards copy breaks when the end of dst is inside src.
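For instance, a byte-wise forward copy shows the first failure mode (quick sketch):

#include <cstddef>
#include <cstdio>

// Naive forward ("upwards") byte copy, like a simple memcpy.
void copy_fwd(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

int main() {
    char buf[] = "abcdefg";
    // dst (buf + 2) starts inside src (buf): by the third byte the loop is
    // re-reading data it has already overwritten.
    copy_fwd(buf + 2, buf, 4);
    std::printf("%s\n", buf);  // prints "abababg"; memmove would give "ababcdg"
    return 0;
}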
Peter87, I think that practically speaking you are right: the real problem is that those memcpy() calls were made on overlapping buffers; however, as Linus points out, the implementers have a duty to be backwards compatible, and their change "broke" this bad flash code. So I'll agree with you and split this hair. You can have the larger piece, even. ;)

BTW, we use Faircom C-Tree for a database engine (don't ask) and they had a similar problem too!

Thanks for your input!
So the conclusion of this thread is that memcpy() is faster (or should be) than for loops written for the same purpose, except... in Gaminic's tests?

Also, Gaminic, how did you originally benchmark? I suppose you used a debugger/profiler external to your program?
And could we have the relevant source? I still wonder if icount is a macro or a variable.
Nothing that fancy. I wouldn't know how to make an actual benchmark; I just copied the code I would use in my project to a separate file and timed it using clock(). I didn't do any command line twiddling, because for the project I'm working on I won't have control over the final compiler settings. Here's the code:

#include <iostream>
#include <algorithm> // std::copy
#include <cstring>   // memcpy
#include <cstdio>    // printf
#include <ctime>

//#define icount 10000000
const int icount = 10000000;
const int msize = icount*sizeof(int);
int a[icount], b[icount], i;
int main() {
	clock_t start_time, elapsed_time;
	int *to, *from;
	for (i = 0; i < icount; ++i) a[i] = i*i-i;
	start_time = clock();
	for (int j = 0; j < 50; ++j) {
	//	to = b;
	//	from = a;
		memcpy(b, a, msize); // 0.67s // 640
	//	for (i = icount; i-- > 0;) b[i] = a[i]; // 0.67s // 640
	//	for (i = 0; i < icount; ++i) b[i] = a[i]; // 0.56s // 546
	//	for (i = 0; i < icount; ++i)  *to++ = *from++; // 0.55s // 546
	//	for (i = icount; i-- > 0;) *to++ = *from++; // 0.55s // 546
	//	std::copy(a, a+icount, b); // 0.67s // 640
	}
	elapsed_time = clock();
	printf ("\n	Computation time: %ld", (long)(elapsed_time - start_time));
	return 0;
}

I tried icount both as a macro and as a const int (I thought checking the loop counter against a literal might be faster than against a variable), but it had no effect. I tried adding a no-bounds-check macro as proposed somewhere on the previous page, but apparently that's already in the default release-mode settings of Visual Studio 2010.

Feel free to comment on my barbaric benchmarking methods. I'd appreciate some tips on how to do it properly (I have a profiler, but I haven't figured out anything beyond "function X takes up most of my time", which isn't useful here), or any general tips.
Actually, for the special case of int, simple loops do seem to generate code that's comparable to memcpy()'s. Are your times averaged? The min and max times for this type of code can be up to 30% apart.
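If they aren't averaged, a minimal harness that repeats the run, reports min and average, and prints something derived from the destination (so the copies can't be optimized away) could look like this. A rough sketch, not a rigorous benchmark:

#include <cstdio>
#include <cstring>
#include <ctime>

const int icount = 10000000;
static int a[icount], b[icount];

int main() {
    for (int i = 0; i < icount; ++i) a[i] = i;
    const int runs = 10;
    double best = 1e30, total = 0.0;
    for (int r = 0; r < runs; ++r) {
        clock_t t0 = clock();
        memcpy(b, a, sizeof a);
        clock_t t1 = clock();
        double ms = 1000.0 * double(t1 - t0) / CLOCKS_PER_SEC;
        if (ms < best) best = ms;
        total += ms;
    }
    // printing a value that depends on b keeps the copies observable
    printf("min %.1f ms, avg %.1f ms, check %d\n", best, total / runs, b[icount - 1]);
    return 0;
}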

By the way, would someone like to take a stab at why this thing's times get really short after the first iteration? Even if I increase N to 500 MiB, the time still drops to zero, which is absurd.
#include <iostream>
#include <iomanip>
#include <fstream>
#include <algorithm>
#include <cstring> // memcpy
#include <cstdlib>
#include <ctime>

unsigned N=100*1024*1024;

struct POD{
	char a[10];
};

template <typename T>
void f0(T *dst,const T *src){
	std::copy(src,src+N,dst);
}

template <typename T>
void f1(T *dst,const T *src){
	memcpy(dst,src,N*sizeof(T));
}

template <typename T>
void f2(T *dst,const T *src){
	for (unsigned a=0;a<N;a++)
		dst[a]=src[a];
}

template <typename T>
void f3(T *dst,const T *src){
	const T *s=src;
	T *d=dst;
	for (unsigned a=0;a<N;a++)
		*d++=*s++;
}

template <typename T>
T *make_array(){
	T *buffer=new T[N];
	for (unsigned a=0;a<N;a++)
		buffer[a]=(T)rand();
	return buffer;
}

template <>
POD *make_array(){
	POD *buffer=new POD[N];
	POD temp;
	for (int a=0;a<10;a++)
		temp.a[a]=a+1;
	for (unsigned a=0;a<N;a++)
		buffer[a]=temp;
	return buffer;
}

template <typename T,void F(T *,const T *)>
double benchmark(){
	T *src=make_array<T>();
	T *dst=make_array<T>();
	clock_t t0,t1;
	t0=clock();
	F(dst,src);
	t1=clock();
	double r=double(t1-t0)/double(CLOCKS_PER_SEC);
	std::cout <<std::setprecision(6)<<r<<std::endl;
	std::ofstream file("test.bin",std::ios::binary);
	file.write((const char *)dst,10);
	file.close();
	delete[] src;
	delete[] dst;
	return r;
}

struct results{ double r[4]; };

template <typename T>
results benchmark(){
	// N is global, so each of the 10 calls in main() shrinks it by another
	// factor of sizeof(T) -- which is why the times collapse after the first run
	N/=sizeof(T);
	results r;
	r.r[0]=benchmark<T,f0>();
	r.r[1]=benchmark<T,f1>();
	r.r[2]=benchmark<T,f2>();
	r.r[3]=benchmark<T,f3>();
	return r;
}

int main(){
	srand(time(0));
	results avg;
	std::fill(avg.r,avg.r+4,0);
	for (int a=0;a<10;a++){
		results r=benchmark<int>();
		for (int b=0;b<4;b++)
			avg.r[b]+=r.r[b];
	}
	for (int a=0;a<4;a++)
		avg.r[a]/=10;
	std::cout <<"\nAverages:\n"
		"std::copy:         "<<avg.r[0]<<"\n"
		"memcpy():          "<<avg.r[1]<<"\n"
		"for with offsets:  "<<avg.r[2]<<"\n"
		"for with pointers: "<<avg.r[3]<<"\n";
	return 0;
}
I'm not really following your code, but I have seen that my compiler is able to filter out unused stuff. My usual approach is to do a small computation with the generated numbers (e.g. adding the last element to a useless sum variable that is printed at the end).
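In minimal form, the trick looks like this (a self-contained sketch of that idea):

#include <cstdio>
#include <cstring>

const int icount = 10000000;
static int a[icount], b[icount];

int main() {
    long long sum = 0;
    for (int i = 0; i < icount; ++i) a[i] = i;
    for (int j = 0; j < 50; ++j) {
        memcpy(b, a, sizeof a);
        sum += b[icount - 1];  // consume each copy's result
    }
    printf("%lld\n", sum);     // printing sum keeps the copies from being elided
    return 0;
}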
I have seen that my compiler is able to filter out unused stuff
I checked the compiler output to make sure it wasn't getting smart with me. The functions it calls in each iteration are the same.