I thought int was 32 bits on most machines. Also, the speed of that code won't be portable across machines. You would have to detect the native word size of the system, and int is not always that.
Int is 64-bit on most 64-bit machines, right?
The OP's code has a big portability bug: in the int loop, sizeof(int) should be used instead of hardcoding 4. As others have hinted, the size of an int isn't fixed and may vary between processors, compilers, and even compiler options. The only guarantee is that an int is at least 16 bits wide.
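As a rough illustration of the sizeof(int) point, here is a minimal sketch of a word-wise copy that derives the word count and the leftover byte count from sizeof(int) rather than from a literal 4. The name copy_words is hypothetical and this is not the OP's code; it assumes both buffers are int-aligned, actually hold int objects, and do not overlap.

```c
#include <stddef.h>

/* Hypothetical helper, not the OP's code. Assumes dst and src are both
   aligned for int, refer to int objects (so the int accesses are valid),
   and do not overlap. */
void copy_words(void *dst, const void *src, size_t n)
{
    int *d = dst;
    const int *s = src;
    size_t words = n / sizeof(int);  /* whole ints, whatever sizeof(int) is */
    size_t tail  = n % sizeof(int);  /* leftover bytes after the last int */

    for (size_t i = 0; i < words; i++)
        *d++ = *s++;

    /* Copy the remaining bytes one at a time. */
    unsigned char *db = (unsigned char *)d;
    const unsigned char *sb = (const unsigned char *)s;
    for (size_t i = 0; i < tail; i++)
        *db++ = *sb++;
}
```

Nothing here depends on int being 4 bytes, so the same source works whether the compiler uses a 16-, 32-, or 64-bit int.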
Before considering this as a "faster" replacement for memcpy(), you should determine what (if any) constraints you need and what the target goals are. And then you should test to see whether your performance is actually better.
For instance, how portable does the replacement need to be? What processor/compiler is the target? Do we need to worry about misaligned memory? Do we need to worry about overlapping memory?
The OP's code seems to be assuming that both "read" and "write" are aligned on integer boundaries, even if the size is not a multiple of an int. For misaligned accesses there may be a minor performance hit (misaligned accesses suffer only a couple of extra cycles splitting up cache accesses), a major performance hit (every misaligned access generates an exception/interrupt for handling the access), or a fatal error (misaligned accesses are not supported).
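One way to avoid relying on that assumption is to check alignment before taking the int path and fall back to a byte copy otherwise. The sketch below uses hypothetical helper names and assumes a C11 compiler (for _Alignof) and a platform where converting a pointer to uintptr_t yields its address, which is common but implementation-defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p is suitably aligned for int.
   Relies on the (implementation-defined) pointer-to-uintptr_t mapping
   being the plain address. */
int is_int_aligned(const void *p)
{
    return ((uintptr_t)p % _Alignof(int)) == 0;
}

/* Nonzero only if both buffers can safely be accessed as int. */
int both_int_aligned(const void *dst, const void *src)
{
    return is_int_aligned(dst) && is_int_aligned(src);
}
```

A copy routine could call both_int_aligned() once up front and only use the int loop when it returns nonzero.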
Depending upon the level of compiler optimization, the OP's code may actually perform 1 comparison, 2 multiplications, and 4 additions for each int loop iteration, and 1 comparison and 8 additions for each char loop iteration. Most simple implementations require only 1 comparison and 2 additions (by using pointers instead of array indexing).
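For comparison, a simple pointer-based byte loop looks roughly like the sketch below: one comparison and two pointer increments per iteration. The name copy_bytes is made up for illustration, and the buffers are assumed not to overlap.

```c
#include <stddef.h>

/* Hypothetical helper: byte-by-byte copy using pointers instead of
   array indexing. Assumes the buffers do not overlap. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const unsigned char *end = s + n;  /* computed once, outside the loop */

    while (s != end)                   /* 1 comparison per iteration */
        *d++ = *s++;                   /* 2 pointer increments per iteration */
}
```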
Even the idea of trying to optimize memcpy() raises flags. Have you done any testing to determine if memcpy() needs optimization? As others have indicated, memcpy() is usually highly optimized for standard usage while still handling corner cases such as misaligned copies (overlapping regions are memmove()'s job; the C standard leaves memcpy() on overlapping buffers undefined). For instance, several implementations which I have looked at actually copy entire cache lines (or more) when possible instead of going byte by byte or "int" by "int".
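If you do want to measure, a very rough timing harness could look like the sketch below. The names copy_fn and time_copy are hypothetical, clock() measures CPU time at coarse resolution, and an optimizing compiler may remove copies whose results are never used, so treat any numbers as indicative only.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Same signature as memcpy(), so memcpy itself or a replacement with a
   matching signature can be passed in. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical helper: returns the CPU seconds spent running fn reps times. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t n, int reps)
{
    clock_t start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(memcpy, dst, src, n, reps) and then the same call with the replacement, over a range of sizes and alignments, gives a first comparison before any claim of "faster" is made.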