memory transfer using SSE

Hi guys

I have a custom image copy system that I am working on that is multithreaded. As such, I don't think I dare use memcpy.

I would like to try SSE to speed things up, but I haven't been able to find a clear example of how one does this.

Any ideas?

Thanks
JB
Hello JB,

This sounds like a problem where there probably aren't any online examples.

Which SSE did you want to use? What's this got to do with the multithreadedness? Are you aware that SSE is a processor technology and has no real affinity to C++?

If you know the SSE instructions you want to use you can write them in assembly, wrap it in a C function, and link it into a C++ program. Some compilers let you add in-line assembly, but support for that is diminishing.

Or perhaps you want to know SSE instructions that could help? As far as I can tell SSE helps with calculations rather than copying, but is there something more sinister going on in your 'copies'?
Which SSE did you want to use? What's this got to do with the multithreadedness? Are you aware that SSE is a processor technology and has no real affinity to C++?


SSE2 or 3. Whichever is better. I have several threads doing the same bit transfer and I'm guessing that the memcpy function will not go down well with multithreading. And yes, I am aware that it is not C++...

Or perhaps you want to know SSE instructions that could help? As far as I can tell SSE helps with calculations rather than copying, but is there something more sinister going on in your 'copies'?


This. The pixel format I am using is 32-bit. If I can transfer a 128-bit block of 4 pixels at a time, it will speed things up quite a bit.
If you want to use C++, use intrinsics for it. They compile to SSE instructions and are declared in the *mmintrin.h headers (xmmintrin.h for SSE, emmintrin.h for SSE2, pmmintrin.h for SSE3).

http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics

These are all the SSE instructions:
http://softpixel.com/~cwright/programming/simd/sse.php
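
To make the intrinsics idea concrete, here is a minimal sketch of copying 32-bit pixels four at a time with SSE2. The function name `copy_pixels_sse2` is made up for illustration, and the sketch assumes the source and destination do not overlap. It uses unaligned loads and stores so it works for any pointer; whether it beats memcpy on a given compiler is something to measure, not assume.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_loadu_si128, _mm_storeu_si128
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the thread): copies `count` 32-bit pixels.
// Assumption: the src and dst regions do not overlap.
void copy_pixels_sse2(std::uint32_t* dst, const std::uint32_t* src, std::size_t count)
{
    std::size_t i = 0;

    // Main loop: one 128-bit load plus one 128-bit store moves four pixels.
    for (; i + 4 <= count; i += 4) {
        __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), block);
    }

    // Tail: copy the remaining 0 to 3 pixels one at a time.
    for (; i < count; ++i) {
        dst[i] = src[i];
    }
}
```

Each thread can call a function like this on its own rows of the image; nothing in it keeps shared state.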

You could ask on Stack Overflow about your case if you get stuck. I've had good help there, but you need to show you're trying.
I'm still new to SSE myself so I can't help much more than that.
Thanks for that. It looks like the movntps and pshufw instructions are closest to what I'm looking for. However, I am somewhat scared to try them without certainty... For those in the know, will the movntps instruction make the shift from source to dest without setting the source to 0?

The description for pshufw is a little vague to me though... Can anyone explain it in more detail?
However I am somewhat scared to try them without certainty..


I suggest you keep copies of your variables then, but I doubt the values in the xmm registers get destroyed unless an instruction is supposed to write its result to the same xmm register you used for input. In this case the values get sent to memory without polluting the cache.

Here's a site (see Streaming Store Instructions for movntps). You can use Google to find lots of info on SSE instructions, one by one.

http://www.songho.ca/misc/sse/sse.html
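
As a rough sketch of a streaming (non-temporal) store from C++: the intrinsic for MOVNTPS is `_mm_stream_ps`, but for integer pixel data the SSE2 counterpart `_mm_stream_si128` (MOVNTDQ) avoids casting to float, so that is what is used here. The function name `stream_copy_pixels` is made up, and the sketch assumes the destination is 16-byte aligned and the pixel count is a multiple of 4. The source is only read, so it is left untouched.

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128; also provides _mm_sfence
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the thread): copies pixels with non-temporal stores.
// Assumptions: dst is 16-byte aligned, count is a multiple of 4, regions do not overlap.
void stream_copy_pixels(std::uint32_t* dst, const std::uint32_t* src, std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(count % 4 == 0);

    for (std::size_t i = 0; i < count; i += 4) {
        // The load copies four pixels into a register; src itself is not modified.
        __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Non-temporal store: writes to memory with a hint to bypass the cache.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), block);
    }

    // Make the streamed data visible in order before other threads are told it is ready.
    _mm_sfence();
}
```

Streaming stores tend to pay off only for large buffers that will not be read again soon, so it is worth timing them against ordinary stores.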

Here's a link for pshufw:

http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/vc254.htm

Shuffles the words in mm2/m64 using the imm8 operand to select which of the four words in mm2/mem will be placed in each of the words in MM1. Bits 1 and 0 of imm8 encode the source for destination word 0 (MM1[15-0]), bits 3 and 2 encode for word 1, bits 5 and 4 encode for word 2, and bits 7 and 6 encode for word 3 (MM1[63-48]). Similarly, the two-bit encoding represents which source word is to be used, e.g., a binary encoding of 10 indicates that source word 2 (MM2/Mem[47-32]) will be used.

Copies words from the source operand (second operand) and inserts them in the destination operand (first operand) at word locations selected with the order operand (third operand). This operation is similar to the operation used by the PSHUFD instruction, which is illustrated in Figure 3-10. For the PSHUFW instruction, each 2-bit field in the order operand selects the contents of one word location in the destination operand. The encodings of the order operand fields select words from the source operand to be copied to the destination operand.

The source operand can be an MMX(TM) technology register or a 64-bit memory location. The destination operand is an MMX register. The order operand is an 8-bit immediate.

Note that this instruction permits a word in the source operand to be copied to more than one word location in the destination operand.

Operation

DEST[15-0] ← (SRC >> (ORDER[1-0] * 16) )[15-0]
DEST[31-16] ← (SRC >> (ORDER[3-2] * 16) )[15-0]
DEST[47-32] ← (SRC >> (ORDER[5-4] * 16) )[15-0]
DEST[63-48] ← (SRC >> (ORDER[7-6] * 16) )[15-0]

Intel® C++ Compiler Intrinsic Equivalent

PSHUFW __m64 _mm_shuffle_pi16(__m64 a, int n)


I know, their explanation isn't the greatest either, which is why trial and error is the best way to do this. Don't worry, your CPU will not explode using this stuff.
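
If the quoted description is still hard to follow, the "Operation" pseudocode can be written out as plain C++. This is only a scalar model for understanding what the order byte does, not the instruction itself; the name `pshufw_model` is made up, and the 64-bit integer stands in for an MMX register.

```cpp
#include <cstdint>

// Hypothetical scalar model of PSHUFW (for understanding only, not the real instruction).
// `src` holds four 16-bit words; `order` is the 8-bit immediate from the description.
std::uint64_t pshufw_model(std::uint64_t src, std::uint8_t order)
{
    std::uint64_t dest = 0;
    for (int word = 0; word < 4; ++word) {
        // Bits (2*word+1 .. 2*word) of `order` pick the source word for this destination word.
        unsigned selector = (order >> (2 * word)) & 0x3u;
        // Same as (SRC >> (ORDER[...] * 16))[15-0] in the quoted pseudocode.
        std::uint64_t value = (src >> (selector * 16)) & 0xFFFFu;
        dest |= value << (word * 16);
    }
    return dest;
}
```

With this model, an order byte of 0xE4 (binary 11 10 01 00) leaves the four words where they are, 0x1B (binary 00 01 10 11) reverses them, and 0x00 copies word 0 into all four positions. Note that pshufw rearranges 16-bit words inside a 64-bit register; it does not move data between memory locations.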
I have several threads doing the same bit transfer and I'm guessing that the memcpy function will not go down well with multithreading

What makes you think so, how would it be different with a hand-written loop, and what do you mean by "not go down well"??

The pixel format I am using is 32-bit. If I can transfer a 128-bit block of 4 pixels at a time, it will speed things up quite a bit.

Speed things up compared to what? How many bits at a time does std::copy/memcpy send on your compiler/platform? Is that even a bottleneck?

That said, do write and compare loops using different CPU instructions, it makes for a pleasant evening.

It looks like the movntps and pshufw instructions are closest to what I'm looking for

If MOVNTPS is something that works for you, check out VMOVNTPS too, if your CPU supports AVX (just don't mix them with SSE!). And don't forget to set up a baseline test with plain old MOVDQA.
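
A baseline along those lines might look like the sketch below. It uses aligned 128-bit loads and stores (`_mm_load_si128` / `_mm_store_si128`), which compilers typically emit as MOVDQA or the equivalent MOVAPS. Both function names are made up, and the sketch assumes 16-byte-aligned buffers and a pixel count that is a multiple of 4. The timing helper is deliberately crude: an optimizer may remove repeated copies whose result is never read, so treat its numbers with suspicion.

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical baseline copy using aligned 128-bit loads and stores.
// Assumptions: both pointers are 16-byte aligned, count is a multiple of 4.
void copy_pixels_movdqa(std::uint32_t* dst, const std::uint32_t* src, std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(count % 4 == 0);

    for (std::size_t i = 0; i < count; i += 4) {
        __m128i block = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_store_si128(reinterpret_cast<__m128i*>(dst + i), block);
    }
}

// Hypothetical timing helper: runs `copy` `repeats` times and returns
// the average wall-clock time per run in microseconds.
template <typename CopyFn>
double average_microseconds(CopyFn copy, std::uint32_t* dst, const std::uint32_t* src,
                            std::size_t count, int repeats)
{
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        copy(dst, src, count);
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Passing `std::memcpy` (with the count given in bytes) or a streaming-store version through the same helper gives numbers to compare on the actual image sizes and thread counts in use.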