well, that sure didn't work very well...

std::inner_product() works on iterators; if nyq_coeffs is a vector, the call should be:
*nyq_iout = inner_product( nyq_coeffs.begin(), nyq_coeffs.end(), nyq_delay_i.begin(), 0 ) ;
Oh, of course. I seem to always goof up iterator stuff. Fixed now.

Use of the inner_product did seem to have a pretty good effect on run time. The program is down to 103 seconds.

Now, regarding a couple of the other suggestions people made:

1. is it really going to make much difference if I change my ints from 32-bit to 16? Seems kind of hard to believe, but I'll do it if people think it will really help.
2. The SSE4.1 suggestion: is that just a matter of setting that compiler flag, or do I need to make code changes to take advantage of it?

> Use of the inner_product did seem to have a pretty good effect on run time.

That improvement comes from using iterators instead of subscript operators on the vector<>. If you rewrite the original code using iterators instead of subscripts, that would actually be slightly faster than the inner_product version. The GNU compiler has always had problems generating efficient code for iteration using subscript operators.
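For illustration, a minimal sketch of such a rewrite (my sketch; the vector names and the intSigGen typedef are assumed from the code posted in this thread):

intSigGen sum_i = 0 ;
vector<intSigGen>::const_iterator c = nyq_coeffs.begin() ;
vector<intSigGen>::const_iterator d = nyq_delay_i.begin() ;
for ( ; c != nyq_coeffs.end() ; ++c, ++d )
    sum_i += *c * *d ;   // same inner product, no operator[]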


> is it really going to make much difference if I change my ints from 32-bit to 16?

The more important question is: 'Is the performance with the code as of now acceptable?' If it is, let it be - performance is a constraint, not a goal.

If you do intend to play around with switching back and forth between 32-bit and 16-bit integers, use a typedef, or make the function a template. And verify that the range of a 16-bit integer is adequate for the program as a whole.
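For instance (my sketch; intSigGen is the typedef name used later in this thread, and <stdint.h> is assumed to provide the fixed-width types, as it does with GCC):

#include <stdint.h>

typedef int32_t intSigGen ;   // switch to int16_t here and re-measure

Every declaration then uses intSigGen, so the experiment becomes a one-line change.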


> The SSE4.1 suggestion: is that just a matter of setting that compiler flag,
> or do I need to make code changes to take advantage of it?

At the simplest level, it is just setting a compiler flag.

If the program also needs to run on processors that do not have native SSE4 support, this will turn out to be counterproductive; you would then need to compile multiple binaries with and without SSE4, check the processor architecture at runtime, and choose the right one. Do things of this kind only if the current performance is unacceptable.
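For GCC, the flag itself would be -msse4.1 on top of the usual -O2/-O3. As a hedged sketch of the runtime-selection idea, done per function rather than per binary (this assumes GCC's __builtin_cpu_supports built-in, available in newer releases; the function names are hypothetical):

typedef int (*dot_fn)( const int*, const int*, int ) ;

int dot_plain( const int* a, const int* b, int n ) ;   // built with: g++ -O2 -c dot_plain.cpp
int dot_sse41( const int* a, const int* b, int n ) ;   // built with: g++ -O2 -msse4.1 -c dot_sse41.cpp

dot_fn pick_dot()
{
    if ( __builtin_cpu_supports( "sse4.1" ) )   // query the CPU once at startup
        return dot_sse41 ;
    return dot_plain ;
}

Only the translation unit compiled with -msse4.1 may contain SSE4.1 code; the dispatcher itself must stay in a plain unit, or it could fault on older processors.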
ne555: thanks for the suggestions. Putting the insert after the pop did indeed shave off a few seconds.

Regarding the order of the elements in nyq_delay_i/q, your point is well taken. It probably doesn't matter in this application, as the vector only gets summed up, but I'd like to make it "correct." Unfortunately, I tried replacing:

iter = nyq_delay_i.begin();

with

iter = nyq_delay_i.end();

And that blew up (since I guess end() refers to the position one past the last element of the vector). So then I tried:

iter = nyq_delay_i.rbegin();

But the compiler objects to this (rbegin() returns a reverse_iterator, which won't convert to an iterator). Should I do something like this?

iter = nyq_delay_i.end() - 1;
It does matter. You are making an inner product.
end() is a valid position to insert, but I guess that you invalidate it after an erase.
v.erase( v.begin() ) ;       // drop the oldest sample
v.push_back( new_value ) ;   // append the newest (new_value: the incoming symbol)
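To make the ordering point concrete, a small self-contained demo (my addition, not from the original posts): the two update styles put the newest sample at opposite ends of the vector.

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> a( 3, 0 ), b( 3, 0 ) ;
    for ( int s = 1 ; s <= 3 ; ++s )
    {
        a.pop_back() ;  a.insert( a.begin(), s ) ;   // newest first: 3 2 1
        b.erase( b.begin() ) ;  b.push_back( s ) ;   // newest last:  1 2 3
    }
    std::printf( "%d %d %d / %d %d %d\n", a[0], a[1], a[2], b[0], b[1], b[2] ) ;
}

Whichever direction is chosen, the coefficient order has to match it.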
You're right...it does matter. The behavior of the vector implementation is consistent with the (original) array implementation; it was the circular-buffer version that was backwards. But since that doesn't seem to perform any faster, I won't be pursuing that route.

So...which compiler flags does it look like are appropriate for this? Any suggestions there?

Thanks.
1. is it really going to make much difference if I change my ints from 32-bit to 16? Seems kind of hard to believe, but I'll do it if people think it will really help.

You can process twice as many values per instruction. And you only need SSE2 for 16-bit values (_mm_mullo_epi16 is part of SSE2, _mm_mullo_epi32 is not). Unlike SSE4.1, SSE2 is almost universally available on desktop PCs.

2. The SSE4.1 suggestion: is that just a matter of setting that compiler flag, or do I need to make code changes to take advantage of it?

You need to use the appropriate SSE intrinsics. In theory, setting the compiler flag would do the trick; however, most compilers are currently not very good at vectorizing. It's more a matter of luck.

FYI, with this code my results are down to 149 ms, which is 6.4 times faster than the original.
However, since the sum is 16-bit too, it might overflow depending on what values the function is called with.
If that's a problem, more changes are needed (one possible fix is sketched after the code).

#include <emmintrin.h>   // SSE2 intrinsics
#include <numeric>       // std::accumulate
using std::accumulate;
void wci_nyq_filt ( intType i_symb,				// I symbol input
                    intType q_symb,				// Q symbol input
                    intType *nyq_coeffs,		// Nyquist coefficients for I and Q
                    intType coeff_len,			// Coefficient length
                    ContainerType nyq_delay_i,		// I delay line storage
                    ContainerType nyq_delay_q,		// Q delay line storage
                    intType *nyq_iout,	 		// I output from Nyquist filter
                    intType *nyq_qout) 			// Q output from Nyquist filter
{
  //Compute the next filter output
  const int valsPerXMM=16/sizeof(intType);
  const int coeff_sselen=coeff_len/valsPerXMM*valsPerXMM;

  __m128i* i_dq=(__m128i*)nyq_delay_i;
  __m128i* q_dq=(__m128i*)nyq_delay_q;
  __m128i* c_dq=(__m128i*)nyq_coeffs;
  __m128i dqsum_i=_mm_set1_epi8(0);
  __m128i dqsum_q=_mm_set1_epi8(0);
  for (int i = 0 ; i < coeff_sselen ; i+=valsPerXMM )
  {
    __m128i iv=_mm_loadu_si128(i_dq++);
    __m128i cv=_mm_loadu_si128(c_dq++);
    __m128i qv=_mm_loadu_si128(q_dq++);
    __m128i mi=_mm_mullo_epi16(iv,cv);
    __m128i mq=_mm_mullo_epi16(qv,cv);
    dqsum_i=_mm_add_epi16(dqsum_i,mi);
    dqsum_q=_mm_add_epi16(dqsum_q,mq);
  }

  intType sums_i[valsPerXMM];
  intType sums_q[valsPerXMM];
  _mm_storeu_si128((__m128i*)sums_i,dqsum_i);
  _mm_storeu_si128((__m128i*)sums_q,dqsum_q);

  intType sum_i = accumulate(sums_i,sums_i+valsPerXMM,intType(0));
  intType sum_q = accumulate(sums_q,sums_q+valsPerXMM,intType(0));

  //process rest without SSE
  for (int i = coeff_sselen ; i < coeff_len ; i++ )
  {
    sum_i += nyq_delay_i[i] * nyq_coeffs[i];
    sum_q += nyq_delay_q[i] * nyq_coeffs[i];
  }

  *nyq_iout = sum_i ;
  *nyq_qout = sum_q ;

  //put the new symbols in front (this assumes the caller slides the
  //delay-line pointers back by one element before the next call)
  nyq_delay_i[-1] = i_symb ;
  nyq_delay_q[-1] = q_symb ;
}
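If the 16-bit overflow mentioned above becomes a problem, one possible widening fix (my sketch, not from the original post) is SSE2's _mm_madd_epi16, which multiplies 16-bit lanes and sums adjacent products into 32-bit lanes, so the accumulator is much harder to overflow:

#include <emmintrin.h>

// Hypothetical helper: dot product of 16-bit arrays with 32-bit accumulation.
int dot16_widened( const short* data, const short* coeffs, int len )
{
    __m128i acc = _mm_setzero_si128() ;
    int i = 0 ;
    for ( ; i + 8 <= len ; i += 8 )
    {
        __m128i d = _mm_loadu_si128( (const __m128i*)( data + i ) ) ;
        __m128i c = _mm_loadu_si128( (const __m128i*)( coeffs + i ) ) ;
        // eight 16-bit products, summed pairwise into four 32-bit lanes
        acc = _mm_add_epi32( acc, _mm_madd_epi16( d, c ) ) ;
    }
    int lanes[4] ;
    _mm_storeu_si128( (__m128i*)lanes, acc ) ;
    int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3] ;
    for ( ; i < len ; ++i )   // scalar tail
        sum += data[i] * coeffs[i] ;
    return sum ;
}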
For this code, the performance with SSE is going to depend quite a lot on how large the value of coeff_len is. If it is small enough, and the data that is needed will fit into the CPU cache, SSE instructions will greatly improve performance.

If not, the trade off between memory access instructions (slow) and computation instructions (fast) comes into play. SSE requires data to be moved between memory and the SSE registers. SSE is typically contraindicated when a small number of computations have to be performed on a large data set that does not fit into the L1 data cache.
it was the circular-buffer version that was backwards. But since that doesn't seem to perform any faster, I won't be pursuing that route.

http://cplusplus.com/forum/general/63574/2/#msg345150 1<<24 iterations
boost::circular_buffer with -O2 for optimization took 71s (it fills in the same way)
Your implementation with vector took 126s
JL: this routine is called twice. In one call, coeff_len is 2048; in the other, it's 9. Guess that doesn't make it an easy decision, does it?
SSE requires data to be moved between memory and the SSE registers.

? The values are moved from memory to registers whether you use SSE or not.

Guess that doesn't make it an easy decision, does it?

The call with coeff_len=9 is irrelevant. With coeff_len=2048, the original takes 20.3 seconds for me, the SSE version 1.3 seconds.
> ? The values are moved from memory to registers whether you use SSE or not.

Yes. But, data pre-fetching (to reduce memory latency) is effective only when computations can be carried out without having to wait for a large amount of data to be available.
http://www.multicoreinfo.com/prefetching-multicore-processors/


> coeff_len is 2048;

Assuming that a 16-bit integer is adequate for the program, this would need an L1 data cache of about 16KB. This should be available on most modern processors (for example, the Q9300 has a 32K L1 data cache).
I just tried 16-bit integers...no good.

Right now, ne555's circular buffer is the fastest -- it's almost 4 times faster than the circular buffer technique I used. It's also about 20% faster than the original array processing, which (surprisingly to me) is in turn a little faster than my current effort with vectors and inner_product:

void wci_nyq_filt ( intSigGen i_symb,				// I symbol input
					intSigGen q_symb,				// Q symbol input
					vector<intSigGen> &nyq_coeffs,		// Nyquist coefficients for I and Q
					vector<intSigGen> &nyq_delay_i,		// I delay line storage
					vector<intSigGen> &nyq_delay_q,		// Q delay line storage
					intSigGen *nyq_iout,	 		// I output from Nyquist filter
					intSigGen *nyq_qout) 			// Q output from Nyquist filter
{
	vector<intSigGen>::iterator	iter;

//	Compute the next filter output

	*nyq_iout = inner_product(nyq_coeffs.begin(),
							  nyq_coeffs.end(),
							  nyq_delay_i.begin(),
							  0);
	*nyq_qout = inner_product(nyq_coeffs.begin(),
							  nyq_coeffs.end(),
							  nyq_delay_q.begin(),
							  0);

//	Shift the delay lines and put the next symbol in the delay lines

	nyq_delay_i.pop_back();
	iter = nyq_delay_i.begin();
	nyq_delay_i.insert(iter, i_symb);

	nyq_delay_q.pop_back();
	iter = nyq_delay_q.begin();
	nyq_delay_q.insert(iter, q_symb);
}


Someone suggested that using iterators instead of the inner_product stuff would be a little faster. I think I'd need two iterators, though, one for each array being multiplied, wouldn't I? And, wouldn't this slow things down a little due to the loop overhead?

EDIT: I just took a closer look at the output of ne555's routine when used in my program, and it doesn't match that of the other three. Currently, I guess the original implementation is the fastest working one. Well, this was a good exercise, anyway.
I'm not seeing that behaviour here; if you want, pass me your test data.
Check whether I'm doing something stupid (like going out of bounds).
I worked under the assumption that the queues start full, and that they are both the same size as the array (COEFF_LEN).*


... or check the other three, xP

Edit:
*it should say
The queues have the same capacity, that is the size of the array (COEFF_LEN)
I worked under the assumption that the queues start full


Ahh...no, not in all cases. This routine is called in two places. One begins with empty vectors (I changed them from arrays a while back).

, and that they are both the same size as the array.


Yes, that should be true.

... or check the other three, xP


Heh...normally, I'm the first to assume I'm the one with the mistake, but...the other techniques produce output that, when passed to another program, produces nice QPSK plots. That's pretty strong evidence those are working correctly.
JLBorges -

When you mentioned earlier replacing the inner_product with iterators, I tried this:

void wci_nyq_filt ( intSigGen i_symb,				// I symbol input
					intSigGen q_symb,				// Q symbol input
					vector<intSigGen> &nyq_coeffs,		// Nyquist coefficients for I and Q
					vector<intSigGen> &nyq_delay_i,		// I delay line storage
					vector<intSigGen> &nyq_delay_q,		// Q delay line storage
					intSigGen *nyq_iout,	 		// I output from Nyquist filter
					intSigGen *nyq_qout) 			// Q output from Nyquist filter
{
//	intSigGen i;
	intSigGen sumI, sumQ;
	vector<intSigGen>::iterator	iter;
	vector<intSigGen>::iterator	iterCoeffs;
	vector<intSigGen>::iterator	iterDelayI;
	vector<intSigGen>::iterator	iterDelayQ;

//	Compute the next filter output

	sumI = sumQ = 0;

	for (iterCoeffs = nyq_coeffs.begin(),
		 iterDelayI = nyq_delay_i.begin(),
		 iterDelayQ = nyq_delay_q.begin();
				iterCoeffs != nyq_coeffs.end();
					++iterCoeffs,
					++iterDelayI,
					++iterDelayQ)
	{
		sumI += *iterDelayI * *iterCoeffs;
		sumQ += *iterDelayQ * *iterCoeffs;
	}
	*nyq_iout = sumI;
	*nyq_qout = sumQ;

//	*nyq_iout = inner_product(nyq_coeffs.begin(),
//							  nyq_coeffs.end(),
//							  nyq_delay_i.begin(),
//							  0);
//	*nyq_qout = inner_product(nyq_coeffs.begin(),
//							  nyq_coeffs.end(),
//							  nyq_delay_q.begin(),
//							  0);

//	Shift the delay lines and put the next symbol in the delay lines

	nyq_delay_i.pop_back();
	iter = nyq_delay_i.begin();
	nyq_delay_i.insert(iter, i_symb);

	nyq_delay_q.pop_back();
	iter = nyq_delay_q.begin();
	nyq_delay_q.insert(iter, q_symb);
}


I'm sure there's a more efficient way to do this (somehow re-using the iterators). As it sits, it's about 50% slower than before. Any suggestions? Thanks...
On second thought, there should be no difference between an empty queue and one filled with 0.
As long as the capacity has the right value...
this routine is called twice. In one call, coeff_len is 2048; in the other, it's 9. Guess that doesn't make it an easy decision, does it?
Do you use other queues, or do you simply 'resize' them? boost::circular_buffer::resize does not work like std::vector::resize.
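A small illustration of that difference (my addition; it assumes Boost's documented circular_buffer interface):

#include <boost/circular_buffer.hpp>
#include <vector>

int main()
{
    std::vector<int> v( 9, 0 ) ;            // size 9: nine zeros, ready to read

    boost::circular_buffer<int> cb( 9 ) ;   // capacity 9, but size 0: empty
    cb.resize( 9, 0 ) ;                     // now it, too, holds nine zeros
}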
Dunno what else to try

Any suggestions?
You could parallelize it. (inner_product is 'easy' as the data is independent)
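A hedged sketch of what that could look like with OpenMP (my addition, not from the thread; compile with -fopenmp, and note the widened accumulators):

#include <vector>

// Both inner products in one parallel reduction; iterations are independent.
void dual_dot( const std::vector<int>& coeffs,
               const std::vector<int>& di, const std::vector<int>& dq,
               long long& out_i, long long& out_q )
{
    long long si = 0, sq = 0 ;
    #pragma omp parallel for reduction(+:si,sq)
    for ( int k = 0 ; k < (int)coeffs.size() ; ++k )
    {
        si += (long long)di[k] * coeffs[k] ;
        sq += (long long)dq[k] * coeffs[k] ;
    }
    out_i = si ;
    out_q = sq ;
}

For a 2048-tap filter, the per-call work is small, so thread start-up overhead could easily eat the gain; measure before committing to this.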
ne555:

Here's a copy of the entire program, and the two data files it needs to run:

http://www.mediafire.com/?fcm583e22flg1zm,wp9hnruy4ystugy,8rlltb0wugxdt47

You can replace my wci_nyq_filt() routine with yours, and verify that the output is different.

You might want to reduce the constant NUM_SYMBOLS in the program as well, just to get it to run a little faster.
I wrote:
and that they (the queues) are both the same size as the array.
mzimmers wrote:
Yes, that should be true.
     vector<intSigGen>   res_coeffs(9, 0) ;
//...
     vector <intSigGen>  res_delay_i(16, 0);
     vector <intSigGen>  res_delay_q(16, 0);
Liar.
After making them capacity 9, I get the same results.

By the way, you shouldn't

while ( file.good() )
  file >> var ;

but just

while ( file >> var )
  ;

(testing good() before the read means the loop body still runs once after the last extraction fails, leaving you with a bogus value).


Now seriously, is the time good yet?

off topic:
when did mediafire get so bloated?
Oh, my apologies. I was so focused on the large-array call of this routine that I forgot about the little resampler.

I'd like to do another speed test, using the current code with circular_buffers, then comparing it to your routine. Now that I've fixed this bug you found, is there any reason that the current routine shouldn't work?