well, that sure didn't work very well...

Thanks for the suggestion. I don't currently use boost. My last experience with it was rather unsuccessful. It took forever to build, used a ton of disk space, and I couldn't figure out how to use it.

This was a very interesting exercise, but barring a relatively simple solution, I'm probably going to leave it alone. (Or turn my attention to higher-level optimization, as JLB suggested.)
> What does the registers enclosed in parens mean?

Memory operands. For example, 28(%esp) is the value in memory at address esp + 28, and (%ecx,%eax,4) is the value at address ecx + eax*4.

You might be familiar with Intel assembly syntax (which hasn't fundamentally changed over the years). GCC emits assembly code in AT&T syntax, where the operand order is reversed and registers are prefixed with % (Intel's mov edx, DWORD PTR [esp+28] becomes movl 28(%esp), %edx): http://asm.sourceforge.net/articles/linasm.html#Syntax

What kind of assembly does this produce:

void foo( int32_t *a, int32_t *b, int32_t array_len,
           int* coeffs, int* x, int* y )
{
      int32_t local_a, local_b, local_coeef;
    *x = 0 ;
    *y = 0 ;

    for( int i = 0; i < array_len; i++ )
    {
          local_a = a[i]; local_b = b[i]; local_coeff = coeff[i];
        *x += local_a * local_coeff;
        *y += local_b * local_coeff;
    }
}


Also, what is the performance, time-wise?
With a typo fixed, and a using std::int32_t ; added at the top:

void foo_modified( int32_t *a, int32_t *b, int32_t array_len,
                    int* coeffs, int* x, int* y )
{
      int32_t local_a, local_b, local_coeff;
    *x = 0 ;
    *y = 0 ;

    for( int i = 0; i < array_len; i++ )
    {
          local_a = a[i]; local_b = b[i]; local_coeff = coeffs[i];
        *x += local_a * local_coeff;
        *y += local_b * local_coeff;
    }
}


The assembly generated is:
__Z12foo_modifiedPiS_iS_S_S_:
	pushl	%ebp
	xorl	%eax, %eax
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	28(%esp), %edx
	movl	36(%esp), %esi
	movl	40(%esp), %ebx
	movl	20(%esp), %ebp
	testl	%edx, %edx
	movl	$0, (%esi)
	movl	$0, (%ebx)
	jle	L24
	.p2align 4,,7
L28:
	movl	24(%esp), %ecx
	movl	(%ecx,%eax,4), %edx
	movl	32(%esp), %ecx
	movl	(%ecx,%eax,4), %edi
	movl	0(%ebp,%eax,4), %ecx
	addl	$1, %eax
	imull	%edi, %edx
	imull	%edi, %ecx
	addl	%ecx, (%esi)
	addl	%edx, (%ebx)
	cmpl	28(%esp), %eax
	jne	L28
L24:
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret


This code (without the temporaries):

void foo_modified_once_more( int32_t *a, int32_t *b, int32_t array_len,
                              int* coeffs, int* x, int* y )
{
    *x = 0 ;
    *y = 0 ;

    for( int i = 0; i < array_len; i++ )
    {
        *x += a[i] * coeffs[i] ;
        *y += b[i] * coeffs[i] ;
    }
}


generated:

__Z22foo_modified_once_morePiS_iS_S_S_:
	pushl	%ebp
	xorl	%eax, %eax
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	28(%esp), %edx
	movl	36(%esp), %ebx
	movl	40(%esp), %ecx
	movl	20(%esp), %ebp
	testl	%edx, %edx
	movl	24(%esp), %edi
	movl	$0, (%ebx)
	movl	32(%esp), %esi
	movl	$0, (%ecx)
	jle	L31
	.p2align 4,,7
L35:
	movl	0(%ebp,%eax,4), %edx
	imull	(%esi,%eax,4), %edx
	addl	%edx, (%ebx)
	movl	(%edi,%eax,4), %edx
	imull	(%esi,%eax,4), %edx
	addl	$1, %eax
	addl	%edx, (%ecx)
	cmpl	28(%esp), %eax
	jne	L35
L31:
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret


As you can see, the optimizer already knows all about loop invariants and constant folding.
ok, thanks for indulging me.
If anyone's interested, the compiler flags for the release build are:

-O2 -frtti -fexceptions -mthreads -Wall -DUNICODE -DQT_LARGEFILE_SUPPORT

I have no idea what any of those mean...
-D passes defines to the compile process. -DUNICODE defines UNICODE throughout your source, so any pre-processor block like:

#ifdef UNICODE
  compile this in
#endif 


will get processed and compiled in. Same for QT_LARGEFILE_SUPPORT.

I suspect the -mthreads flag makes the compiler do its part to make the exe capable of supporting multi-threading.

-W tells the compiler which warnings to report during the compile process; in this case, -Wall means you're going to see all compiler warnings.

No idea on the -f flag.
RTFM
man gcc wrote:
-O is optimizing
-O2: GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff.

Options of the form -fflag specify machine-independent flags. Most flags have both positive and negative forms; the negative form of -ffoo would be -fno-foo
-frtti: run-time type information. Used by dynamic_cast and typeid
-fexceptions: Enable exception handling. Generates extra code needed to propagate exceptions.

-mmachine-option
-mthreads: Support thread-safe exception handling on Mingw32.


For all you care, you are compiling with optimizations.
You could play with -O3 or -Ofast (careful)
Constantly updating an outside variable via pointer is a horrible idea when it can be avoided.
Replace this part:
*nyq_iout += ...;
*nyq_qout += ...;

with a loop that sums up those values locally and then moves the result to the output variables.
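Something along these lines. A minimal sketch reusing the names from the thread; the delay lines and coefficients are assumed here to be plain arrays:

#include <cstdint>

// Sketch only: accumulate into locals, then write each output once at the end.
void nyq_sums( const std::int32_t* nyq_delay_i, const std::int32_t* nyq_delay_q,
               const std::int32_t* nyq_coeffs, std::int32_t coeff_len,
               std::int32_t* nyq_iout, std::int32_t* nyq_qout )
{
    std::int32_t sumI = 0, sumQ = 0;          // locals can stay in registers
    for( std::int32_t i = 0; i < coeff_len; ++i )
    {
        sumI += nyq_delay_i[i] * nyq_coeffs[i];
        sumQ += nyq_delay_q[i] * nyq_coeffs[i];
    }
    *nyq_iout = sumI;                         // one store per output, not one per iteration
    *nyq_qout = sumQ;
}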
I made gradual changes to the code and tested its speed each time.
All with gcc 4.6, coeff_len=100, 3m iterations.

original -Os:                                          1041 ms
original -O3:                                           950 ms
fixed shift:                                            905 ms
vector:                                                 914 ms
vector, fixed summation:                                677 ms
vector, insert+pop_back:                                934 ms
vector, revert previous, use iterators for summation:   613 ms
deque, insert+pop_back:                                 772 ms
list:                                                   961 ms
array, fixed summation:                                 586 ms
array, shifting in separate loops:                      560 ms
array, moving buffer (double capacity):                 358 ms
array, -funroll-loops:                                  344 ms
array, std::inner_product:                              382 ms

"fixed summation" refers to what I mentioned above; "fixed shift" refers to replacing the awkward calculation of j in the shift code with a plain backwards traversal of the array; and "moving buffer" means changing the starting offset each iteration so that the shifting becomes unnecessary (similar to a circular buffer).

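The "moving buffer" can be pictured roughly like this. A hypothetical sketch, not the actual benchmark code; the tap count is hard-coded for brevity:

#include <cstdint>
#include <cstring>

// Newest sample goes to the front by moving a start offset;
// the data is only relocated once every N pushes.
struct moving_delay_line
{
    static const int N = 100;           // number of taps (coeff_len above)
    std::int32_t data[2 * N];           // double-capacity backing store
    int start;                          // window is data[start] .. data[start + N - 1]

    moving_delay_line() : start(N) { std::memset( data, 0, sizeof data ); }

    void push_front( std::int32_t v )
    {
        if( start == 0 )                // hit the left edge: copy the window back to the right half
        {
            std::memcpy( data + N, data, N * sizeof data[0] );
            start = N;
        }
        data[--start] = v;              // common case: one decrement and one store, no shifting
    }

    const std::int32_t* taps() const { return data + start; }  // contiguous view for the dot product
};

The dot product then runs over taps() exactly as before; only once every N pushes does anything get copied.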
Edit: you can also easily halve the 344 ms time by using SSE 4.1. If you can use int16 or floats instead of int32, you could do even better (and, more importantly, it would only need SSE2 and SSE1, respectively).
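For what it's worth, a minimal SSE 4.1 sketch of one of the two dot products. This is only an illustration of the idea, not the benchmark code; it assumes the length is a multiple of 4 and needs to be compiled with -msse4.1:

#include <emmintrin.h>   // SSE2: loads, stores, 32-bit adds
#include <smmintrin.h>   // SSE4.1: _mm_mullo_epi32
#include <cstdint>

// Sum of a[i]*c[i]; alignment and tail handling are omitted.
static std::int32_t dot_sse41( const std::int32_t* a, const std::int32_t* c, int len )
{
    __m128i acc = _mm_setzero_si128();
    for( int i = 0; i < len; i += 4 )
    {
        __m128i va = _mm_loadu_si128( reinterpret_cast<const __m128i*>( a + i ) );
        __m128i vc = _mm_loadu_si128( reinterpret_cast<const __m128i*>( c + i ) );
        acc = _mm_add_epi32( acc, _mm_mullo_epi32( va, vc ) );   // four products per step
    }
    std::int32_t partial[4];
    _mm_storeu_si128( reinterpret_cast<__m128i*>( partial ), acc );
    return partial[0] + partial[1] + partial[2] + partial[3];    // horizontal sum
}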
Athar -

Thank you for the very detailed analysis. I can't say that I understand all of it, but it's been helpful nonetheless.

Interestingly enough, I implemented boost circular_buffers and found that the program is about 25% slower than with my improved vector implementation.

Here's the vector logic:

void wci_nyq_filt ( int32_t i_symb,				// I symbol input
					int32_t q_symb,				// Q symbol input
					int32_t *nyq_coeffs,		// Nyquist coefficients for I and Q
					int32_t coeff_len,			// Coefficient length
					vector<int32_t> &nyq_delay_i,		// I delay line storage
					vector<int32_t> &nyq_delay_q,		// Q delay line storage
					int32_t *nyq_iout,	 		// I output from Nyquist filter
					int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t	i;
	int32_t	sumI, sumQ;
	vector<int32_t>::iterator	iter;

	*nyq_iout = 0 ;
	*nyq_qout = 0 ;
	sumI = sumQ = 0;

//	Compute the next filter output

	for (i = 0 ; i < coeff_len ; i++ ) {

		sumI += nyq_delay_i[i] * nyq_coeffs[i] ;
		sumQ += nyq_delay_q[i] * nyq_coeffs[i] ;
	}
	*nyq_iout = sumI;
	*nyq_qout = sumQ;

//	Shift the delay lines and put the next symbol in the delay lines

	iter = nyq_delay_i.begin();
	nyq_delay_i.insert(iter, i_symb);
	nyq_delay_i.pop_back();

	iter = nyq_delay_q.begin();
	nyq_delay_q.insert(iter, q_symb);
	nyq_delay_q.pop_back();
//	nyq_delay_i.push_back(i_symb);
//	nyq_delay_q.push_back(q_symb);
}


And here's the implementation with circular_buffer:

void wci_nyq_filt ( int32_t i_symb,				// I symbol input
					int32_t q_symb,				// Q symbol input
					int32_t *nyq_coeffs,		// Nyquist coefficients for I and Q
					int32_t coeff_len,			// Coefficient length
					boost::circular_buffer<int32_t> &nyq_delay_i,		// I delay line storage
					boost::circular_buffer<int32_t> &nyq_delay_q,		// Q delay line storage
					int32_t *nyq_iout,	 		// I output from Nyquist filter
					int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t	i;
	int32_t	sumI, sumQ;
	boost::circular_buffer<int32_t>::iterator	iter;

	*nyq_iout = 0 ;
	*nyq_qout = 0 ;
	sumI = sumQ = 0;

//	Compute the next filter output

	for (i = 0 ; i < coeff_len ; i++ ) {

		sumI += nyq_delay_i[i] * nyq_coeffs[i] ;
		sumQ += nyq_delay_q[i] * nyq_coeffs[i] ;
	}
	*nyq_iout = sumI;
	*nyq_qout = sumQ;

//	Shift the delay lines and put the next symbol in the delay lines

//	iter = nyq_delay_i.begin();
//	nyq_delay_i.insert(iter, i_symb);
//	nyq_delay_i.pop_back();

//	iter = nyq_delay_q.begin();
//	nyq_delay_q.insert(iter, q_symb);
//	nyq_delay_q.pop_back();
	nyq_delay_i.push_back(i_symb);
	nyq_delay_q.push_back(q_symb);
}


Is there something inefficient in how I'm using the circular_buffers?

I'll try experimenting with 16-bit ints to see how much better that fares.
circular_buffer::operator[] will do an addition and a modulo operation. That introduces overhead.
In order to avoid that, use the array_one()/array_two() methods, which give you a pair<pointer, size>.
So you will need 2 loops now.

I've got ~1 second of improvement in 1<<20 iterations

By the way, your queues are different: the vector version puts the newest sample at the front, while push_back on the circular_buffer puts it at the back, so the samples line up with the coefficients in opposite order.
vector
0 0 0
1 0 0
2 1 0 
3 2 1

circular_buffer
0 0 0
0 0 1
0 1 2
1 2 3


Almost forgot: as COEFF_LEN is a power of 2, you are wasting memory with your vectors. Check their capacity; it will be double their size, because the insert momentarily grows them to COEFF_LEN + 1 elements before the pop_back shrinks them again.
iter = nyq_delay_i.begin();
nyq_delay_i.insert(iter, i_symb); // the culprit. Put it after the pop_back
nyq_delay_i.pop_back();
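In other words, roughly this (a sketch of the reordering the comment suggests; the vector then never grows past COEFF_LEN, so its capacity never doubles):

#include <vector>
#include <cstdint>

// Pop first, then insert, so the size never exceeds COEFF_LEN.
void shift_in( std::vector<std::int32_t>& delay, std::int32_t symb )
{
    delay.pop_back();                       // size: COEFF_LEN -> COEFF_LEN - 1
    delay.insert( delay.begin(), symb );    // size: back to COEFF_LEN, within existing capacity
}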
Haven't timed it (just looked at the assembly), but these six lines seem to be about as fast as anything else:
#include <vector>
#include <numeric>
#include <cstdint>

// --std=c++0x -Wall -Wextra -Werror --pedantic -c -S -O3 -fomit-frame-pointer -march=core2

void baz( std::int32_t i_symb, std::int32_t q_symb,
           std::int32_t* nyq_coeffs,
           std::vector<std::int32_t>& nyq_delay_i,
           std::vector<std::int32_t>& nyq_delay_q,
           std::int32_t* nyq_iout, std::int32_t* nyq_qout )
{
    // compute filter I & Q outputs
    *nyq_iout = std::inner_product( nyq_delay_i.begin(), nyq_delay_i.end(), nyq_coeffs, 0 ) ;
    *nyq_qout = std::inner_product( nyq_delay_q.begin(), nyq_delay_q.end(), nyq_coeffs, 0 ) ;

    // shift delay lines (internally calls _memmove)
    std::copy( nyq_delay_i.begin(), nyq_delay_i.end()-1, nyq_delay_i.begin()+1 ) ;
    nyq_delay_i.front() = i_symb ;
    std::copy( nyq_delay_q.begin(), nyq_delay_q.end()-1, nyq_delay_q.begin()+1 ) ;
    nyq_delay_q.front() = q_symb ;
}
OutputIterator copy ( InputIterator first, InputIterator last, OutputIterator result );
result: Output iterator to the initial position in the destination sequence. This shall not point to any element in the range [first,last).


It is noticeably worse.
Yours:      ~10 seconds
mzimmers:    ~7.7 seconds
boost -O3:    7.5 seconds
boost -O2:    4.4 seconds ¡¿?!

1<<20 iterations, COEFF_LEN=2048
nyq_coeffs random initialization
{i,q}_symb random generation
> OutputIterator copy ( InputIterator first, InputIterator last, OutputIterator result );
> result Output iterator to the initial position in the destination sequence.
> This shall not point to any element in the range [first,last).

Yup, thanks. Should be a std::uninitialized_copy()


> Yours ~10 seconds

With -march=core2?


> boost -O3 7.5
> boost -O2 4.4 ¡¿?!

-O3 taking 70% more time than -O2! I just can't believe that those results are with identical test data-sets for both cases, with the tests being run under identical conditions.


#include <vector>
#include <iostream>
#include <boost/circular_buffer.hpp>

void wci_nyq_filt_boost ( int32_t i_symb,				// I symbol input
		int32_t q_symb,				// Q symbol input
		int32_t *nyq_coeffs,		// Nyquist coefficients for I and Q
		//int32_t coeff_len,			// Coefficient length
		boost::circular_buffer<int32_t> &nyq_delay_i,		// I delay line storage
		boost::circular_buffer<int32_t> &nyq_delay_q,		// Q delay line storage
		int32_t *nyq_iout,	 		// I output from Nyquist filter
		int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t sumI, sumQ;
	boost::circular_buffer<int32_t>::iterator	iter;

	*nyq_iout = 0 ;
	*nyq_qout = 0 ;
	sumI = sumQ = 0;

	int32_t 
		*arr_i = nyq_delay_i.array_one().first,
		*arr_q = nyq_delay_q.array_one().first;
	size_t size = nyq_delay_i.array_one().second;

	//Compute the next filter output
	size_t K;
	for (K=0 ; K < size ; K++ ) {
		sumI += arr_i[K] * nyq_coeffs[K];
		sumQ += arr_q[K] * nyq_coeffs[K];
	}
	arr_i = nyq_delay_i.array_two().first;
	arr_q = nyq_delay_q.array_two().first;
	size = nyq_delay_i.array_two().second;

	for (size_t L=0; L < size; L++,K++ ) {
		sumI += arr_i[L] * nyq_coeffs[K];
		sumQ += arr_q[L] * nyq_coeffs[K];
	}

	*nyq_iout = sumI;
	*nyq_qout = sumQ;

	nyq_delay_i.push_front(i_symb);
	nyq_delay_q.push_front(q_symb);
}
//the other functions

int main(int argc, char **argv){
	if(argc<2) return 1;
	enum algorithm{vector, circ_buffer, foo}; //to avoid recompilation
	algorithm choice;
	if( std::string(argv[1]) == "vector" ) choice=vector;
	else if( std::string(argv[1]) == "circ_buffer" ) choice=circ_buffer;
	else choice=foo;

	std::srand(42); //the same random data, always

	const size_t COEFF_LEN = 256*8;
	std::vector<int32_t> vec_delay_i(COEFF_LEN,0), vec_delay_q(COEFF_LEN,0);
	boost::circular_buffer<int32_t> bcb_delay_i(COEFF_LEN, COEFF_LEN, 0), bcb_delay_q(COEFF_LEN, COEFF_LEN, 0);

	int32_t nyq_coeffs[COEFF_LEN], nyq_out_i, nyq_out_q;

	for(size_t K=0; K<COEFF_LEN; ++K) //random initialization
		nyq_coeffs[K] = rand()%100;
	
	for(size_t K=0; K<1<<20; ++K){
		int32_t i_symb = rand()%1000, q_symb = rand()%1000; //random generation
		switch(choice){
		case vector:
			wci_nyq_filt_vector(i_symb, q_symb, 
				nyq_coeffs, COEFF_LEN,
				vec_delay_i, vec_delay_q, 
				&nyq_out_i, &nyq_out_q);
			break;
		case circ_buffer:
			wci_nyq_filt_boost(i_symb, q_symb,
				nyq_coeffs, //COEFF_LEN,
				bcb_delay_i, bcb_delay_q, 
				&nyq_out_i, &nyq_out_q);
			break;
		default:
			baz(i_symb, q_symb, 
				nyq_coeffs, 
				vec_delay_i, vec_delay_q, 
				&nyq_out_i, &nyq_out_q);
		}
	}

	std::cout << "Results " << nyq_out_i << ' ' << nyq_out_q << '\n'; 
	return 0;
}


$ g++ --std=c++0x -Wall -Wextra -Werror --pedantic -O3 -fomit-frame-pointer -march=core2 -o o3.out
$ g++ --std=c++0x -Wall -Wextra -Werror --pedantic -O2 -fomit-frame-pointer -march=core2 -o o2.out

Results above.
> std::srand(42); //the same random data, always
> ...
> $ g++ --std=c++0x -Wall -Wextra -Werror --pedantic -O3 -fomit-frame-pointer -march=core2 -o o3.out
> $ g++ --std=c++0x -Wall -Wextra -Werror --pedantic -O2 -fomit-frame-pointer -march=core2 -o o2.out

> Results above.

Hmm... So one lives and one learns.

Looked into the generated assembly; and the results were interesting:

Without -march=core2, -O3 and -O2 generate almost identical non-SSE code.

With it, there is no significant change for -O2. It seems to just ignore SSE instructions.

-O3 generates all those 64-bit instructions (PMULUDQ and friends), with an extra 200 lines of assembly. This would certainly cause a performance hit when run on hardware without the extended instruction set (at runtime the code would call x86 builtin functions like v1di __builtin_ia32_pmuludq (v2si, v2si)) - but on an actual Core2 or higher, aren't these instructions supposed to be faster?

This has turned into a pretty good discussion...I'm learning more than I'd counted on. I wish I'd been more scientific with my changes and measurements, but at least I know I'm moving in the right direction.

On the subject of inner_product: I implemented that on my Mac, and it compiled OK. So, I moved the file over to the PC, and I now get these errors:

In file included from c:\mingw64\bin\../lib/gcc/x86_64-w64-mingw32/4.6.1/include/c++/numeric:62:0,
from ..\HittiteSigGen\hittitesiggen_vectors.cpp:44:
c:\mingw64\bin\../lib/gcc/x86_64-w64-mingw32/4.6.1/include/c++/bits/stl_numeric.h: In function '_Tp std::inner_product(_InputIterator1, _InputIterator1, _InputIterator2, _Tp) [with _InputIterator1 = int*, _InputIterator2 = std::vector<int>, _Tp = int]':
..\HittiteSigGen\hittitesiggen_vectors.cpp:415:11: instantiated from here
c:\mingw64\bin\../lib/gcc/x86_64-w64-mingw32/4.6.1/include/c++/bits/stl_numeric.h:183:7: error: no match for 'operator++' in '++__first2'
c:\mingw64\bin\../lib/gcc/x86_64-w64-mingw32/4.6.1/include/c++/bits/stl_numeric.h:184:2: error: no match for 'operator*' in '*__first2'


Any ideas what could be causing this, and why it appears only on the Windows machine? Could it have something to do with the fact that I'm mixing vectors with arrays?
> Could it have something to do with the fact that I'm mixing vectors with arrays?

No. A pointer to an element in an array is an iterator.

What are you passing as the third parameter to std::inner_product()? (It should be an iterator to the beginning of the second sequence.)
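For reference, a call that mixes a coefficient array with a vector delay line would look something like this (a sketch with made-up names, not the code from the thread):

#include <numeric>
#include <vector>
#include <cstdint>

// The third argument is an iterator into the second sequence
// (here, the vector's begin()), not the container itself.
std::int32_t dot( const std::int32_t* coeffs, std::int32_t coeff_len,
                  const std::vector<std::int32_t>& delay )
{
    return std::inner_product( coeffs, coeffs + coeff_len, delay.begin(), 0 );
}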
First of all, I lied: it is indeed happening on both systems. That's actually a good thing, I think.

I also went ahead and converted the array to a vector.

Here's the first part of the routine:

void wci_nyq_filt ( int32_t i_symb,				// I symbol input
					int32_t q_symb,				// Q symbol input
					vector<int32_t> &nyq_coeffs,	// Nyquist coefficients for I and Q
					int32_t coeff_len,			// Coefficient length
					vector<int32_t> &nyq_delay_i,		// I delay line storage
					vector<int32_t> &nyq_delay_q,		// Q delay line storage
					int32_t *nyq_iout,	 		// I output from Nyquist filter
					int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t	i;
	int32_t	sumI, sumQ;
	vector<int32_t>::iterator	iter;

	*nyq_iout = 0 ;
	*nyq_qout = 0 ;
	sumI = sumQ = 0;

//	Compute the next filter output

	*nyq_iout = inner_product(nyq_coeffs,
							  nyq_coeffs + coeff_len,
							  nyq_delay_i,
							  0);
	*nyq_qout = inner_product(nyq_coeffs,
							  nyq_coeffs + coeff_len,
							  nyq_delay_q,
							  0);