well, that sure didn't work very well...

...I'm maintaining a program written by someone else. It's highly compute-intensive and fairly slow. I found this routine:

void wci_nyq_filt ( int32_t i_symb,				// I symbol input
					int32_t q_symb,				// Q symbol input
					int32_t *nyq_coeffs,		// Nyquist coefficients for I and Q
					int32_t coeff_len,			// Coefficient length
					int32_t *nyq_delay_i,		// I delay line storage
					int32_t *nyq_delay_q,		// Q delay line storage
					int32_t *nyq_iout,	 		// I output from Nyquist filter
					int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t	i, j ;

	*nyq_iout = 0 ;
	*nyq_qout = 0 ;

	//	Compute the next filter output

	for (i = 0 ; i < coeff_len ; i++ ) {

		*nyq_iout += nyq_delay_i[i] * nyq_coeffs[i] ;
		*nyq_qout += nyq_delay_q[i] * nyq_coeffs[i] ;

		//		printf ("I = %ld Nyq_Iout = %8ld Nyq_Idelay = %8ld, Nyq_Coeff = %8ld\n",
		//			     i, *nyq_iout, nyq_delay_i[i], nyq_coeffs[i] ) ;
	}

	//	Shift the delay lines and put the next symbol in the delay lines

	for (i = 1 ; i < coeff_len ; i++ ) {

		j = coeff_len - i ;
		nyq_delay_i[j] = nyq_delay_i[j-1] ;
		nyq_delay_q[j] = nyq_delay_q[j-1] ;
	}

	nyq_delay_i[0] = i_symb ;
	nyq_delay_q[0] = q_symb ;
}


and thought that it might be sped up with vectors, thereby eliminating the shift in the array. I changed the routine to:

void wci_nyq_filt ( int32_t i_symb,				// I symbol input
					int32_t q_symb,				// Q symbol input
					int32_t *nyq_coeffs,		// Nyquist coefficients for I and Q
					vector<int32_t> &nyq_delay_i,		// I delay line storage
					vector<int32_t> &nyq_delay_q,		// Q delay line storage
					int32_t *nyq_iout,	 		// I output from Nyquist filter
					int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t	i;
	vector<int32_t>::iterator	iter;
	*nyq_iout = 0 ;
	*nyq_qout = 0 ;

	//	Compute the next filter output
	i = 0;
	for (iter = nyq_delay_i.begin(); iter != nyq_delay_i.end() ; ++iter )
	{
		*nyq_iout += nyq_delay_i[i] * nyq_coeffs[i] ;
		*nyq_qout += nyq_delay_q[i] * nyq_coeffs[i] ;
		i++;
	}


Not only did it not speed things up, it slowed the program down by more than a factor of 2.

Any ideas why?
I assume that i_symb & q_symb are still being added to the fronts of vectors nyq_delay_i & nyq_delay_q respectively and the backs of these vectors are also being popped in order to emulate the code in your first block.

If so, then I think a shift (copying the elements) may still be occurring under the bonnet of the vector class (it seems so for my implementation). That would negate the performance gain you were hoping for by avoiding that step manually.

Try changing your vectors to lists; inserting at the front and popping off the back of a list should not suffer the same shift-copy as the vectors do.
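A minimal sketch of that delay-line update with a std::list (the helper name push_symbol is my own, not from the original code):

```cpp
#include <cstddef>
#include <cstdint>
#include <list>

// Hypothetical sketch: replace the manual shift loop with push_front/pop_back
// on a std::list, keeping the delay line at a fixed length.
void push_symbol(std::list<int32_t>& delay, int32_t symb, std::size_t coeff_len)
{
    delay.push_front(symb);       // newest sample goes to the front
    if (delay.size() > coeff_len)
        delay.pop_back();         // drop the oldest sample off the back
}
```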

Also change your for-loop for computing next filter output to:

	vector<int32_t>::iterator iter_q = nyq_delay_q.begin() ;
	for (iter = nyq_delay_i.begin(); iter != nyq_delay_i.end() ; ++iter, ++iter_q )
	{
		*nyq_iout += *iter   * nyq_coeffs[i] ;
		*nyq_qout += *iter_q * nyq_coeffs[i] ;
		i++;
	}


I think dereferencing the iterator may be much faster than accessing an item via the container's [] operator, especially for the list class.

Does this help with the performance?
> Any ideas why?

Compiler optimizations (in particular inlining) may not have been enabled.

You might be using Microsoft's implementation with checked iterators.

We can only guess.
I agree with the idea of using a list instead of an array-like structure, if random access isn't required elsewhere. A circular buffer might also be an alternative, depending on how large coeff_len is.
A list? I would expect that a std::list<> would be measurably slower than a std::vector<>. A std::deque<> would do quite nicely.

Of course, a circular buffer would give the best performance.
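For illustration, a minimal hand-rolled circular-buffer delay line might look like this (the DelayLine struct is my own sketch, not code from the thread): instead of shifting coeff_len elements per sample, it advances a head index and overwrites one slot.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical circular-buffer delay line: pushing a sample is O(1),
// and operator[] gives the i-th most recent sample (0 = newest),
// matching the nyq_delay[i] indexing of the original routine.
struct DelayLine {
    std::vector<int32_t> buf;
    std::size_t head = 0;  // index of the newest sample

    explicit DelayLine(std::size_t len) : buf(len, 0) {}

    void push(int32_t symb) {
        head = (head + buf.size() - 1) % buf.size();  // step back, wrapping
        buf[head] = symb;                             // overwrite oldest slot
    }

    int32_t operator[](std::size_t i) const {
        return buf[(head + i) % buf.size()];
    }
};
```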




But OP is only accessing the array in order from beginning to end, in the snippet he posted. A vector offers no advantage over a list if this is the only access pattern being used. On the other hand, the shifting loop is a massive waste of time.
> A vector offers no advantage over a list if this is the only access pattern being used.

A vector offers a massive advantage over a list - locality of data, resulting in cache hits rather than cache misses during iteration from beginning to end. In typical implementations, a vector comprehensively out-performs a list for sequential access.
OK, I just did a bit more investigation. I tried arrays, vectors, deques and lists. I also realized that I was running programs built in debug mode, so I switched it to release mode. Here are the results:

arrays: 145 seconds to complete
vectors: 216
lists: 374
deques: 395

So, running in release mode clearly closed the gap between arrays and vectors, but arrays still have a big edge. I also implemented the loop modification suggested above by SIK; that shaved one second off the time (hey, better than nothing).

I'm not sure what my compiler options are set to: I'm using Qt Creator, and those are buried somewhere I can't find.

I did notice, though, that I'm using the 32-bit version of MinGW. I'm looking into the 64-bit version, but I may have to build Qt from source to use it...I'd rather not hassle with that.

Anyway, the results are interesting...I really didn't expect such a hit from the vectors, let alone an incremental hit from the others.
I presume that for lists and deques, you replaced the Shift the delay lines loop with a push_front/pop_back sequence.
Oh, crud...I just realized that in my original post, I left out what I'd replaced the shifts with:

	iter = nyq_delay_i.begin();
	nyq_delay_i.insert(iter, i_symb);
	if (nyq_delay_i.size() > (unsigned) COEFF_LEN)
		nyq_delay_i.pop_back();

	iter = nyq_delay_q.begin();
	nyq_delay_q.insert(iter, q_symb);
	if (nyq_delay_q.size() > (unsigned) COEFF_LEN)
		nyq_delay_q.pop_back();


Sorry about that. This code is identical for vectors, deques and lists.
Hmm... I'm quite disappointed with the deque performance; didn't expect anything good from the list anyway; and unsurprised that the vector with subscript operator is slower than an array while using GCC.

I had done this a few months back (GCC 4.6, IIRC):
#include <vector>
#include <algorithm>

void one( std::vector<int>& seq ) { for( auto& i : seq ) ++i ; }

void two( std::vector<int>& seq )
{ for( auto iter = seq.begin() ; iter != seq.end() ; ++iter ) ++*iter ; }

void three( std::vector<int>& seq )
{ for( std::vector<int>::size_type i = 0 ; i < seq.size() ; ++i ) ++seq[i] ; }

void four( std::vector<int>& seq )
{ std::for_each( seq.begin(), seq.end(), []( int& i ) { ++i ; } ) ; }


It is clearly a quality of implementation issue; but one could expect the generated code to be almost identical for all four. Well not quite.

The GNU compiler generates identical code for one, two and four, but trips up on three (array subscript).
#include <vector>
#include <algorithm>

/**********************************************************************
>g++ --std=c++0x -Wall -Wextra -Werror --pedantic -c -S -O3 -fomit-frame-pointer
**********************************************************************/


void one( std::vector<int>& seq )
{ for( auto& i : seq ) ++i ; }
/*********************************************************************
    movl    4(%esp), %edx
    movl    (%edx), %eax
    movl    4(%edx), %edx
    cmpl    %edx, %eax
    je  L1
L3:
    addl    $1, (%eax)
    addl    $4, %eax
    cmpl    %eax, %edx
    jne L3
L1:
    rep
    ret
**********************************************************************/


void two( std::vector<int>& seq )
{ for( auto iter = seq.begin() ; iter != seq.end() ; ++iter ) ++*iter ; }
/*********************************************************************
    movl    4(%esp), %edx
    movl    (%edx), %eax
    movl    4(%edx), %edx
    cmpl    %edx, %eax
    je  L6
L8:
    addl    $1, (%eax)
    addl    $4, %eax
    cmpl    %eax, %edx
    jne L8
L6:
    rep
    ret
**********************************************************************/


void three( std::vector<int>& seq )
{ for( std::vector<int>::size_type i = 0 ; i < seq.size() ; ++i ) ++seq[i] ; }
/*********************************************************************
    movl    4(%esp), %eax
    movl    (%eax), %edx
    movl    4(%eax), %ecx
    subl    %edx, %ecx
    sarl    $2, %ecx
    testl   %ecx, %ecx
    je  L10
    sall    $2, %ecx
    xorl    %eax, %eax
L12:
    addl    $1, (%edx,%eax)
    addl    $4, %eax
    cmpl    %ecx, %eax
    jne L12
L10:
    rep
    ret
**********************************************************************/


void four( std::vector<int>& seq )
{ std::for_each( seq.begin(), seq.end(), []( int& i ) { ++i ; } ) ; }
/*********************************************************************
    movl    4(%esp), %eax
    movl    4(%eax), %edx
    movl    (%eax), %eax
    cmpl    %eax, %edx
    je  L17
L19:
    addl    $1, (%eax)
    addl    $4, %eax
    cmpl    %eax, %edx
    jne L19
L17:
    rep
    ret
**********************************************************************/

@mzimmers: are the times still roughly proportional with the extra step included, i.e. does the vector degrade relative to the list due to the shift-copying of items?
I'm not sure what my compiler options are set to: I'm using Qt Creator, and those are buried somewhere I can't find.
Check out if you can generate a makefile and look for CXXFLAGS there.
I want to believe that your test is not biased by the construction of the container.

	iter = nyq_delay_i.begin();
	nyq_delay_i.insert(iter, i_symb);
	if (nyq_delay_i.size() > (unsigned) COEFF_LEN)
		nyq_delay_i.pop_back();
Why aren't you always shifting it?
What is the value of COEFF_LEN?
How many times are you running wci_nyq_filt()?


Now implement the queue and reduce the order.

The GNU compiler generates identical code for one, two and four, but trips up on three (array subscript).


Wouldn't that be because the subscript operator actually needs to check the index?
> Wouldn't that be because the subscript operator actually needs to check the index?

No, the subscript operator of the vector does not do any checking. std::vector<>::at() does.
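A quick illustration of the difference (the helper at_throws is hypothetical): operator[] performs no bounds check, while at() throws std::out_of_range for an invalid index.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if v.at(i) throws std::out_of_range, i.e. i is out of bounds.
// Note: v[i] with an invalid i would simply be undefined behaviour.
bool at_throws(const std::vector<int>& v, std::size_t i)
{
    try {
        (void)v.at(i);  // checked access
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```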
void wci_nyq_filt ( int32_t i_symb,				// I symbol input
					int32_t q_symb,				// Q symbol input
					int32_t *nyq_coeffs,		// Nyquist coefficients for I and Q
					vector<int32_t> &nyq_delay_i,		// I delay line storage
					vector<int32_t> &nyq_delay_q,		// Q delay line storage
					int32_t *nyq_iout,	 		// I output from Nyquist filter
					int32_t *nyq_qout) 			// Q output from Nyquist filter
{
	int32_t	i;
	vector<int32_t>::iterator	iter;
	*nyq_iout = 0 ;
	*nyq_qout = 0 ;

	//	Compute the next filter output
	i = 0;
	for (iter = nyq_delay_i.begin(); iter != nyq_delay_i.end() ; ++iter )
	{
		*nyq_iout += nyq_delay_i[i] * nyq_coeffs[i] ;
		*nyq_qout += nyq_delay_q[i] * nyq_coeffs[i] ;
		i++;
	}


Assuming the vector is indexed from 0, couldn't you use:

int i_vector_size = nyq_delay_i.size();

for( i = 0; i < i_vector_size; i++ )
{
    *nyq_iout += nyq_delay_i[i] * nyq_coeffs[i];
    *nyq_qout += nyq_delay_q[i] * nyq_coeffs[i];
}


You would eliminate the extra op of incrementing a separate indexing variable, since the indexing is folded into the loop counter.

One other thought; I'm not sure it will help and I can't test it without some real-world test data: indexed addressing is almost always slower than absolute. It may or may not be a performance gain to assign the indexed elements to locally assigned variables before the arithmetic.

I would need to test it myself before I stood by the idea, but when I worked on older procs (8088 and such), it was a valid method for seeing performance increases. I would also like to see the assembler code between the two methodologies.
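As a sketch of that idea (function and parameter names are mine, not from the thread): accumulate into locals and load each coefficient once, rather than writing through the output pointers on every iteration.

```cpp
#include <cstdint>

// Hypothetical "hoist the indexed loads" version of the filter inner loop.
// Accumulators live in locals (likely registers) instead of being updated
// through *iout / *qout on every pass; each coefficient is loaded once.
void filt_hoisted(const int32_t* di, const int32_t* dq, const int32_t* c,
                  int32_t len, int32_t* iout, int32_t* qout)
{
    int32_t acc_i = 0, acc_q = 0;
    for (int32_t i = 0; i < len; ++i) {
        const int32_t coeff = c[i];  // one indexed load, reused for I and Q
        acc_i += di[i] * coeff;
        acc_q += dq[i] * coeff;
    }
    *iout = acc_i;  // single store at the end
    *qout = acc_q;
}
```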

I have one other idea, but I don't think it will go over well with the C++ crowd: there is a reason programmers don't use highly abstracted types in inner loops. Where ultra-high performance is needed, you go lower level. I would never use an STL container in an inner loop for something truly high-performance. It may be harder to maintain and require more work to make small changes, but we're talking about areas where performance matters more and the extra hassle is worth it.

SIK: I'm not sure I understand your question. Deques, lists, and vectors were tested with identical code. The only difference was a global replace of the STL class being used. I just forgot to copy that extra bit of code when posting it here.

ne555: the routine does always shift the container. When it gets to a certain size, though, I don't want it to get any bigger. The original program had pre-allocated the size of the array.

COEFF_LEN is 256 * 8, and this routine is called 2 << 24 times.

Roberts: thanks for the suggestion on the loop modification. I might try that a little later today. What "low level" technique might you suggest for the inner loop?
Leave it as arrays, then possibly assign the indexed array element to a local.

You'll need to test with a hi-rez timer to see if you gain any performance. Look into the subject of unrolling loops.
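A minimal unroll-by-4 sketch of the inner product (my own illustration, assuming the length is a multiple of 4, which holds for the COEFF_LEN = 256 * 8 quoted later in the thread):

```cpp
#include <cstdint>

// Hypothetical manual unrolling of a dot product by a factor of 4.
// Four independent partial sums reduce loop overhead and give the CPU
// more instruction-level parallelism; they are combined at the end.
int32_t dot_unrolled4(const int32_t* a, const int32_t* b, int32_t len)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int32_t i = 0; i < len; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```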

Is coeff_len any number? Odd or even? How about nyq_coeffs? Are the coefficients odd, even, either? Where you can multiply or divide by a power of two, you're better off bit shifting.

I looked up Nyquist to get a very superficial understanding of what this is about, but the actual algorithm is a mystery to me.

You're stepping into an area of programming where you're trying to cut ops and microseconds. If you continue down this road, you'll find you'll be doing a lot of revision, testing and back again. I hope you have a good RCS.

Is there any way you can provide some source and test data to me? I actually love this side of programming. Come to the dark side, Luke.
> Assuming the vector starts at 0, couldn't you use:
> ...
> You would eliminate the extra op for the increment of a separate variable for indexing
> and indexing is simply folded into the loop.

We can safely assume that if we know that, a decent compiler would also know that and apply the same optimization:

void foo( std::vector<int>& a, std::vector<int>& b,
           int* coeffs, int* x, int* y )
{
    *x = 0 ;
    *y = 0 ;

    std::size_t i = 0 ;

    for( auto iter = a.begin() ; iter != a.end() ; ++iter )
    {
        *x += a[i] * coeffs[i] ;
        *y += b[i] * coeffs[i] ;
        ++i ;
    }
}


with --std=c++11 -Wall -Wextra -Werror --pedantic -c -S -O3 -fomit-frame-pointer generated:
__Z3fooRSt6vectorIiSaIiEES2_PiS3_S3_:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$4, %esp
	movl	24(%esp), %eax
	movl	36(%esp), %ecx
	movl	40(%esp), %ebx
	movl	32(%esp), %esi
	movl	(%eax), %edi
	movl	4(%eax), %eax
	movl	$0, (%ecx)
	movl	$0, (%ebx)
	cmpl	%eax, %edi
	je	L1
	movl	28(%esp), %edx
	movl	(%edx), %ebp
	leal	4(%edi), %edx
	subl	%edx, %eax
	shrl	$2, %eax
	addl	$1, %eax
	movl	%eax, (%esp)
	xorl	%eax, %eax
	.p2align 4,,7
L3:
	movl	(%edi,%eax,4), %edx
	imull	(%esi,%eax,4), %edx
	addl	%edx, (%ecx)
	movl	0(%ebp,%eax,4), %edx
	imull	(%esi,%eax,4), %edx
	addl	$1, %eax
	addl	%edx, (%ebx)
	cmpl	(%esp), %eax
	jne	L3
L1:
	addl	$4, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret


void bar( std::vector<int>& a, std::vector<int>& b,
          int* coeffs, int* x, int* y )
{
    *x = 0 ;
    *y = 0 ;

    std::size_t i_vector_size = a.size();

    for( std::size_t i = 0; i < i_vector_size; i++ )
    {
        *x += a[i] * coeffs[i] ;
        *y += b[i] * coeffs[i] ;
    }
}


With the same compiler options generated identical code:
__Z3barRSt6vectorIiSaIiEES2_PiS3_S3_:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$4, %esp
	movl	24(%esp), %eax
	movl	36(%esp), %ecx
	movl	40(%esp), %ebx
	movl	32(%esp), %esi
	movl	(%eax), %ebp
	movl	4(%eax), %edi
	movl	$0, (%ecx)
	movl	$0, (%ebx)
	subl	%ebp, %edi
	sarl	$2, %edi
	testl	%edi, %edi
	je	L8
	movl	28(%esp), %eax
	movl	%edi, (%esp)
	movl	(%eax), %edx
	xorl	%eax, %eax
	movl	%edx, %edi
	.p2align 4,,7
L10:
	movl	0(%ebp,%eax,4), %edx
	imull	(%esi,%eax,4), %edx
	addl	%edx, (%ecx)
	movl	(%edi,%eax,4), %edx
	imull	(%esi,%eax,4), %edx
	addl	$1, %eax
	addl	%edx, (%ebx)
	cmpl	(%esp), %eax
	jne	L10
L8:
	addl	$4, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret



> What "low level" technique might you suggest for the inner loop?

In general, the compiler would do a better job of "low level" optimization than the typical programmer. This "high level" code (which does the same thing):

void baz( std::vector<int>& a, std::vector<int>& b,
          int* coeffs, int* x, int* y )
{
    *x = std::inner_product( a.begin(), a.end(), coeffs, 0 ) ;
    *y = std::inner_product( b.begin(), b.end(), coeffs, 0 ) ;
}


With the same compiler options generated this:
__Z3bazRSt6vectorIiSaIiEES2_PiS3_S3_:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	movl	20(%esp), %eax
	movl	24(%esp), %edi
	movl	28(%esp), %ecx
	movl	4(%eax), %ebp
	movl	(%eax), %eax
	cmpl	%eax, %ebp
	je	L19
	movl	%ecx, %ebx
	xorl	%esi, %esi
	.p2align 4,,7
L16:
	movl	(%eax), %edx
	addl	$4, %eax
	imull	(%ebx), %edx
	addl	$4, %ebx
	addl	%edx, %esi
	cmpl	%eax, %ebp
	jne	L16
L15:
	movl	32(%esp), %eax
	xorl	%ebx, %ebx
	movl	%esi, (%eax)
	movl	4(%edi), %esi
	movl	(%edi), %eax
	cmpl	%eax, %esi
	je	L17
	.p2align 4,,7
L18:
	movl	(%eax), %edx
	addl	$4, %eax
	imull	(%ecx), %edx
	addl	$4, %ecx
	addl	%edx, %ebx
	cmpl	%eax, %esi
	jne	L18
L17:
	movl	36(%esp), %eax
	movl	%ebx, (%eax)
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
L19:
	xorl	%esi, %esi
	jmp	L15


Yes, marginally slower - two loops, one for each call to std::inner_product() - instead of one.

"Low level" techniques typically end up obfuscating the code, and nothing more.

You would be much better off, thinking about "high level" design decisions - decisions that the compiler can't make for you. For instance, use a circular buffer implementation. And for using it, opt for the "highest level" that you can conceive of - for instance a boost::circular_buffer<> http://www.boost.org/doc/libs/1_49_0/libs/circular_buffer/doc/circular_buffer.html

JL, it's been about a decade at least since I worked with ASM.

What do the registers enclosed in parens mean?