Loop Unrolling with Parallel Accumulators

Could someone review the code below and tell me whether it is a proper example of four-way loop unrolling with four parallel accumulators?

Many thanks!

void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long int i;
    long int length = vec_length(u); //u and v are assumed to have the same length
    long int limit = length-3; //needed for unrolling a max of 4 at a time
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0; //these allow parallel accumulation
    data_t sum1 = (data_t) 0;
    data_t sum2 = (data_t) 0;
    data_t sum3 = (data_t) 0;

    //unrolls 4 at a time
    for (i = 0; i < limit; i += 4) {
        sum += (udata[i] * vdata[i]);
        sum1 += (udata[i+1] * vdata[i+1]);
        sum2 += (udata[i+2] * vdata[i+2]);
        sum3 += (udata[i+3] * vdata[i+3]);
    }
    //increments by 1 until the limit is reached
    for (; i < limit; i++) {
        sum += (udata[i] * vdata[i]);
    }
    *dest = sum + sum1 + sum2 + sum3;
}
I usually use the GPU for parallel programming, but this looks like a good example to me.
I do notice that the second for loop would never execute: both loops test the same condition (i < limit) on the same variable, so by the time the first loop exits, the condition is already false for the second.
Ah. So I can take out the second for loop, and just move "*dest = sum + sum1 + sum2 + sum3;" under the first loop to get the same result.

Thanks for the input. It really helped!
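
One caveat on removing the second loop: that matches what the posted code computes, but when the length is not a multiple of 4, the last one to three elements are never added to the sum. Below is a minimal sketch of a variant that keeps the cleanup loop and runs it up to length rather than limit. It assumes plain arrays in place of vec_ptr and treats data_t as long, since neither definition appears in the thread; inner4_arrays and its parameters are hypothetical names, not part of the original code.

```c
#include <assert.h>

typedef long data_t; /* assumption: the thread does not show how data_t is defined */

/* Dot product of two equal-length arrays, unrolled four ways with four
   independent accumulators. Hypothetical stand-in for the vec_ptr version. */
void inner4_arrays(const data_t *udata, const data_t *vdata, long length, data_t *dest)
{
    long i;
    long limit = length - 3; /* i < limit guarantees i+3 is still in range */
    data_t sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    assert(length >= 0);

    /* Main loop: four elements per pass, one accumulator per lane. */
    for (i = 0; i < limit; i += 4) {
        sum0 += udata[i]     * vdata[i];
        sum1 += udata[i + 1] * vdata[i + 1];
        sum2 += udata[i + 2] * vdata[i + 2];
        sum3 += udata[i + 3] * vdata[i + 3];
    }

    /* Cleanup loop: tests against length, so it handles the 0 to 3 leftovers. */
    for (; i < length; i++) {
        sum0 += udata[i] * vdata[i];
    }

    /* Combine the partial sums only once, after both loops. */
    *dest = sum0 + sum1 + sum2 + sum3;
}
```

With length 7, for instance, the main loop covers indices 0 to 3 and the cleanup loop covers 4 to 6. With a floating-point data_t, combining four partial sums can round slightly differently from a single running sum.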