Speed up for loop with push_back

I'm trying to speed up the following code:

	for(int curRow=0; curRow<numRows; ++curRow){  // long loop
		int lb = csr_rowIndices[curRow];
		int ub = csr_rowIndices[curRow+1];

		for(int curIndex=lb; curIndex<ub; ++curIndex){
			int curCol = csr_indices[curIndex];
			float curVal = csr_val[curIndex];

			val[curCol].push_back(curVal);
			indices[curCol].push_back(curRow);
		}
	}


I have a GPU-capable workstation and am considering programming it in CUDA. My concern is that the push_back calls may prevent it from parallelizing correctly.

Please advise me of any solutions. Thank you in advance.
Did you pre-allocate the vectors before the loop? Use .reserve().
And a little heresy ... if you know the exact size, and it's time-critical, consider a plain array if you don't need any algorithms or other features of a vector.

Also, consider taking the int declarations out of the loop, including curCol. Or, if you want them in the loop, consider making them const or register or both. Time the loop with all these and see which one your compiler prefers.

Instead of

int lb = ...
int ub = ...

let it be

lb = ...
ub = ...


Like this:
{ // scope the ints
    int lb, ub;
    for (...)
    {
        lb = ...;
        ub = ...;
    }
} // end of scope for the ints


Finally, look at your specific problem to see if you can flatten the two nested loops into a single loop. This can often be done, and it will help.
You don't show the declaration for either val or indices, so I can only guess as to their type.
I'm guessing they are vectors, although they could be any container that supports push_back (list, deque).

If they are indeed vectors, and you know the maximum size in advance, be sure to use vector::reserve() to set the capacity in advance. This eliminates the vector getting reallocated (expensive) periodically as elements are pushed onto it.

PLEASE ALWAYS USE CODE TAGS (the <> formatting button) when posting code.
It makes it easier to read your code and also easier to respond to your post.
http://www.cplusplus.com/articles/jEywvCM9/
Hint: You can edit your post, highlight your code and press the <> formatting button.


Time the loop with all these, see which one your compiler prefers.

If it prefers one or the other, get a new compiler.
is allocation of variables inside a loop free now? I haven't done anything timed since I started trying to get back into it...
is allocation of variables inside a loop free now?

If it's a POD, it always has been.
When I was doing real time inner loops, which is admittedly a decade ago, it was notably faster. Making them const or moving them out of the loop would prevent it from flushing the value back to the cache, and allocation /deallocation is an assembler push & pop pair. If that has been handled by smarter compilers, it would be nice.
When I was doing real time inner loops, which is admittedly a decade ago, it was notably faster.

I can't speak for decades old compilers, but there has never been a reason for a difference in performance in either C or C++ for this specific situation (typically allocation for stack variables is done once at the beginning of the function if it is even needed, regardless of where the variables appear, and deallocation is done once at function end.) There should literally be no difference in the code emitted by your compiler, although doing so with non-POD types is likely to engender a significant performance/code generation difference.
btw OP, you're most likely to have an out-of-bounds error here:
 
int ub = csr_rowIndices[curRow+1];

when curRow == numRows-1, unless csr_rowIndices holds numRows+1 entries (as a standard CSR row-pointer array does).
cire is correct. I did a number of tests and there is no effect with my suggestions. I will do a strike out edit on those.

What do you guys think about using the curCol and curVal values directly, and not storing them in local variables at all? Better or worse? curVal is only used once, so that one I think would be good. curCol is used twice, so it may be better as-is, but I am not sure.


What do you guys think about using the curCol and curVal values directly, and not storing them in local variables at all?

With full optimization, it should make no difference. An optimizing compiler will assign those variables to registers.
It's a small change but it might help. This assumes that numRows>0:
    int lb = csr_rowIndices[0];
    int ub;
    for(int curRow=0; curRow<numRows; lb=ub, ++curRow){ // long loop
        ub = csr_rowIndices[curRow+1];

        for(int curIndex=lb; curIndex<ub; ++curIndex){
            int curCol = csr_indices[curIndex];
            float curVal = csr_val[curIndex];

            val[curCol].push_back(curVal);
            indices[curCol].push_back(curRow);
        }
    }
Thanks for the suggestions all; it's heartening to see so many helpful people here.

I'll try out the modifications mentioned.

On another note, is it possible to use OpenMP or CUDA to parallelize the code?

I've tried #pragma omp parallel for private(curRow) shared(curCol, curVal) to no effect... Is there anything I've missed?

@dhayden

I've tried your suggestion but it had no effect on timing. Thanks for the tip. Much appreciated.
if you can flip it to be a single loop it would be easier to make it parallel...

It may not be possible because the inner loop is controlled by data.
looking.

tell me about

csr_rowIndices[curRow];
csr_rowIndices[curRow+1];

What's in there?

Also, it may be best if you thread (or otherwise parallel-process) the inner loop, where each outer-loop iteration spawns a thread for the inner one. IIRC GPUs have hundreds or thousands of RISC cores, right -- top-end cards pushing over 2k now?
@jonnin


For context, the overall function that I'm trying to parallelize is the conversion of CSR to CSC (Compressed Sparse Row to Compressed Sparse Column) format.

This segment of code is meant to create vectors containing the column numbers and the corresponding values.

csr_rowIndices is a vector that contains the starting index of each row.

Therefore csr_rowIndices[curRow] is used as the lower bound (lb), which tells the inner loop where to start, and csr_rowIndices[curRow+1] is the upper bound, i.e. where the inner loop should stop.

Hope that gives you more information. Thank you.

And yes, GPUs have many cores, which is what I'm considering using to make the code run faster.
I don't know that you can even make that parallel. If the inner loop depends on the previous iteration of the same inner loop, I don't see a way to run that concurrently. If you can "guess" at the parameters for the inner loop at the outer loop, you can spawn 1000 or so and see which result to use, and repeat. But I don't know if your problem lends itself to that sort of approach or not.

Can you take a sloppy approach, is there a lowest lower bound and highest upper bound such that you get the right result splitting it across 2k cores but doing more work on each core than necessary? That is still going to beat waiting on each iteration with tighter bounds .... ?




@Jonnin

I wonder if the following preallocation code is right:

	vector< vector<float> > vec_L_val(numRows), vec_U_val(numRows);
	vector< vector<int> > vec_L_indices(numRows), vec_U_indices(numRows);


Perhaps instead of numRows, I should put NNZ (number of non-zeros). However, that would make the push_back redundant and mean I'd need to modify the code. Hope my changes won't skew the code too much...

I'm thinking:

Perhaps we can schedule the outer loop so that each thread serves a value of curRow.

I.e. threads 0 to 5 will process curRow [0 to 5], then the next 6 rows [6 to 11], and so forth, wherein the inner loop appends the next push_back value into the respective vectors.

Is there any way we can schedule it as such? Maybe we can try chunking it.
You can schedule or split it this way, sure; all you need is, for each chunk of work, the data it will operate on. If that data is produced later, you have to wait for it, and you can do that too. Waiting, however, destroys most of the gains of the parallelism.

I am not familiar with CUDA work. We had one, but didn't use it much; the GPU took too much power and we were on a battery/embedded system. I ran a Beowulf cluster for a while, but I was just playing. Most of my parallel work is just quad-core tweaks on modern CPUs. That, I have done quite a bit of.

Topic archived. No new replies allowed.