How to increase the speed of large for loops

Right now I'm trying to run very large for loops for a task, roughly 8e+12 iterations in total. I tried using C++11 threading, but it does not seem to be as fast as required. My system has 8 GB of RAM, an i5 CPU and an Intel HD Graphics 4000 card. Would OpenMP do better, or do I have to use an NVIDIA GPU with CUDA for this task? My code is below:
void thread_function(pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr cloudB,vector<int> v,int p0) {
    for(size_t p1=0;p1<v.size() && ros::ok();++p1) {
        int p0p1 = sqrt( pow(cloudB->points[v[p1]].x-cloudB->points[v[p0]].x,2)                
        +pow(cloudB->points[v[p1]].y-cloudB->points[v[p0]].y,2)
        +pow(cloudB->points[v[p1]].z-cloudB->points[v[p0]].z,2) ) * 1000;
        if(p0p1>10) {
            for(size_t p2=0;p2<v.size() && ros::ok();++p2) {
                int p0p2 = sqrt( pow(cloudB->points[v[p2]].x-cloudB->points[v[p0]].x,2)                
                +pow(cloudB->points[v[p2]].y-cloudB->points[v[p0]].y,2)
                +pow(cloudB->points[v[p2]].z-cloudB->points[v[p0]].z,2) ) * 1000;
                int p1p2 = sqrt( pow(cloudB->points[v[p2]].x-cloudB->points[v[p1]].x,2)                
                +pow(cloudB->points[v[p2]].y-cloudB->points[v[p1]].y,2)
                +pow(cloudB->points[v[p2]].z-cloudB->points[v[p1]].z,2) ) * 1000;
                if(p0p2>10 && p1p2>10) {
                }    
            }    
        }       
    }    
    x[p0] = 3;
    cout<<"ended thread="<<p0<<endl;
}


This task is really important for my algorithm to complete. I need suggestions on how to make these loops run very fast. In the code above, thread_function is where I'm currently putting the for loops. Is there any way to increase its performance?
pow() is notoriously slow for integer powers. Use x*x instead of pow(x,2) and you will gain a fair bit back.
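For example, a minimal sketch using the same expressions as in the original loop (the dx/dy/dz temporaries are just names I've introduced):

// Sketch: name the differences once and multiply, instead of calling pow(x, 2)
float dx = cloudB->points[v[p1]].x - cloudB->points[v[p0]].x;
float dy = cloudB->points[v[p1]].y - cloudB->points[v[p0]].y;
float dz = cloudB->points[v[p1]].z - cloudB->points[v[p0]].z;
int p0p1 = sqrt(dx*dx + dy*dy + dz*dz) * 1000;   // same value as before, no pow() calls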

Threading, CUDA and such may help, but get the single-threaded code as tight as possible before you go there, as it is much more difficult to debug/tune multi-threaded or exotic code.

This condition may cost you, but it does nothing:
if(p0p2>10 && p1p2>10) {
}
Can you preserve any of the computations across iterations? Is p0 with p1 followed by p1 with p2, so that some of the p1 computations are doubled up? Not sure at a glance.

It looks like you are computing some sort of distance formula. Can you get away with using the squared values instead? E.g. test p0p2 > 100 instead of testing whether its sqrt is > 10.

Are you page faulting? Again, only looking at it quickly, but you may be better off if your data structures were transposed so you can iterate memory sequentially where you are using the 2-D [][] notation. Check that to make sure it is going across memory linearly.
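As an illustration of the transposed layout, a minimal sketch (assuming the coordinates can be copied out of the cloud once before the heavy loops; xs/ys/zs are made-up names):

// Sketch: copy the coordinates into separate contiguous arrays once, so the hot
// loops walk memory linearly instead of striding through whole PointXYZRGB structs.
std::vector<float> xs(v.size()), ys(v.size()), zs(v.size());
for (size_t i = 0; i < v.size(); ++i) {
    xs[i] = cloudB->points[v[i]].x;
    ys[i] = cloudB->points[v[i]].y;
    zs[i] = cloudB->points[v[i]].z;
}
// Inside the loops, use xs[p1] - xs[p0] etc. instead of cloudB->points[v[p1]].x - ...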
Don't use sqrt()

If the item square-rooted is > 10, then the original item is > 100.

Do a quick check on the individual coordinates only, before multiplying.
E.g. if abs(x) > 10 then you know perfectly well that x*x + y*y + z*z > 10*10 without any more effort.
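A minimal sketch of that early-out, reusing the variable names from the question (kMinDist is a made-up constant; the 0.010 assumes the threshold "10" in the code corresponds to 10/1000 = 0.01 in raw cloud units because of the *1000 scaling):

// Sketch: cheap single-axis test first; fall back to the full squared-distance
// test only when the x difference alone is not conclusive. No sqrt, no pow.
constexpr float kMinDist   = 0.010f;
constexpr float kMinDistSq = kMinDist * kMinDist;

float dx = cloudB->points[v[p1]].x - cloudB->points[v[p0]].x;
bool far_enough = std::abs(dx) > kMinDist;             // can settle many pairs cheaply
if (!far_enough) {
    float dy = cloudB->points[v[p1]].y - cloudB->points[v[p0]].y;
    float dz = cloudB->points[v[p1]].z - cloudB->points[v[p0]].z;
    far_enough = (dx*dx + dy*dy + dz*dz) > kMinDistSq;  // compare squares, no sqrt
}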
A few questions:
1. How large is v.size()?
2. How many times will thread_function() be called? Once per processor?
3. What does v do? Do you use it to schedule work for the threads?
4. Overall, what's the function doing? I see that you're checking that various points are at a minimum distance, but I find it hard to keep track of the indirection.

A few suggestions that might improve things marginally:
1. Since you're checking for minimum distances, instead of doing this:
int d = sqrt(/*...*/);
if (d > 10){ /*...*/ }
Do this:
int d = /*...*/;
if (d > 100){ /*...*/ }

2. Like jonnin said, replace the pow(x, 2) calls with x*x. Even this would be faster:
template <typename T>
T square(const T &x){
    return x * x;
}
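Combining suggestions 1 and 2, a squared-distance helper might look like this (a sketch; squaredDistance and minDistSq are made-up names, and the caller compares against a squared threshold):

// Sketch: squared distance between two cloud points, no pow() and no sqrt().
inline float squaredDistance(const pcl::PointXYZRGB &a, const pcl::PointXYZRGB &b){
    return square(a.x - b.x) + square(a.y - b.y) + square(a.z - b.z);
}

// Usage in the loop: compare against the squared threshold instead of the distance.
// if (squaredDistance(cloudB->points[v[p1]], cloudB->points[v[p0]]) > minDistSq) { ... }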

3. What are the types of the x, y, and z members that you're using in the distance calculation? Are they floats or integers? Converting floats to integers is quite expensive, so if the members are floats you can improve things a bit by changing the types of p0p1, p0p2, and p1p2 to float or double.
4. What does ros::ok() do? Do you need to check it on every single iteration?
5. Pass v by reference, or by std::unique_ptr. Although this won't help much unless v is truly gigantic.
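For points 4 and 5, a minimal sketch of what the signature and loop structure could look like (assuming that checking for cancellation once per outer iteration is acceptable):

// Sketch: take the shared pointer and the index vector by const reference,
// and poll ros::ok() only in the outer loop rather than on every inner iteration.
void thread_function(const pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr &cloudB,
                     const std::vector<int> &v, int p0)
{
    for (size_t p1 = 0; p1 < v.size() && ros::ok(); ++p1) {
        for (size_t p2 = 0; p2 < v.size(); ++p2) {   // no ros::ok() here
            // ... distance tests as before ...
        }
    }
}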

There's nothing obviously wasteful about this code. I don't expect the changes I mentioned in the previous paragraph to save you more than 10% of the time. Ultimately, the problem seems to me that you're just doing a lot of iterations. 8 trillion is a big number.
I wouldn't bother trying to run this in CUDA; not the way it's written here, anyway. There may be some way to rewrite the algorithm to make it GPU-friendly, though.
You can probably cut down on the looping substantially by dealing only with the items that are far from p0:
vector<size_t> gt10; // indexes of items whose distance to v[0] > 10
for (size_t i = 1; i < v.size() && ros::ok(); ++i) {
    if (distance(v[i], v[0]) > 10) {   // distance() is a placeholder for your metric
        gt10.push_back(i);
    }
}

// Now look for pairs, among only those indexes, that are also far apart
for (size_t i = 0; i < gt10.size(); ++i) {
    for (size_t j = i + 1; j < gt10.size(); ++j) {
        if (distance(v[gt10[i]], v[gt10[j]]) > 10) {
            // Do whatever you're doing
        }
    }
}
Hi,

Are you aware of std::hypot?

https://en.cppreference.com/w/cpp/numeric/math/hypot
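For reference, the three-argument overload added in C++17 computes the 3-D case directly. A small sketch (a and b stand for two cloud points here):

#include <cmath>

// Sketch: std::hypot protects against overflow/underflow in the intermediate squares;
// it is generally tuned for accuracy rather than raw speed.
double d = std::hypot(a.x - b.x, a.y - b.y, a.z - b.z);  // sqrt(dx*dx + dy*dy + dz*dz)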

What is the purpose of the function? We might be able to come up with a better solution if we knew what you actually want to do. This is an XY problem: we have your solution Y, but we don't know what X is.

BTW, I am an Engineering and Mining Surveyor, I deal with Point Clouds all the time.

Why does the inner for loop process distances to all the items in v? That's O(n²) right there. If you can find a way to do it in O(n) or O(n log n) you would be down to millions rather than trillions.

There might be better data structures to store your data; I remember old software was slow with point clouds because it stored all the data in one giant array. One such data structure that could be better (depending on what you are doing) is the Tile Structure, which is a specialisation of an R-Tree (Rectangular Tree). Basically, if one can put the data into smaller tiles, then the lookup is greatly reduced.
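To illustrate the idea, here is a minimal sketch of a uniform grid, a simpler cousin of the tile/R-Tree structures described above (CellKey, buildGrid and cellSize are made-up names):

#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <unordered_map>
#include <vector>
#include <cmath>

// Sketch: bucket point indexes into fixed-size 3-D cells, so a neighbour query only
// has to look at a point's own cell and the 26 adjacent cells instead of the whole cloud.
struct CellKey {
    int x, y, z;
    bool operator==(const CellKey &o) const { return x == o.x && y == o.y && z == o.z; }
};

struct CellKeyHash {
    size_t operator()(const CellKey &k) const {
        return std::hash<long long>()(((long long)k.x * 73856093) ^
                                      ((long long)k.y * 19349663) ^
                                      ((long long)k.z * 83492791));
    }
};

// cellSize is a tuning parameter, e.g. the distance threshold of interest.
std::unordered_map<CellKey, std::vector<int>, CellKeyHash>
buildGrid(const pcl::PointCloud<pcl::PointXYZRGB> &cloud, float cellSize)
{
    std::unordered_map<CellKey, std::vector<int>, CellKeyHash> grid;
    for (int i = 0; i < (int)cloud.points.size(); ++i) {
        const auto &p = cloud.points[i];
        grid[CellKey{ (int)std::floor(p.x / cellSize),
                      (int)std::floor(p.y / cellSize),
                      (int)std::floor(p.z / cellSize) }].push_back(i);
    }
    return grid;
}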

You might also benefit from decoupling PointXYZRGB into XYZ and RGB, do COGO with the XYZ, look up RGB when you need it.

What is the function supposed to produce? At the moment it produces nothing (apart from setting x[p0] and printing a message at the end), or was that the result of dumbing it down to ask the question?

It also seems odd that you are interested in points that are further away than 10 metres. That seems backwards: point clouds typically have millions of points and can cover large areas; I have dealt with LIDAR data which covered 70 km². Why have an interest in all the points at the other end of the survey, from each of the points in the whole survey? More bizarre is that you truncate the distance because of the use of the integer type.

So in short, please supply a description of what it is you are doing exactly, so we can help you better :+)
I thought hypot favored accuracy over speed, though. I still think avoiding the root entirely is the better answer here. Our OP seems to have lost interest already.
Hi jonnin,

I think the OP's problems are:

The algorithm;
The data structure;
Types and references.

In my experience, problems like this do require accuracy; I don't see that as being a problem with the right data structure. I am not convinced the OP does need to calculate all those distances.

For example, there are folks who use LIDAR data to produce whole map sheets with contours (a computationally and data expensive operation), say about 100 km by 50 km. Using about 50,000 points/km², which is what I had with my data (3.4 million points for 70 km²), one could create tiles of 1 km², then blocks of 10 km by 10 km, then map sheets of 100 km by 50 km, then cover the whole country with districts, states and zones. That way one only has to process 50,000 points at a time to do arbitrarily large areas.

Regards :+)
I agree, it looks to be doing too much work. But without some feedback we are just guessing, educated or not.

The data structure is the likely culprit; page faults are probably an issue in this thing.