Help with OpenMP

Hi there,

I am new to OpenMP programming. I am trying to parallelize the code below, but it runs very slowly. I am using the Intel icpc compiler.
I would very much appreciate it if someone could help me parallelize the code properly.
Thanks in advance.


 int k, i, j, l, numOfThreads;
 numOfThreads = 10;
 // m = number of animals with genotypes
 // n = total number of animals
 // Ped = matrix(n,2); Ped(i,1) = father; Ped(i,2) = mother
 // Rhs = vector of size n, all zeroes except 1 for the animal of interest
 // indx1(m) = vector containing all genotyped animals
 // Sol(n) = vector containing interim solutions
 // v(n) = vector containing final values of the matrix
 // Asub(m,m) = matrix containing values for the genotyped animals
 // all vectors and matrices are Blitz arrays

 #pragma omp parallel private(i, j, k) shared(m,n,Ped,D,indx1, Sol, Rhs, Asub,v)
 #pragma omp parallel for num_threads(numOfThreads)
{

  for (k = 0; k < m; k++) {  // this is the main loop
    Rhs = 0.0;
    Rhs(indx1(k)) = 1.0;
    Sol = 0.0;

    for (i = n-1; i > -1; i--) { // this is the second loop where we add values to the relatives
      Sol(i) += Rhs(i);
      if (Ped(i,1) > 0) {  // check for father
	Sol(Ped(i,1)) += Sol(i)*0.5;
      }
      if (Ped(i,2) > 0) {      // check for mother
	Sol(Ped(i,2)) += Sol(i)*0.5;
      }
    }

    v = 0.0;
  for (j = 0; j < n; j++) {   // this is the third loop; we calculate the values of the matrix
    if (Ped(j,1) == 0 and Ped(j,2) == 0) {
      v(j) = D(j)*Sol(j);
    }
    if (Ped(j,1) > 0 and Ped(j,2) > 0) {
      v(j) = D(j)*Sol(j) + (v(Ped(j,1)) + v(Ped(j,2)))*0.5;
    }
    if (Ped(j,1) > 0 and Ped(j,2) == 0) {
      v(j) = D(j)*Sol(j) + (v(Ped(j,1)))*0.5;
    }
    if (Ped(j,1) == 0 and Ped(j,2) > 0) {
      v(j) = D(j)*Sol(j) + (v(Ped(j,2)))*0.5;
    }
  }


  for (l = 0; l < m; l++) { // final loop where we put the values only for genotyped animals
    Asub(k, l) = v(indx1(l));
  }


  } // end of k loop

} // end of OMP
Nesting the parallel directives is probably not what you want. https://docs.oracle.com/cd/E19205-01/819-5270/aewbi/index.html
The nested parallel for directive will spawn numOfThreads threads for each thread that is started by the first parallel directive.

Make sure you are enabling OpenMP as a compiler option. For icpc the documentation for OpenMP compiler options is located here https://software.intel.com/en-us/node/522690
I have checked my makefile (see below)

DIRECTIVES = -m64 -DMKL_ILP64
CFLAGS = -w -O2 -xsse4.2 -ansi-alias -ip -sox -openmp -mkl -DTEST_GAP -vec-report1 -par-report1 $(DIRECTIVES) $(CDIRS)
I can use -parallel instead of -openmp but it gives the same result.
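Two things worth double-checking against your compiler version: newer icpc releases deprecate -openmp in favor of -qopenmp, and -parallel enables the compiler's auto-parallelizer rather than OpenMP, so it does not honor the pragmas at all. Keeping your existing makefile variables, something like this should be enough to turn the pragmas on:

```
CFLAGS = -w -O2 -xsse4.2 -ansi-alias -ip -sox -qopenmp -mkl -DTEST_GAP $(DIRECTIVES) $(CDIRS)
```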


If I use just #pragma omp parallel for num_threads(numOfThreads)
Wouldn’t it parallelize only the first loop?
The main work in my code is done by the i and j loops:
for (i = n-1; i > -1; i--) {}
for (j = 0; j < n; j++) {}
These I want to parallelize. Any suggestions? Thanks.
The way I have typically seen a block of code passed to the omp parallel directive is like this:
#pragma omp parallel /* stuff */
{
//lines of code you want to run in parallel
//so if you had 3 for loops, every thread would run 3 for loops
}

or
#pragma omp parallel /* stuff */
//one line of code that will run in parallel 


The only place I have typically seen the omp parallel for directive is immediately before a for loop, and only very rarely inside an outer omp parallel directive (because of that whole "each thread spawns more threads" issue, though sometimes people do want that).

You can also use #pragma omp for inside of an omp parallel section. If any thread depends on a computation from another thread (for example, the first for loop needs to be complete before the next one can execute), the simplest fix is to use #pragma omp critical. I think #pragma omp for is probably what you want.
Thanks for the reply. I made two attempts, both unsuccessful.


1).

#pragma omp parallel for num_threads(10)

for ( k = 0; k < m; k++)
{
// this is the main loop; some initialisation is done here
#pragma omp critical
for ( i = n; i > 0; i--)
{
// some work is done here; second loop
}
for ( j = 0; j <= n; j++)
{
// this is the third loop; final work is done here
}
for ( l = 0; l < m; l++)
{
// final loop; the results from loop 3 are transfered into the matrix
}

} // end of k loop

This code parallelises with a speed-up factor of about 2, but produces wrong results.

2).

#pragma omp parallel private(k) num_threads(10)

for ( k = 0; k < m; k++)
{
// this is the main loop; some initialisation is done here

#pragma omp parallel private(i) num_threads(10)

for ( i = n; i > 0; i--)
{
// some work is done here; second loop
}

#pragma omp parallel private(j) num_threads(10)

for ( j = 0; j <= n; j++)
{
// this is the third loop; final work is done here
}

#pragma omp parallel private(l) num_threads(10)

for ( l = 0; l < m; l++)
{
// final loop; the results from loop 3 are transfered into the matrix
}

} // end of k loop

This code parallelises again, but it is much slower than the first attempt and again produces wrong results.

I am completely lost. Help please. Thanks

I realize now that you are only parallelizing one big for loop, so I would think you need #pragma omp parallel for and possibly a few criticals inside. I thought there was code other than the k loop that you were putting into #pragma omp parallel .

I ran a little experiment with some printfs that show how the various threads each repeat the i, j, and l loops for different values of k. Basically it shows that k is divided amongst all of the threads, and each thread runs its own i, j, and l loops. Instead of declaring i, j, and l above the parallel section, I put the int in the initializer of each for loop. That way fewer shared variables need to be declared (which may be part of what is slowing down your code).

  #pragma omp parallel for num_threads(numOfThreads)
  for (int k = 0; k < m; k++) {  // this is the main loop
    for (int i = n-1; i > -1; i--) { // this is the second loop where we add values to the relatives
      printf("[%d]: i = %d\n", omp_get_thread_num(), i);
    }

    for (int j = 0; j < n; j++) {   // this is the third loop; we calculate the values of the matrix
      printf("[%d]: j = %d\n", omp_get_thread_num(), j);
    }

    for (int l = 0; l < m; l++) { // final loop where we put the values only for genotyped animals
      //printf("[%d]: l = %d\n", omp_get_thread_num(), l);
    }
  } // end of k loop 


However, if n is much larger than m, your program may benefit more if you parallelize the loops that depend on n. Parallelization is still a very experimental science.