2D array performance

I implemented a triple-loop matrix multiplication with several different 2D array definitions (pointer-to-pointer, a template class, and a plain definition like float A[size][size];) and measured the MFLOPS of the multiplication. I also wrote the float A[size][size] version in C, and interestingly it scored higher than the C++ implementation. The results are below. I want to know where the problem is, and whether (and how) I can reach MFLOPS in the C++ implementation like in the C implementation. By the way, as you can see, the C version takes about 11 seconds and is faster than C++, and its flop count is about 6 billion, the same as the float A[size][size] definition in C++; the MFLOPS is higher in C only because the elapsed time in C++ is longer.
A last point: why is the flop count with the pointer-to-pointer and template implementations about 2 billion, while with the float A[size][size] definition it is about 6 billion in both C and C++?

Thanks

sample result:
in C:

using (float [1000][1000])
Real_time: 11.547754 Proc_time: 11.386652 Total flpins: 6004064768 MFLOPS: 527.289734

in c++:
using(float A[matrixSize][matrixSize];)
Real_time: 15.178202 Proc_time: 15.079040 Total flpops: 6003518464 MFLOPS: 398.136658

using (template A= AllocateDynamicArray<float> (matrixSize, matrixSize);)
Real_time: 19.024035 Proc_time: 18.516565 Total flpops: 2000023680 MFLOPS: 108.012665

using (float** DynamicArray;
float **A = new float*[matrixSize];
for (int i = 0; i < matrixSize; ++i)
A[i] = new float[matrixSize];)
Real_time: 19.031061 Proc_time: 18.447145 Total flpops: 2000024320 MFLOPS: 108.419174
Post the actual program.

It's very easy to write superficially similar programs that actually do very different things, especially when you compare different programming languages.

In general, an array of pointers to the first elements of separately allocated arrays of floats (your "DynamicArray") will always be slower to iterate over than an actual 2D array (a row-major contiguous block); no surprise there.
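For illustration (this is my own sketch, not code from the posts; the helper names make_contiguous and free_contiguous are made up): a common middle ground is to keep the float** syntax but back it with a single contiguous allocation, so all rows sit next to each other in memory.

```cpp
#include <cstddef>

// Sketch: float** with contiguous storage. Only two allocations are made;
// every row pointer points into one rows*cols block, so A[i][j] syntax
// still works but iteration touches memory sequentially.
float** make_contiguous(std::size_t rows, std::size_t cols) {
    float** m = new float*[rows];
    m[0] = new float[rows * cols];       // one block for all elements
    for (std::size_t i = 1; i < rows; ++i)
        m[i] = m[i - 1] + cols;          // row i starts right after row i-1
    return m;
}

void free_contiguous(float** m) {
    delete[] m[0];                       // the element block
    delete[] m;                          // the row-pointer array
}
```

This still pays one pointer indirection per row access, but it avoids the scattered per-row allocations of the original DynamicArray.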
Here is the code; do you think the versions can really perform that differently? I know that C tends to perform faster than C++, but I need OOP, so I'd like to use C++ while getting the performance of C, or at least close to it. Is that possible? (I mean: can the 398 MFLOPS with the static array definition in C++ not be increased any further, and are pointer-to-pointer arrays sentenced to 108 MFLOPS?) I need to run experiments where the matrix size grows across loop iterations, so do you suggest using an actual 2D array? Also, lastly, do you think using a 1D array as a 2D array would be better or not, in terms of performance, i.e. does calculating the index to access rows and columns consume extra CPU cycles?

Thanks

C++ implementation

template <typename T>
class DynamicArray {
    vector< vector<T> > dArray;
public:
    DynamicArray(int rows, int cols) : dArray(rows, vector<T>(cols)) {}

    vector<T> & operator[](int i)
    {
      return dArray[i];
    }
    const vector<T> & operator[] (int i) const
    {
      return dArray[i];
    }
};

int main() {

	float real_time, proc_time, mflops;
	long long flpops;
	float ireal_time, iproc_time, imflops;
	long long iflpops;
	int retval;
	int i, j, k;
	const int matrixSize = 1000; // must be a compile-time constant for float A[matrixSize][matrixSize]
	int loopSteps = 1;

//first definition
/*
	float** DynamicArray;

	float **A = new float*[matrixSize];
	for (int i = 0; i < matrixSize; ++i)
		A[i] = new float[matrixSize];

	float **B = new float*[matrixSize];
	for (int i = 0; i < matrixSize; ++i)
		B[i] = new float[matrixSize];

	float **C = new float*[matrixSize];
	for (int i = 0; i < matrixSize; ++i)
		C[i] = new float[matrixSize];
*/

//second definition
/*
	float ** A;
	float ** B;
	float ** C;
	A= AllocateDynamicArray<float> (matrixSize, matrixSize);
	B= AllocateDynamicArray<float> (matrixSize, matrixSize);
	C= AllocateDynamicArray<float> (matrixSize, matrixSize);


*/


//third definition
/*	
	float A[matrixSize][matrixSize];
	float B[matrixSize][matrixSize];
	float C[matrixSize][matrixSize];
*/

	
//filling array with (same) numbers
	for (int i = 0; i < matrixSize; i += loopSteps) {
		for (int j = 0; j < matrixSize; j += loopSteps) {
			A[i][j] = B[i][j] = (float) 10.01; // rand() * (float)1.1;
			C[i][j] = (float) 0.0;
		}
	}

	
	start_calculate_MFLOPS();

	/* Matrix-Matrix multiply */
	for (i = 0; i < matrixSize; i++)
		for (j = 0; j < matrixSize; j++)
			for (k = 0; k < matrixSize; k++)
				C[i][j] = C[i][j] + A[i][k] * B[k][j];

	stop_calculate_MFLOPS();

	//confuse compiler not to optimize away
	dummy((float **) C);
	printf("Real_time: %f Proc_time: %f Total flpops: %lld MFLOPS: %f\n",
			real_time, proc_time, flpops, mflops);

	return 0;
} 


C implementation

float matrixa[INDEX][INDEX], matrixb[INDEX][INDEX], mresult[INDEX][INDEX];

int main(int argc, char **argv)
{
   float real_time, proc_time, mflops;
   long_long flpins;
   int retval;
   int i, j, k;

   /* Initialize the Matrix arrays */
   for (i = 0; i < INDEX * INDEX; i++) {
      mresult[0][i] = 0.0;
      matrixa[0][i] = matrixb[0][i] =10.01;// rand() * (float) 1.1;
   }

   start_calculate_MFLOPS();

   /* Matrix-Matrix multiply */
   for (i = 0; i < INDEX; i++)
      for (j = 0; j < INDEX; j++)
         for (k = 0; k < INDEX; k++)
            mresult[i][j] = mresult[i][j] + matrixa[i][k] * matrixb[k][j];

   stop_calculate_MFLOPS();

   dummy((void *) mresult);

   printf(format_string, real_time, proc_time, flpins, mflops);
   exit(0);
}
I know that c perform faster than c ++

There are many situations where C++ performs faster than C. The canonical example is std::sort() vs. qsort().
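A minimal sketch of that example (my own illustration, not from the thread): qsort() receives its comparison through a function pointer, which generally prevents inlining, while std::sort() gets the comparison as part of the element type (or a template argument) that the compiler can inline into the sorting loop.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C-style comparison for qsort: invoked through a function pointer on
// every compare, which generally blocks inlining.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// The same sort done both ways, for comparison.
std::vector<int> sort_c(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    return v;
}

std::vector<int> sort_cxx(std::vector<int> v) {
    std::sort(v.begin(), v.end());  // operator< is known at compile time
    return v;
}
```

Both produce identical results; the difference is that std::sort's comparison is visible to the optimizer, which is why it often beats qsort in benchmarks.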

do you think useing 1d array as a 2d array will be better or not

It is how matrix and higher-dimensional classes are implemented. The underlying data structure is almost always a 1D vector, a 1D valarray, or some custom 1D indexed object.
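As an illustration of that 1D-backed approach (my own sketch, not code from the thread):

```cpp
#include <cstddef>
#include <vector>

// Sketch: a 2D matrix stored row-major in one 1D vector.
// Element (r, c) lives at index r * cols + c. That multiply-add is the
// same address arithmetic the compiler already emits for a built-in 2D
// array, so it normally costs nothing extra.
template <typename T>
class Matrix {
    int rows_, cols_;
    std::vector<T> data_;
public:
    Matrix(int rows, int cols)
        : rows_(rows), cols_(cols),
          data_(static_cast<std::size_t>(rows) * cols) {}

    T& operator()(int r, int c)             { return data_[r * cols_ + c]; }
    const T& operator()(int r, int c) const { return data_[r * cols_ + c]; }

    int rows() const { return rows_; }
    int cols() const { return cols_; }
};
```

Unlike the pointer-to-pointer version, all elements are contiguous and there is no per-row pointer chase.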

Now, comparing your C program and your C++ program's "third definition": they are indeed pretty much the same. The only real difference is that your C example allocates the arrays in the static data section, while your C++ example allocates them on main()'s stack.

Given enough stack, this should not matter, except that in the stack-allocated case the compiler knows that clock() (or, in your case, start_calculate_MFLOPS() and stop_calculate_MFLOPS()) cannot access the matrices, and it is free to move the calls to clock() relative to the rest of the code in main(). As a sanity check, I timed the total runtime of each program using external tools; it wasn't very different from the self-reported clock()/clock() time, but it explains the slight preference for C++ in my timings below.

Here are the tests, exactly as compiled, that I just ran.

I increased your 1000x1000 to 2000x2000 because otherwise it finishes far too quickly on Intel (with the Intel compiler).

I changed your dummy() trick to an honest use of the result, because one of my compilers (IBM XL) saw right through the trick.

My C test:
#include <stdio.h>
#include <time.h>

#define INDEX 2000

float matrixa[INDEX][INDEX], matrixb[INDEX][INDEX], mresult[INDEX][INDEX];
int main()
{
   /* Initialize the Matrix arrays */
   for (int i = 0; i < INDEX * INDEX; i++) {
      mresult[0][i] = 0.0f;
      matrixa[0][i] = matrixb[0][i] = 10.01f;// rand() * (float) 1.1;
   }

   clock_t time_start = clock();

   /* Matrix-Matrix multiply */
   for (int i = 0; i < INDEX; i++)
      for (int j = 0; j < INDEX; j++)
         for (int k = 0; k < INDEX; k++)
            mresult[i][j] = mresult[i][j] + matrixa[i][k] * matrixb[k][j];

   clock_t time_end = clock();

   double result = 0;
   for (int i = 0; i < INDEX; i++)
        for (int j = 0; j < INDEX; j++)
            result += mresult[i][j];
   printf("CPU time %lf sec.\nresult = %lf\n", (time_end - time_start) / (double)CLOCKS_PER_SEC, result);
}


My C++ test:

#include <cstdio>
#include <ctime>

const int matrixSize = 2000;
const int loopSteps = 1;
int main()
{
    float A[matrixSize][matrixSize];
    float B[matrixSize][matrixSize];
    float C[matrixSize][matrixSize];

    for (int i = 0; i < matrixSize; i += loopSteps) {
        for (int j = 0; j < matrixSize; j += loopSteps) {
            A[i][j] = B[i][j] = 10.01f;
            C[i][j] = 0.0f;
        }
    }

    std::clock_t time_start = std::clock();

    /* Matrix-Matrix multiply */
    for (int i = 0; i < matrixSize; i++)
        for (int j = 0; j < matrixSize; j++)
            for (int k = 0; k < matrixSize; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];

    std::clock_t time_end = std::clock();

    double result = 0;
    for (int i = 0; i < matrixSize; i++)
        for (int j = 0; j < matrixSize; j++)
            result += C[i][j];

    std::printf("CPU time %lf sec.\n result = %lf\n", (time_end - time_start) / (double)CLOCKS_PER_SEC, result);
}


Results, from 5 runs
   Intel platform
Intel icc 13.0.0   1.08 -  1.93 (avg  1.38) sec (-Ofast -xHost)
Intel icpc 13.0.0  0.98 -  1.29 (avg  1.05) sec (-Ofast -xHost)
GNU gcc 4.7.2     18.73 - 20.12 (avg 19.32) sec (-O3 -march=native)
GNU g++ 4.7.2     18.71 - 20.21 (avg 19.11) sec (-O3 -march=native)
   IBM platform
IBM XL C 11.1      5.67 -  5.74 (avg  5.72) sec (-O5)
IBM XL C++ 11.1    5.58 -  5.76 (avg  5.65) sec (-O5)
GNU gcc 4.7.2   100+ seconds, I got bored there, sorry
Wow! Is GCC really that bad?
That's why we don't use it! (except for its compiler diagnostics)
Thanks, Cubbi, for taking the time to run all those experiments. I've read about the Intel compiler, which is heavily optimized for Intel CPUs (as opposed to AMD), and also about IBM XL, but I have not had a chance to try them. Although your results were great, I actually did not get answers to my questions.

First question
I asked about the difference in MFLOPS and elapsed time for the matrix multiplication between C and C++ under GNU GCC 4.7.2 on Fedora 17, i.e. entirely under the same circumstances, not a comparison of different compilers. As you can see in the first comment, the flop count for C and for C++ when using float A[matrixSize][matrixSize]; is about 6003518464, but the computing time differs, which gives C++ a lower MFLOPS than C. With the other array definitions, as in comment 2, not only does the C++ time increase, but the flop count (flpops) also drops to 2000023680 operations.
So what should I do to make C++ perform like C in terms of time and MFLOPS?
Why does the floating-point operation count drop when using a dynamic (pointer-to-pointer) array?
Isn't a pointer-to-pointer array stored row-major in memory in programming languages like C and C++?

Second question
By mapping a 2D array to 1D, I meant something like int array[width * height]; accessed as array[width * row + col] = value; and I did not mean the system's row-wise layout of arrays in memory. My question is whether computing the index as array[width * row + col] = value; consumes more CPU time than accessing a 2D array as array[row][column].

New question
Where can I read about the situations in which C++ performs faster than C?


Thanks
I asked for difference between MFLOPS and elapsed time for computing Matrix Multiplication in c and c++ under GNU GCC 4.7.2 fedora 17 or totally under the same circumstance and not comparing different compilers.

Right, and I did just that: compared C and C++ entirely under the same circumstances. Then I did it again with a couple more compilers/platforms to show that:
a) the trend is always the same: C++ is slightly faster, not because of C++ vs. C, but because your two programs are different;
b) the compiler choice matters a great deal more.

About mapping 2D array to 1D I meant something like this int array[width * height];

Yes, that's what I meant when I said "It is how matrices and higher-dimensional classes are implemented."