### Slow matrix multiplication

Hi there I was just trying my matrix class to perform a 1000x1000 matrices multiplication, and I have observed that the command C=A*B runs is TREMENDOUSLY slow as compared to Matlab or so, here is the algorithm I am using, which is the easiest one but nevertheless I dont think there should be so much difference right?

 ``1234567891011121314151617181920212223242526272829303132333435363738`` `````` matrix matrix::operator*(const matrix& M){ int n2,m2; n2=M.nrows(); m2=M.ncols(); matrix temp(x,m2); if(n2!=y){ cout<<"Matrices cannot be multiplied (size mismatch) "<

EDIT 1* : I know there are three concatenated loops, but is it that what is really causing the slowness of the calculation?

EDIT 2* : Might also be helpful to include the code for the equal assignment operator and the copy constructor

 ``123456789101112131415161718192021222324252627282930313233343536`` ``````//COPY CONSTRUCTOR matrix::matrix(const matrix& M){ double** copymatrix=new double*[M.nrows()]; for(int i=0;idealloc(); double** new_matrix = this->alloc(a.nrows(), a.ncols()); for(int i = 0; i < a.nrows(); i++) { for(int j = 0; j < a.ncols(); ++j) { new_matrix[i][j] = a.get(i,j); } } this->the_matrix = new_matrix; return *this; } ``````
Last edited on
what is temp.fill doing? Why not temp[some row][some col] = kk?

your = operator should do a single memcpy or similar statement instead of one by one
looping for the actual matrix data (there may be other variables to set like rows and cols etc).

the copy ctor should do that also, possibly reusing the assignment op internally if possible.

I can post my multiply, but its very old code and not really up to modern design or standards, if you want to see it.

couple of tweak ideas..

the fastest I was able to do it was by transposing the second matrix so BOTH matrices index off row major ordering. This adds an up-front cost but the multiply function is significantly faster with this because it greatly reduces the page faults.

you can also short out zeros so that rather than do 1000 multiplies for a zero element, you just skip it.

There is a faster multiply algorithm beyond brute force but it is complex, you can google it.

1000x1000 is not that big, it should be doable in less than a second.
Last edited on
Hi thanks for the reply, here is the function fill for the class matrix

 ``1234`` ``````void matrix::fill(double pp,int row,int col){ the_matrix[row][col]=pp; }``````

because I couldn't think about another way of changing the_matrix[i][j] element from external data... So the "=" operator might be the big problem then or the copy constructor? I don't understand very well what do you mean by making the copies.

About the ideas, thank you for them, they can really make a difference I believe, I guess Matlab is fully implemented with those. However, when I say my code is slow as compared to Matlab is that, a 1000x1000 product of matrices in Matlab costs a time of order 0.04 seconds after allocation, whereas in my code is fairly taking like 16 seconds.... the difference shouldnt be that big I believe, even when more sophisticated algorithms like the one you are suggesting work... It seems to change O(n^{3}) to O(n^{log(7)})

EDIT 1: Ok I just discovered that the problem lies in the matrix multiplication itself. When I perform it without using the "=" operator, that is, only doing A*B but not C=A*B, this consumes a lot of time too, any hints?
Last edited on
that fill function should not be an issue.

16 seconds is too long by a ton.

the alternative algorithm shouldnt be necessary unless you need it VERY fast. Again, I think you should be able to run a 1k by 1k in under a second with the basic brute force shown here. You HAVE to be doing a copy or something inside that is bogging it down.

while I look, can you time ONLY the multiply and then time ONLY an assignment operation?

Edit it has to be that.
Let me guess ... in main you have

R = A*B;

my code was 1) realtime and 2) 1990s era hardware so it did its own memory management and kept a huge pool of memory for temporary intermediate results all the time. So I would have dumped the results into the pool and returned a pointer to it, eliminating the creation of temp and the copy back out of temp to the answer matrix. Terrible design, but it was fast.

Time just
A*B;
R= A*B; That will answer where the problem lies.
Last edited on
Ok when I perform A*B, the total time taken is 15.7 seconds, whereas when I do C=A*B time is 15.95 seconds, I would say the problem really lies within the overloading of the operator *. MAYBE THE IF CONDITION INSIDE THE BODY?

 ``12345678910111213141516171819202122232425262728293031`` ``````//main function #include #include #include #include "matrix.h" #include #include #include #define pi 3.1415926535 #include "EIGENproblem.h" using namespace std; int main(){ matrix A(1000,1000),B(1000,1000),C(1000,1000); A.ones(); B.randomS(); cout<<"done"<

I just discovered the problem lies in the line kk+=(the_matrix[i][k])*(M.get(k,j)) in the overloading function, why can that be?? maybe the brackets and the fact that I am also using the "*" operator inside an overload of the "*" operator (although its for matrices and the other for doubles but apart from that, what else can it be?)

If I comment the line \$kk+=(the_matrix[i][k])*(M.get(k,j)) \$ then the issue disappears, why is that happening?? When I comment that line, time is 2.54 seconds, which is not "fast" I would say, but at least 1/8 less roughly

 Time just A*B; instead of R= A*B; That will answer where the problem lies.

I did, in the first case is 15.7 in the second 15.95.

UPDATE: It is the "get" function what is making the trouble, but seriously I don't know why this could be as get has the following simple form:
 ``123456`` `````` double matrix::get(int row,int column) const{ return the_matrix[row][column]; } ``````
Last edited on
great. Lets look at those.... gimme a few min im multi-tasking.

in the meantime, remove the call to .get and time it again. Directly access M's data instead (I can't recall if you need to make that public for testing purposes only, or if it is already public to you because you are in the matrix class). Whatever it takes to test it for a moment, you can undo it after.

if you comment out that line the routine isnt doing much work :)
That may or may not be the problem --- checking the get function will tell you.

also pay attention to your values. 1000x 1000 is a million, and a million cubed is substantial. A million to a power less than 1 is much less than a million cubed. You can check the O(N) where N = 1 million to see if you think you need the other approach.

also, the obvious 1 row X 1 column shot off to worker threads on an 8+ core cpu would help... you have to watch these, thread creation has a cost, but done correctly it would cut the time linearly by # of CPU.

also if you get frustrated enough, clapack * friends library does it for you, but it also has a moderate learning curve as all the routines expect you to factor, normalize, and do various things before you can do anything. I used it on the eigen stuff and in a few places, but my code was set up to take a .M file and with a few changes it was C++. Very few changes; I had a lot of M files to convert at that time.

Last edited on
I was trying to change the get function by this

 ``12345678`` `````` double matrix::get(int row,int column) const{ this-> the_matrix[row][column]; return *this; }``````

But error message displays saying that cannot convert "matrix const" to "double" in return, isnt the return a double element? I am confused
that is not right.
you want

return this->the_matrix[row][column];

this is a matrix *, the function returns a double, not the same.

I get const syntax mixed up myself. Not sure about the const right this second.

are you compiling with optimizations on??

I just did a dumb triple loop that approximates your multiply with optimization on and off. Difference was < 1 second vs many, many seconds.
Last edited on
What do you mean by optimizing the compiler? The time I am talking about is actual calculation time, not compilation time.
Last edited on
did you remember to compile the code with optimizations on? If you forget it will be in "debug" mode and the executable result is full of junk that makes it slower. In cases like this one, MUCH slower.

EG for g++ that is the O2 or similar options. For visual studio it is flipping to "release build" toggle. Other tools do it differently, but this is critical.

Your answer indicates that this is likely the problem, that and your code looks pretty darn good apart from the little stuff I offered.
Last edited on
since I am waiting on things to happen and stuck looking at my laptop anyway..

transpose is simply:
I personally used the !() operator overload for it.

 ``123456789`` `````` for(k = 0; k < rows; k++) { for(j = 0; j < cols; j++) { result.data[j][k] = internaldata[k][j]; } } ``````

OK, I am trying to translate the stuff I had which had <complex> data type and conditions for that coupled with single dimensional data storage with manual indexing into something you can read. I HOPE I did not mangle anything in the process. I know my original works.

The multiply is, in "pseudoC"

 ``123456789101112131415161718192021`` ``````operator *(matrix b) set up result matrix; matrix at = transpose(this); //remember, I didnt copy anything but a pointer here due to the // internal memory management. This is a little more costly for your code. for(brows = 0; brows < b.rows; brows++) { for(bcols = 0; bcols < b.cols; bcols++) { con = b.data[brows][bcols]; if(con) //is not zero... { //... grumble... the version I found of it does not actually use the transpose correctly. the idea is that whatever is in the inner loop is accessing both the result and the AT matrix by iterating the column dimension while the rows are static. looks something like result[something ] += at[something]*con but I am having trouble re-creating it tonight. } return result ``````

Hopefully that makes sense to explain the transpose & skip zero ideas.
Last edited on
> I just discovered the problem lies in the line kk+=(the_matrix[i][k])*(M.get(k,j)) ... why can that be??

(For large matrices) "every iteration of the inner loop incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n3) cache misses in the worst case. ... the cache misses, rather than the actual calculations, dominate the running time for sizable matrices."
https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Cache_behavior

With Eigen, this multiplication (matrices of 2 million elements) takes about 300 milliseconds on a typical machine.
http://eigen.tuxfamily.org/index.php?title=Main_Page#Overview

 ``12345678910111213141516171819202122232425262728`` ``````#include #include #include int main() { // dynamically allocated matrix containing values of type double using matrix_type = Eigen::MatrixXd ; const auto start = std::clock() ; constexpr std::size_t N = 1'000 ; constexpr std::size_t M = 2'000 ; // create a and b with random finite double values const matrix_type a = matrix_type::Random( N, M ) ; // 1000 x 2000 const matrix_type b = matrix_type::Random( M, N ) ; // 2000 x 1000 const auto mid = std::clock() ; const matrix_type c = a * b ; // multiply const auto finish = std::clock() ; std::cout << "matrix creation: " << (mid-start) * 1000.0 / CLOCKS_PER_SEC << " millisecs\n" << " multiplication: " << (finish-mid) * 1000.0 / CLOCKS_PER_SEC << " millisecs\n" ; return c(0,0) == 0 ; }``````

MATLAB (on intel processors) should be faster; IIRC, it internally use Intel's MKL.
that page fault hit is exactly what the transpose fixes.

Back when, matlab was slower. Probably because it was doing correct numerics (normalize and probably L/U). At some point around the time multi core cpus came out it started to really smoke the brute force C. Most likely whatever is under the hood is multi-threaded, using a better algorithm (this is, after all, brute force), using correct memory access, and so on. If there is any ONE routine in LA that is examined and tweaked to the max, its the multiply.

We were doing control theory stuff with stable well conditioned values so we could skip the 'general purpose' stability checks and adjustments.

Last edited on
> that page fault hit is exactly what the transpose fixes.

Even after the transpose, there may be many cache misses for large matrices.
The high performance matrix multiplication algorithms typically use tiling or block partitioning
(in conjunction with expression templates in C++).
UPDATE: OK guys, thank you all for your responses and become interested in the poblem. I have set the compiler options to optimization level -Ox to the fastest, and the elapsed time of the single product A*B, for 1000x1000 matrices is 10.6 seconds, as compared to the ~18 seconds we had initially. Although some improvement has been done, I still think is a considerable amount of time of calculation.

 (For large matrices) "every iteration of the inner loop incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n3) cache misses in the worst case. ... the cache misses, rather than the actual calculations, dominate the running time for sizable matrices." https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Cache_behavior

Is there a way to make this better in some way? Of course I could try the easy path, take Eigen, learn all declarations and run my program... but that 's not where the challenge is! Also I want to understand what makes such a big difference between creating your own matrix class with operations and those used by programs like MATLAB.
Last edited on
 ```With: g++ mat.cpp Time taken: 16.2400 seconds```

 ```With: g++ -O mat.cpp Time taken: 1.0770 seconds```

Hmm: time to turn the optimiser on

 ``1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495`` ``````#include #include #include #include #include #include #include using namespace std; using vec = vector; using matrix = vector; // Function prototypes void print( const matrix &A ); void randfill( matrix &M ); void matmul( const matrix &A, const matrix &B, matrix &C ); //====================================================================== int main() { const int N = 1000; matrix A( N, vec( N ) ), B = A, C = A; srand( time( 0 ) ); randfill( A ); randfill( B ); clock_t tstart = clock(); matmul( A, B, C ); double t = ( double ) ( clock() - tstart ) / CLOCKS_PER_SEC; cout << "Time taken: " << fixed << setprecision( 4 ) << t << " seconds\n"; // print( A ); print( B ); print( C ); // Don't use when N is large! } //====================================================================== void print( const matrix &A ) { for ( auto row : A ) { for ( auto col : row ) cout << col << '\t'; cout << '\n'; } cout << '\n'; } //====================================================================== void randfill( matrix &M ) // not important: quick and dirty fill { int rows = M.size(); int cols = M[0].size(); for ( int i = 0; i < rows; i++ ) { for ( int j = 0; j < cols; j++ ) M[i][j] = rand() % 10; } } //====================================================================== void matmul( const matrix &A, const matrix &B, matrix &C ) { int rowsA = A.size(); int colsA = A[0].size(); int rowsB = B.size(); assert( colsA == rowsB ); int colsB = B[0].size(); // Multiply: Cij = Aik Bkj (implied summation convention) // = Aik BTjk // = inner product of A (ith row) and (B transpose)(jth row) // Precompute transpose matrix BT( colsB, vec( rowsB ) ); for ( int i = 0; i < colsB; i++ ) { for ( int j = 0; j < rowsB; j++ ) BT[i][j] = B[j][i]; } // Set of inner products for ( int i = 0; i < rowsA; i++ ) { for ( int j = 0; j < colsB; j++ ) C[i][j] = inner_product( A[i].begin(), A[i].end(), BT[j].begin(), 0.0 ); } }``````

Well I have all the matrix class defined in a different file .h , and the body functions are in another file with .cpp extension, as different to the code you are posting above (you define all in a single file where the main function is), can this also make a difference? Also, I thought I have turned the optimiser on because there has been a decreasing from ~17-18 seconds to 10 seconds or so, but still cannot reach the seconds scale!!
I suspect - but haven't time to investigate - that the use of the highly-optimised std::inner_product makes a huge difference. Pre-computing the transpose allows one contiguously stored row vector to be inner-producted with another, which should help as well (although it demands some resources in both time and memory to set up).

Once you have linked the object code then the fact that things started in different files is irrelevant. I can't see repeated application of M.get() for single elements being very helpful, though. You are within a member function of the class, so it shouldn't be necessary.
Last edited on
> Is there a way to make this better in some way?

Yes; transposing and then multiplying would make it faster
(with 1000x1000 matrices, g++ -O3 on coliru, a shade over twice as fast).

 ``123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596`` ``````#include #include #include #include template < typename T > struct matrix { matrix( std::size_t nrows, std::size_t ncols ) : mat( nrows, std::vector(ncols) ) {} std::size_t nrows() const { return mat.size() ; } std::size_t ncols() const { return mat.empty() ? 0 : mat.front().size() ; } std::vector& operator[] ( std::size_t row_num ) { return mat[row_num] ; } const std::vector& operator[] ( std::size_t row_num ) const { return mat[row_num] ; } auto begin() { return mat.begin() ; } auto begin() const { return mat.begin() ; } auto end() { return mat.end() ; } auto end() const { return mat.end() ; } auto cbegin() const { return mat.begin() ; } auto cend() const { return mat.end() ; } auto rbegin() { return mat.rbegin() ; } auto rbegin() const { return mat.rbegin() ; } auto rend() { return mat.rend() ; } auto rend() const { return mat.rend() ; } auto crbegin() const { return mat.rbegin() ; } auto crend() const { return mat.rend() ; } private: std::vector< std::vector > mat ; }; template < typename T > matrix transpose( const matrix& mtx ) { matrix result( mtx.ncols(), mtx.nrows() ) ; for( std::size_t r = 0 ; r < mtx.nrows() ; ++r ) for( std::size_t c = 0 ; c < mtx.ncols() ; ++c ) result[c][r] = mtx[r][c] ; return result ; } #ifdef TRANSPOSE_AND_MULTIPLY template < typename T > matrix operator* ( const matrix& a, const matrix& b ) { // TO DO: validate sizes matrix result( a.nrows(), b.ncols() ) ; matrix bt = transpose(b) ; for( std::size_t i = 0 ; i < a.nrows(); ++i ) for( std::size_t j = 0; j < b.ncols() ; ++j ) for( std::size_t k = 0; k < a.ncols() ; ++k ) result[i][j] += a[i][k] * bt[j][k] ; return result ; } #else template < typename T > matrix operator* ( const matrix& a, const matrix& b ) { // TO DO: validate sizes matrix result( a.nrows(), b.ncols() ) ; for( std::size_t i = 0; i < a.nrows() ; ++i ) for( std::size_t j = 0; j < b.ncols() ; ++j) for( std::size_t k = 0; k < a.ncols() ; ++k) result[i][j] += a[i][k] * b[k][j]; return result ; } #endif // TRANSPOSE_AND_MULTIPLY int main() { const std::size_t N = 1000, M = 1000 ; std::mt19937 rng( std::random_device{}() ) ; std::uniform_real_distribution distr( -10.0, +10.0 ) ; matrix a( N, M ) ; for( auto& row : a ) for( auto& v : row ) v = distr(rng) ; matrix b( M, N ) ; for( auto& row : b ) for( auto& v : row ) v = distr(rng) ; const auto start = std::clock() ; auto c = a * b ; const auto fin = std::clock() ; std::cout << (fin-start) * 1000.0 / CLOCKS_PER_SEC << " millisecs\n" ; }``````

 ```==================== without TRANSPOSE_AND_MULTIPLY ==================== 4232.37 millisecs```

http://coliru.stacked-crooked.com/a/a4575a5aebbe0705

 ```======================= with TRANSPOSE_AND_MULTIPLY ==================== 1907.64 millisecs```

http://coliru.stacked-crooked.com/a/a61b9429da69a8d9

You may want to have a look at this: much of the code is "C compiled with a C++ compiler"), but it is quite easy to understand:
https://github.com/attractivechaos/matmul

We can see from the results (caveat: obsolete GNU compiler; though this is typical mainstream amongst the Linux crowd) that Eigen is very fast. Eigen is just a set of headers (it is copyleft, but the license is a lot saner than LGPL); have a look at its source code if you are comfortable with expression templates (C++98 would do).
If you just want results, that eigen looks like a very good option. You have 3-4 other options ... one of the lapack family (I used the intel math kernel version), you can also tell matlab itself to generate an executable and you MAY be able to beat a C or CPP file out of it, I forget if they got that working properly, or one of the other libraries mentioned (some of this stuff is pay to win, but time is money).

Writing your own can be fun, and the code for basic matrix stuff is not overwhelming at first, but you are re-inventing the wheel and the pro library guys have a tire factory while you have a rock, a hammer, and a chisel. This is a place to "buy not build".

If you insist on writing your own, one last thing on top of the above. You can try valarray instead and use the slice to generate the inner product chunks. Multi-thread that. You might even beat matlab's speed if you decide to dig into the problem. Not sure if the valarray is any better or not, its just something to try. Multi-threaded should tear it apart though.
Last edited on
Topic archived. No new replies allowed.