Make Sparse Matrix Multiply Fast

Question

The code is written using C++11. Each Process got tow Matrix Data(Sparse). The test data can be downloaded from enter link description here

Test data contains 2 file : a0 (Sparse Matrix 0) and a1 (Sparse Matrix 1). Each line in file is "i j v", means the sparse matrix Row i, Column j has the value v. i,j,v are all integers.

Use c++11 unordered_map as the sparse matrix's data structure.

unordered_map<int, unordered_map<int, double> > matrix1 ;
matrix1[i][j] = v ; //means at row i column j of matrix1 is value v;

The following code took about 2 minutes. The compile command is g++ -O2 -std=c++11 ./matmult.cpp.

g++ version is 4.8.1, Opensuse 13.1. My computer's info : Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 4G memory.

#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>

using namespace std;

void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
  ifstream input ;
  input.open(fn);
  int i, j ; double v;
  while (input >> i >> j >> v)  {
    m[i][j] = v;
  }
}

unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
//vector<vector<int> > keys(BLK_SIZE);

int main() {
  load("./a0",m1);
  load("./a1",m2);

  for (auto r1 : m1) {
    for (auto r2 : m2) {
      double sim = 0.0 ;
      for (auto c1 : r1.second) {
        auto f = r2.second.find(c1.first);
        if (f != r2.second.end()) {
           sim += (f->second) * (c1.second) ;
        }
      }
   }
  }
  return 0;
}

The code above is too slow. How can I make it run faster? I use multithread. The new code is following, compile command is g++ -O2 -std=c++11 -pthread ./test.cpp. And it took about 1 minute. I want it to be faster.

How Can I make the task faster? Thank you!

#include <iostream>
#include <fstream>
#include <unordered_map>
#include <vector>
#include <thread>

#define BLK_SIZE 8

using namespace std;

void load(string fn, unordered_map<int,unordered_map<int, double> > &m) {
  ifstream input ;
  input.open(fn);
  int i, j ; double v;
  while (input >> i >> j >> v)  {
    m[i][j] = v;
  }
}

unordered_map<int,unordered_map<int, double> > m1;
unordered_map<int,unordered_map<int, double> > m2;
vector<vector<int> > keys(BLK_SIZE);

void thread_sim(int blk_id) {
  for (auto row1_id : keys[blk_id]) {
    auto r1 = m1[row1_id];
    for (auto r2p : m2) {
      double sim = 0.0;
      for (auto col1 : r1) {
        auto f = r2p.second.find(col1.first);
        if (f != r2p.second.end()) {
          sim += (f->second) * col1.second ;
        }
      }
    }
  }
}

int main() {

  load("./a0",m1);
  load("./a1",m2);

  int df = BLK_SIZE - (m1.size() % BLK_SIZE);
  int blk_rows = (m1.size() + df) / (BLK_SIZE - 1);
  int curr_thread_id  = 0;
  int index = 0;
  for (auto k : m1) {
    keys[curr_thread_id].push_back(k.first);
    index++;
    if (index==blk_rows) {
      index = 0;
      curr_thread_id++;
    }
  }
  cout << "ok" << endl;
  std::thread t[BLK_SIZE];
  for (int i = 0 ; i < BLK_SIZE ; ++i){
    t[i] = std::thread(thread_sim,i);
  }
  for (int i = 0; i< BLK_SIZE; ++i)
    t[i].join();

  return 0 ;
}

It seems that you have answer your own question, how mush faster/slower is the second than first? Didi you tested it? — Baj Mile, Feb 07 '15 at 09:42
@BajMile The first one took more than 2minutes, the second code took about 1minutes. I want it to be faster. — kainwen, Feb 07 '15 at 09:45
Well, you copy a lot in your loops because of `auto`. Make them `auto const&` so you don't waste all that time. Faster io could not hurt. And actually time the components, determine what component is taking time, so you can foxus on making that paet faster. Oh, and because half the people who post here forget, tell compiler to optimize. — Yakk - Adam Nevraumont, Feb 07 '15 at 10:33
see: http://stackoverflow.com/questions/15693584/fast-sparse-matrix-multiplication,try with different containers and not using auto, if you need even faster I believe that there is even faster than CSR and CSC but I need to make tests — Baj Mile, Feb 07 '15 at 11:08
@Yakk Follow your advice, using -O3 flag and auto &, now the second code took about 16seconds on my notebook. Seems really fast. Thank you! — kainwen, Feb 07 '15 at 11:09
@BajMile Thank you. For some other reasons, Here the matrix type must be nested unordered_map. Otherwise, I will use eigen3 sparsematrix lib. — kainwen, Feb 07 '15 at 11:15
I am just curious, if you replace - unordered_map > m1; with unordered_map and insert in the two maps the would if will be faster? You will spare one O(log(N)) in this way. — Baj Mile, Feb 07 '15 at 20:11

score 1 · Answer 1 · answered Feb 07 '15 at 10:54

1

Most times when working with sparse matrices one uses more efficient representations than the nested maps you have. Typical choices are Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC). See https://en.wikipedia.org/wiki/Sparse_matrix for details.

answered Feb 07 '15 at 10:54

janneb

36,249
2
81
97

For some other reasons, here the MATRIX TYPE must be nested unordered_map. Otherwise, I will use eigen3. – kainwen Feb 07 '15 at 11:10

score 0 · Answer 2 · answered Feb 07 '15 at 11:17

You haven't specified the time you expect your example to run in or the platform you hope to run on. These are important design contraints in this example.

There are several areas that I can think of for improving the efficeny of this:-

Improve the way the data is stored
Improve the multithreading
Improve the algorithm

The first point is geared toward the way the system stores the sparse arrays and the interfaces to enable the data to be read. Nested unordered_maps are a good option when speed isn't important but there may be more specific data structures available that are geared toward this problem. At best you may find a library that provides a better way to store the data than nested maps, at worst you may have to come up with something yourself.

The second point refers to the way the multithreading is supported in the language. The original spec for the multithreading system were meant to be platform independant and might miss out handy features some systems might have. Decide what system you want to target and use the OSs threading system. You'll have more control over the way the threading works, possibly reduce the overhead but will lose out on the cross platform support.

The third point will take a bit of work. Is the way you're multiplying the matricies really the most efficent way given the nature of the data. I'm no expert on these things but it is something to consider but it will take a bit of effort.

Lastly, you can always be very specific about the platform you're running on and head into the world of assembly programming. Modern CPUs are complicated beasts. They can sometimes perform operations in parallel. For example, you may be able to do SIMD operations or do parallel integer and floating point operations. Doing this does require a deep understanding of what's going on and there are useful tools to help you out. Intel did have a tool called VTune (it may be something else now) that would analyse code and highlight potential bottlenecks. Ultimately, you'll be wanting to eliminate areas of the algorithm where the CPU is idle waiting for something to happen (like waiting for data from RAM) either by finding something else for the CPU to do or improving the algorithm (or both).

Ultimately, in order to improve the overall speed, you'll need to know what is slowing it down. This generally means knowing how to analyse your code and understand the results. Profilers are the general tool for this but there are platform specific tools available as well.

I know this isn't quite what you want but making code fast is really hard and very time consuming.

Thank you. Follow @Yakk's advice, use auto & and g++ -O3, the program runs almost 3 times faster. (Mainly because of auto &) — kainwen, Feb 07 '15 at 11:28

Make Sparse Matrix Multiply Fast

2 Answers2