
I need to compute the cross-product of a huge matrix X and a vector y. X has n = 2500 rows and p = 1,000,000 columns, and y is a 2500-by-1 vector. Since the file storing X is far too large to load into RAM, I split it into 10 chunks, each with 2500 rows and chunk_cols = 100,000 columns. I then read the data into memory chunk by chunk and compute the crossprod column by column.

I should also mention that the X file is preprocessed into a binary file that stores the matrix column by column as one long vector of doubles. So in the binary file, the first (double) element is X[0,0], the second is X[1,0], and so on.

I tried to parallelize the crossprod part with OpenMP, hoping to speed up the computation. But it turns out the parallel version is much slower, by about 50 seconds.

4 cores:

Unit: seconds
                                                                expr      min       lq
 res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 4) 122.5716 122.5716
     mean   median       uq      max neval
 122.5716 122.5716 122.5716 122.5716     1

1 core:

Unit: seconds
                                                                expr      min       lq
 res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 72.56355 72.56355
     mean   median       uq      max neval
 72.56355 72.56355 72.56355 72.56355     1

Below is my code.

#include <Rcpp.h>
#include <fstream>
#include <omp.h>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
NumericVector crossprod_cpp(SEXP filename, int chunk_cols, const std::vector<double>& y, 
                            int n, int p, int useCores) {
  NumericVector result(p);
  unsigned long int chunk_size = chunk_cols * sizeof(double) * n;
  const char *xfname = CHAR(Rf_asChar(filename));
  ifstream xfile(xfname, ios::in|ios::binary);

  int i, j;

  if (xfile.is_open()) {
    streampos size_x;
    char *memblock_x;
    int chunk_i = 0;
    int chunks = 0;
    int col_pos = 0;

    xfile.seekg (0, ios::end);
    size_x = xfile.tellg();
    xfile.seekg (0, ios::beg);
    chunks = size_x / chunk_size;
    double *X;

    omp_set_dynamic(0);
    omp_set_num_threads(useCores);

    memblock_x = (char *) calloc(chunk_size, sizeof(char));
    for(chunk_i = 0; chunk_i < chunks; chunk_i++) {
      col_pos = chunk_i * chunk_cols; // current column position;
      xfile.seekg (chunk_i * chunk_size, ios::beg);
      xfile.read (memblock_x, chunk_size);

      size_t count = xfile.gcount();
      if (!count) {
        Rprintf("\n\t Error in reading chunk %d\n", chunk_i);
        break;
      } 
      X = (double*) memblock_x;

      // loop over loaded columns and do crossprod
      #pragma omp parallel for schedule(dynamic) 
      for (j = 0; j < chunk_cols; j++) {
        for (i = 0; i < n; i++) {
          result[col_pos + j] += X[j*n+i] * y[i];
        }
      }
    }
    free(memblock_x);
    xfile.close();

  } else {
    Rprintf("Open file failed! filename = %s, chunk_size = %lu\n", xfname, chunk_size);
  }

  return result;
}

I am working on Mac OS X with the clang-omp++ compiler.

The compilation commands for the code are below.

/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB -o 'sourceCpp_3.so' --preclean  'test_XTy.cpp'  
clang-omp++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include  -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/bigmemory/include" -I"/Users/yazeng/GitHub"   -fopenmp -std=c++11 -fPIC  -Wall -mtune=core2 -g -O2 -c test_XTy.cpp -o test_XTy.o
clang-omp++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o sourceCpp_3.so test_XTy.o -fopenmp -lgomp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

The parallel part, repeated below, is very simple. I really appreciate any suggestions. Thank you very much in advance!

// loop over loaded columns and do crossprod
#pragma omp parallel for schedule(dynamic) 
for (j = 0; j < chunk_cols; j++) {
   for (i = 0; i < n; i++) {
      result[col_pos + j] += X[j*n+i] * y[i];
   }
}

EDIT:

I have fixed the parallelized cross-product part. However, it turns out that the time-consuming part is, unsurprisingly, the file reading. Below is the timing of chunk reading and cross-product computation with 1 core (serial code).

Chunk 0, read start: Now time: 2016-04-14 14:55:19.000
Chunk 0, read end: Now time: 2016-04-14 14:55:27.000
Crossprod start: Now time: 2016-04-14 14:55:27.000
Crossprod end: Now time: 2016-04-14 14:55:27.000
Chunk 1, read start: Now time: 2016-04-14 14:55:27.000
Chunk 1, read end: Now time: 2016-04-14 14:55:33.000
Crossprod start: Now time: 2016-04-14 14:55:33.000
Crossprod end: Now time: 2016-04-14 14:55:34.000
Chunk 2, read start: Now time: 2016-04-14 14:55:34.000
Chunk 2, read end: Now time: 2016-04-14 14:55:40.000
Crossprod start: Now time: 2016-04-14 14:55:40.000
Crossprod end: Now time: 2016-04-14 14:55:40.000
Chunk 3, read start: Now time: 2016-04-14 14:55:40.000
Chunk 3, read end: Now time: 2016-04-14 14:55:47.000
Crossprod start: Now time: 2016-04-14 14:55:47.000
Crossprod end: Now time: 2016-04-14 14:55:47.000
Chunk 4, read start: Now time: 2016-04-14 14:55:47.000
Chunk 4, read end: Now time: 2016-04-14 14:55:53.000
Crossprod start: Now time: 2016-04-14 14:55:53.000
Crossprod end: Now time: 2016-04-14 14:55:53.000
Chunk 5, read start: Now time: 2016-04-14 14:55:53.000
Chunk 5, read end: Now time: 2016-04-14 14:56:00.000
Crossprod start: Now time: 2016-04-14 14:56:00.000
Crossprod end: Now time: 2016-04-14 14:56:00.000
Chunk 6, read start: Now time: 2016-04-14 14:56:00.000
Chunk 6, read end: Now time: 2016-04-14 14:56:06.000
Crossprod start: Now time: 2016-04-14 14:56:06.000
Crossprod end: Now time: 2016-04-14 14:56:07.000
Chunk 7, read start: Now time: 2016-04-14 14:56:07.000
Chunk 7, read end: Now time: 2016-04-14 14:56:13.000
Crossprod start: Now time: 2016-04-14 14:56:13.000
Crossprod end: Now time: 2016-04-14 14:56:13.000
Chunk 8, read start: Now time: 2016-04-14 14:56:13.000
Chunk 8, read end: Now time: 2016-04-14 14:56:20.000
Crossprod start: Now time: 2016-04-14 14:56:20.000
Crossprod end: Now time: 2016-04-14 14:56:20.000
Chunk 9, read start: Now time: 2016-04-14 14:56:20.000
Chunk 9, read end: Now time: 2016-04-14 14:56:26.000
Crossprod start: Now time: 2016-04-14 14:56:26.000
Crossprod end: Now time: 2016-04-14 14:56:27.000
> print(bench.cpp)
Unit: seconds
                                                                 expr      min       lq
 res.cpp1 <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 67.51534 67.51534
     mean   median       uq      max neval
 67.51534 67.51534 67.51534 67.51534     1

As can be seen, each chunk read takes over 6 seconds (each chunk is 2500 × 100,000 × 8 bytes ≈ 2 GB, i.e. roughly 300 MB/s), while the computation takes at most 1 second even with 1 core. So parallelizing the crossprod part alone cannot speed things up much.

So my question then is: how can I also parallelize the file reading?

I learned from here and here, though, that file I/O is essentially sequential, since the kernel has to bring in the disk file sequentially anyway. So I am wondering whether there is any smart way to parallelize file reading. Would MPI work in this case? Or are there other tips to speed up this function?

Thank you very much!

END EDIT

EDIT 2: Would GPU parallel computing work here?

I am new to GPU computing, but thinking out loud: would GPU computing with careful programming bring any possible speedup here? My guess is no, since disk reading is the bottleneck, but I would really like to know whether it could help.

END EDIT 2

SixSigma
  • Change that `schedule(dynamic)` to `private(i) schedule(static)`. – Hristo Iliev Apr 14 '16 at 17:24
  • Thanks for the suggestion! So why ``private(i)`` but not ``private(j)``? I have the impression that parallelizing the outer loop should be better; is that correct? – SixSigma Apr 14 '16 at 17:46
  • Hristo means to say that you should still be parallelizing the outer loop, but both `i` and `j` must be made `private`. OpenMP sets the loop variables being parallelized to be private automatically, so you only have to manually specify that the inner loop iterator is private. There's no harm in being explicit with `private(j,i)`. – NoseKnowsAll Apr 14 '16 at 19:56
  • Thanks, @NoseKnowsAll. I figured out the ``openmp`` part. But it turns out that doesn't speed up much because essentially the chunk-reading is the bottleneck, as described in my edit. I am now wondering whether there is a way to parallelize chunk-reading as well. – SixSigma Apr 14 '16 at 20:31
  • Indeed File IO is inherently sequential. Yes you could speed this up with MPI, but that is only speeding it up because you are using multiple machines, where each machine is working on their own sequential File IO on different parts of the file. – NoseKnowsAll Apr 14 '16 at 21:14
  • That's why the usual workflow is to profile the code with a tool like `gprof` or Score-P before trying to make it parallel. – Hristo Iliev Apr 14 '16 at 21:24
  • @NoseKnowsAll, thanks for the comment. That makes perfect sense to me. – SixSigma Apr 15 '16 at 02:05
  • @HristoIliev, you are definitely right. And it's good to know about tools like ``gprof``. Thanks for pointing it out. – SixSigma Apr 15 '16 at 02:06
  • @NoseKnowsAll, oddly enough, unlike in C and C++, inner loop variables in Fortran are also predetermined as private. – Hristo Iliev Apr 15 '16 at 06:49
  • You can run a pipeline with two stages, where Stage #1 uses a dedicated thread that fetches data into memory while Stage #2 is processing the data. It will, however, require allocating 2 large memory buffers instead of one. – alexm Apr 15 '16 at 17:20
  • @alexm thank you for the suggestion. Could you be a little more specific? I am not quite clear on how that would differ from chunk reading and speed things up. – SixSigma Apr 15 '16 at 17:25
  • Currently you have a sequence: "read_chunk_from_file" then "compute". With a pipeline, reading and computing can run concurrently, but on different chunks. – alexm Apr 15 '16 at 17:29
  • Here is a description of Pipeline Pattern: https://msdn.microsoft.com/en-us/library/ff963548.aspx – alexm Apr 15 '16 at 17:36
  • @alexm this makes sense. But since the "computing" time is negligible compared to "chunk_read", I think the "computing" thread will be mostly idle, waiting for data from "chunk_read" thread. So that will be essentially equivalent to my code here. Am I missing anything? – SixSigma Apr 15 '16 at 17:38
  • Pipeline will allow focusing on the bottleneck (reading from a single file in your case) keeping the computational part intact. For example if there are multiple hard drives on the system you can store multiple files on different drives and have two threads readings two files concurrently... this will give you x2 speedup – alexm Apr 15 '16 at 17:46
  • @alexm Yes, you are absolutely right in that case. Actually, I think we could even read chunks in parallel. But my case is just an ordinary computer, so I cannot have parallel file IO. Thanks a lot for your comments anyway! – SixSigma Apr 15 '16 at 17:53
  • Depending on the contents of the matrices, data compression may help reduce the file size and thus speed up disk reading. Another approach is to store the matrices in a sparse format. – user0815 Apr 15 '16 at 19:21

0 Answers