I need to compute the cross-product of a huge matrix X and a vector y, i.e. t(X) %*% y. X has n = 2500 rows and p = 1,000,000 columns, and y is an n-by-1 (2500-by-1) vector. Since the X file is about 20 GB and cannot be loaded into RAM, I divided it into 10 chunks, each with 2500 rows and chunk_cols = 100,000 columns. I then read the data chunk-by-chunk into memory and do the crossprod column-by-column.
I should also mention that the X file is preprocessed into a binary file, which stores the matrix column-by-column as one long vector of doubles. So in the binary file, the first (double) element is X[0,0], the second is X[1,0], and so on.
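Given that layout, the byte offset of any element in the file is easy to compute; here is a minimal sketch (my own illustrative helper, not part of the code below):

```cpp
#include <cstddef>

// Column-major layout: column j occupies bytes [j*n*8, (j+1)*n*8), so
// element X[i,j] (0-based) starts at byte (j*n + i) * sizeof(double).
std::size_t col_major_offset(std::size_t i, std::size_t j, std::size_t n) {
    return (j * n + i) * sizeof(double);
}
```

For example, with n = 2500 rows, X[0,1] (the first element of the second column) starts at byte 2500 * 8 = 20000.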
I tried to parallelize the crossprod part with OpenMP, hoping to speed up the computation. But it turns out that the parallel version is much worse: about 50 seconds slower.
4 cores:
Unit: seconds
expr min lq
res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 4) 122.5716 122.5716
mean median uq max neval
122.5716 122.5716 122.5716 122.5716 1
1 core:
Unit: seconds
expr min lq
res.cpp <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 72.56355 72.56355
mean median uq max neval
72.56355 72.56355 72.56355 72.56355 1
Below is my code.
// [[Rcpp::export]]
NumericVector crossprod_cpp(SEXP filename, int chunk_cols, const std::vector<double>& y,
                            int n, int p, int useCores) {
  NumericVector result(p);
  unsigned long int chunk_size = chunk_cols * sizeof(double) * n;
  const char *xfname = CHAR(Rf_asChar(filename));
  ifstream xfile(xfname, ios::in | ios::binary);
  int i, j;
  if (xfile.is_open()) {
    streampos size_x;
    char *memblock_x;
    int chunk_i = 0;
    int chunks = 0;
    int col_pos = 0;
    xfile.seekg(0, ios::end);
    size_x = xfile.tellg();
    xfile.seekg(0, ios::beg);
    chunks = size_x / chunk_size;
    double *X;
    omp_set_dynamic(0);
    omp_set_num_threads(useCores);
    memblock_x = (char *) calloc(chunk_size, sizeof(char));
    for (chunk_i = 0; chunk_i < chunks; chunk_i++) {
      col_pos = chunk_i * chunk_cols; // current column position
      xfile.seekg(chunk_i * chunk_size, ios::beg);
      xfile.read(memblock_x, chunk_size);
      size_t count = xfile.gcount();
      if (!count) {
        Rprintf("\n\t Error in reading chunk %d\n", chunk_i);
        break;
      }
      X = (double*) memblock_x;
      // loop over loaded columns and do crossprod;
      // i must be private -- sharing the inner index across threads is a data race
      #pragma omp parallel for private(i) schedule(dynamic)
      for (j = 0; j < chunk_cols; j++) {
        for (i = 0; i < n; i++) {
          result[col_pos + j] += X[j*n + i] * y[i];
        }
      }
    }
    free(memblock_x);
    xfile.close();
  } else {
    Rprintf("Open file failed! filename = %s, chunk_size = %lu\n", xfname, chunk_size);
  }
  return result;
}
I am working on Mac OS X with the clang-omp++ compiler. The compilation information for the code is below.
/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB -o 'sourceCpp_3.so' --preclean 'test_XTy.cpp'
clang-omp++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/bigmemory/include" -I"/Users/yazeng/GitHub" -fopenmp -std=c++11 -fPIC -Wall -mtune=core2 -g -O2 -c test_XTy.cpp -o test_XTy.o
clang-omp++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o sourceCpp_3.so test_XTy.o -fopenmp -lgomp -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
The parallel part is very simple, as shown below. I really appreciate any suggestions. Thank you very much in advance!
// loop over loaded columns and do crossprod;
// i must be private -- sharing the inner index across threads is a data race
#pragma omp parallel for private(i) schedule(dynamic)
for (j = 0; j < chunk_cols; j++) {
  for (i = 0; i < n; i++) {
    result[col_pos + j] += X[j*n + i] * y[i];
  }
}
EDIT:
I have figured out the parallelized cross-product part. However, it turns out that the time-consuming part is the file reading (no surprise). Below is the timing for chunk-reading and cross-product computation with 1 core (serial code).
Chunk 0, read start: Now time: 2016-04-14 14:55:19.000
Chunk 0, read end: Now time: 2016-04-14 14:55:27.000
Crossprod start: Now time: 2016-04-14 14:55:27.000
Crossprod end: Now time: 2016-04-14 14:55:27.000
Chunk 1, read start: Now time: 2016-04-14 14:55:27.000
Chunk 1, read end: Now time: 2016-04-14 14:55:33.000
Crossprod start: Now time: 2016-04-14 14:55:33.000
Crossprod end: Now time: 2016-04-14 14:55:34.000
Chunk 2, read start: Now time: 2016-04-14 14:55:34.000
Chunk 2, read end: Now time: 2016-04-14 14:55:40.000
Crossprod start: Now time: 2016-04-14 14:55:40.000
Crossprod end: Now time: 2016-04-14 14:55:40.000
Chunk 3, read start: Now time: 2016-04-14 14:55:40.000
Chunk 3, read end: Now time: 2016-04-14 14:55:47.000
Crossprod start: Now time: 2016-04-14 14:55:47.000
Crossprod end: Now time: 2016-04-14 14:55:47.000
Chunk 4, read start: Now time: 2016-04-14 14:55:47.000
Chunk 4, read end: Now time: 2016-04-14 14:55:53.000
Crossprod start: Now time: 2016-04-14 14:55:53.000
Crossprod end: Now time: 2016-04-14 14:55:53.000
Chunk 5, read start: Now time: 2016-04-14 14:55:53.000
Chunk 5, read end: Now time: 2016-04-14 14:56:00.000
Crossprod start: Now time: 2016-04-14 14:56:00.000
Crossprod end: Now time: 2016-04-14 14:56:00.000
Chunk 6, read start: Now time: 2016-04-14 14:56:00.000
Chunk 6, read end: Now time: 2016-04-14 14:56:06.000
Crossprod start: Now time: 2016-04-14 14:56:06.000
Crossprod end: Now time: 2016-04-14 14:56:07.000
Chunk 7, read start: Now time: 2016-04-14 14:56:07.000
Chunk 7, read end: Now time: 2016-04-14 14:56:13.000
Crossprod start: Now time: 2016-04-14 14:56:13.000
Crossprod end: Now time: 2016-04-14 14:56:13.000
Chunk 8, read start: Now time: 2016-04-14 14:56:13.000
Chunk 8, read end: Now time: 2016-04-14 14:56:20.000
Crossprod start: Now time: 2016-04-14 14:56:20.000
Crossprod end: Now time: 2016-04-14 14:56:20.000
Chunk 9, read start: Now time: 2016-04-14 14:56:20.000
Chunk 9, read end: Now time: 2016-04-14 14:56:26.000
Crossprod start: Now time: 2016-04-14 14:56:26.000
Crossprod end: Now time: 2016-04-14 14:56:27.000
> print(bench.cpp)
Unit: seconds
expr min lq
res.cpp1 <- crossprod_cpp(xfname, chunk_cols, y, n, p, useCores = 1) 67.51534 67.51534
mean median uq max neval
67.51534 67.51534 67.51534 67.51534 1
As can be seen, each chunk read takes over 6 seconds (each chunk is ~2 GB), while the computation takes at most 1 second, even with 1 core. So parallelizing the crossprod part alone will not speed things up.
So my question is then: how can the file reading also be parallelized?
I learned from here and here, though, that file I/O is essentially sequential, since the kernel has to bring the disk file in sequentially anyway. So I am wondering whether there is any smart way to parallelize the file reading. Would MPI work in this case? Or are there any other tips to speed up this function?
Thank you very much!
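One alternative to parallelizing the read itself is to overlap I/O with computation: read chunk k+1 on a background thread while computing on chunk k. Below is a minimal double-buffered sketch of that idea; the function name and structure are my own (not the original crossprod_cpp), and it assumes chunk_cols divides p evenly and the file holds column-major doubles.

```cpp
#include <cstddef>
#include <fstream>
#include <thread>
#include <vector>

// Double-buffered crossprod sketch: while the current chunk is being
// multiplied, a background thread reads the next chunk from disk.
// Hypothetical helper; assumes chunk_cols divides p evenly.
std::vector<double> crossprod_prefetch(const char *fname, int n, int p,
                                       int chunk_cols,
                                       const std::vector<double> &y) {
    std::vector<double> result(p, 0.0);
    std::ifstream xfile(fname, std::ios::binary);
    const std::size_t chunk_doubles = (std::size_t)chunk_cols * n;
    std::vector<double> cur(chunk_doubles), next(chunk_doubles);
    const int chunks = p / chunk_cols;

    // the first chunk has to be read synchronously
    xfile.read((char *)cur.data(), chunk_doubles * sizeof(double));
    for (int c = 0; c < chunks; c++) {
        std::thread reader;
        if (c + 1 < chunks)  // kick off the read of the next chunk in the background
            reader = std::thread([&] {
                xfile.read((char *)next.data(), chunk_doubles * sizeof(double));
            });
        for (int j = 0; j < chunk_cols; j++) {  // compute on the current chunk
            double acc = 0.0;
            for (int i = 0; i < n; i++)
                acc += cur[(std::size_t)j * n + i] * y[i];
            result[c * chunk_cols + j] = acc;
        }
        if (reader.joinable()) reader.join();  // next chunk is now in memory
        std::swap(cur, next);                  // next chunk becomes current
    }
    return result;
}
```

The inner compute loop can still be an OpenMP parallel for; the point is only that the ~6 s of disk time per chunk is hidden behind the computation on the previous chunk rather than paid serially.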
END EDIT
EDIT 2: Would GPU parallel computing work here?
I am new to GPU computing, but thinking out loud: would GPU computing with careful programming bring any speedup here? My guess is no, since disk reading is the bottleneck, but I would really like to know whether it could help.
END EDIT 2