
The test.fa.gz file contains many records of 4 lines each, like the one below:

@HWI-ST298:420:B08APABXX:3:1101:1244:2212 1:N:0:TCATTC
GGCAAGGCACTTACTTTACAGCTAAAGAAGTGCAGC
+
@@@FDFFDFHCFDACGHC<<CCFEHHFCCFCEE:C?

What I want to do is read every four lines of the *.fq.gz file in parallel with OpenMP. The code below compiles successfully, but sometimes shows incorrect results. In each loop iteration I call getline() four times to read one record. I'm not sure how OpenMP handles the work in each iteration, or how the .gz file handle moves between iterations of the OpenMP loop.

I've searched the internet and the OpenMP documentation, but I still don't quite get it, so any help will be appreciated.

Thanks,


#include <iostream>
#include <string>
#include <cstdlib>
#include <gzstream.h>
#include <omp.h>
using namespace std;

string reverseStrand (string seq);

int main (int argc, char ** argv) {
    const char* gzFqFile;
    unsigned int nReads;

    if (argc == 3) {
        gzFqFile = argv[1];
        nReads   = atoi(argv[2]); }
    else {
        printf("\n%s <*.fq.gz> <number_of_reads>\n", argv[0]);
        return 1; }

    igzstream gz(gzFqFile);
    string li, bp36, strand, revBp36;
    unsigned int i;
    #pragma omp parallel shared(gz) private(i,li,bp36,strand,revBp36)
    {
        #pragma omp for schedule(dynamic)
        for(i = 0;i < nReads;++i) {
            li      = "";
            bp36    = "";
            strand  = "";
            revBp36 = "";
            getline(gz,li,'\n');
            getline(gz,li,'\n');
            bp36 = li;
            getline(gz,li,'\n');
            strand = li;
            getline(gz,li,'\n');
            if(strand.compare("-") == 0) {
                revBp36 = reverseStrand(bp36);
            }
            cout << bp36 << " " << strand << " " << revBp36 << "\n";
        }
    }
    gz.close();
}
user1465767
    You can't read in parallel from _compressed_ files, even if the original uncompressed content consists of equally sized records. You must serialise the I/O with critical sections and you would essentially lose the benefits of parallel processing. Read as much as you can from the file in a single thread and then process in parallel what you have read, then repeat until EOF. – Hristo Iliev Jun 19 '12 at 10:34
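
A minimal sketch of the batch pattern described in that comment, reusing reverseStrand and the igzstream input from the question; the batch size, the buffering of records into vectors, and the critical section around the output are my own illustrative assumptions, not part of the original post:

// Sketch: one thread reads a batch of records serially, then all threads
// process that batch in parallel; repeat until EOF. BATCH is an arbitrary choice.
#include <iostream>
#include <string>
#include <vector>
#include <gzstream.h>
#include <omp.h>
using namespace std;

string reverseStrand (string seq);   // defined elsewhere, as in the question

int main (int argc, char ** argv) {
    igzstream gz(argv[1]);
    const long BATCH = 10000;                    // records per serial read
    vector<string> bp36(BATCH), strand(BATCH);
    bool done = false;
    while (!done) {
        long n = 0;
        string li;
        // Serial I/O: the compressed stream can only be read sequentially.
        for (; n < BATCH; ++n) {
            if (!getline(gz, li)) { done = true; break; }  // header line
            getline(gz, bp36[n]);                          // sequence line
            getline(gz, strand[n]);                        // strand line
            getline(gz, li);                               // quality line
        }
        // Parallel processing of the records that were just read.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i) {
            string revBp36 = (strand[i] == "-") ? reverseStrand(bp36[i]) : "";
            #pragma omp critical
            cout << bp36[i] << " " << strand[i] << " " << revBp36 << "\n";
        }
    }
    gz.close();
    return 0;
}

Because the iterations of the processing loop are independent, the output order may differ from the input order; if order matters, collect the results into a vector and print them after the parallel loop.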

2 Answers


More of an extended comment than an answer perhaps but here goes anyway ...

Even if getline were thread-safe, it's probably not a good idea to have multiple threads in an OpenMP program all trying to read the same file simultaneously. Unless you have a parallel file system (since you don't mention it, I assume you don't) you run the risk of writing a program in which the threads fight each other for the single I/O channel. Consider the case of 4 threads, each reading different parts of a file, all using 1 read/write head on a disk. Quasi-random reading of small bits of a file is probably the slowest approach you could think of.

Haatschii's suggestion of wrapping the file access in a critical section will simply mean that instead of fighting for I/O access the threads play nicely together, each waiting politely for its turn. But, as Haatschii suggests, this is not likely to lead to any speedup in file reading; more likely (in my experience) it will lead to a slowdown. If I/O time is not critical, this might be a way to go.

If you are concerned with I/O time then either read the file in one thread and parallelise the processing of the data, or have each thread read all of its data in one gulp from the file, using critical sections to avoid contention for I/O resources.
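
A rough sketch of the second option, assuming the same igzstream input and reverseStrand as in the question; the even split of nReads across threads (ignoring any remainder) and the per-thread vectors are illustrative assumptions:

// Sketch: each thread reads its whole share of records in one critical
// section ("one gulp"), then processes its private copy in parallel.
#include <iostream>
#include <string>
#include <vector>
#include <cstdlib>
#include <gzstream.h>
#include <omp.h>
using namespace std;

string reverseStrand (string seq);   // defined elsewhere, as in the question

int main (int argc, char ** argv) {
    igzstream gz(argv[1]);
    long nReads = atol(argv[2]);
    #pragma omp parallel
    {
        vector<string> mySeq, myStrand;
        #pragma omp critical(read_gz)
        {
            // Crude even split of the records across threads (remainder ignored).
            long share = nReads / omp_get_num_threads();
            string li, seq, strand;
            for (long i = 0; i < share && getline(gz, li); ++i) {  // header line
                getline(gz, seq);                                  // sequence line
                getline(gz, strand);                               // strand line
                getline(gz, li);                                   // quality line
                mySeq.push_back(seq);
                myStrand.push_back(strand);
            }
        }
        // I/O is done for this thread; the work below runs fully in parallel.
        for (size_t i = 0; i < mySeq.size(); ++i) {
            string revBp36 = (myStrand[i] == "-") ? reverseStrand(mySeq[i]) : "";
            #pragma omp critical(write_out)
            cout << mySeq[i] << " " << myStrand[i] << " " << revBp36 << "\n";
        }
    }
    gz.close();
    return 0;
}

Each thread still has to wait its turn for the gulp, so this only helps if the processing after the read dominates the run time.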

High Performance Mark
  • Even if he had a parallel file system, he is reading from a _compressed_ file which is inherently undoable in parallel (unless the file is compressed in blocks and an index is provided...) – Hristo Iliev Jun 19 '12 at 10:29

Calling getline on the same stream from different threads is not thread-safe, so you cannot do it simultaneously without getting undefined behavior. The only way to do this properly is to put critical sections around the getline calls, forcing only one thread to call getline on "gz" at any time. However, in your code example I doubt there will be any speedup from using more than one thread, because there is not much work for each thread to do other than reading lines from "gz".
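
Applied to the loop from the question, that could look roughly like the sketch below, meant as a drop-in replacement for the parallel region in the question's main(); the named critical sections and the grouping of all four getline calls into one block (so each thread reads a complete 4-line record) are my additions:

// Sketch: the four getline calls are serialised in one critical section so
// each thread reads a complete 4-line record; the rest stays parallel.
#pragma omp parallel for schedule(dynamic)
for (long i = 0; i < (long)nReads; ++i) {
    string li, bp36, strand, revBp36;   // per-iteration locals are private
    #pragma omp critical(read_gz)
    {
        getline(gz, li);        // header line
        getline(gz, bp36);      // sequence line
        getline(gz, strand);    // strand line
        getline(gz, li);        // quality line
    }
    if (strand == "-")
        revBp36 = reverseStrand(bp36);
    #pragma omp critical(write_out)
    cout << bp36 << " " << strand << " " << revBp36 << "\n";
}

Grouping the four calls matters: with a separate critical section around each getline, lines from different records could be interleaved between threads.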

Haatschii
  • If getline() won't give a speedup, is there any other function that can speed up reading a file with OpenMP? Thanks. – user1465767 Jun 19 '12 at 09:05
  • I don't think a speedup is possible from using more than one thread simply to read a file into RAM. However, if you do work on the data you read, of course there can be. In this case it would make sense to put the getline calls in a critical section and have the "work" done in parallel. If your "reverseStrand" requires a lot of computation time, you can try it this way. – Haatschii Jun 19 '12 at 09:11