
So I wrote a script to extract data from raw genome files. Here's what a raw genome file looks like:

# rsid  chromosome  position    genotype
rs4477212   1   82154   AA
rs3094315   1   752566  AG
rs3131972   1   752721  AG
rs12124819  1   776546  AA
rs11240777  1   798959  AG
rs6681049   1   800007  CC
rs4970383   1   838555  AC
rs4475691   1   846808  CT
rs7537756   1   854250  AG
rs13302982  1   861808  GG
rs1110052   1   873558  TT
rs2272756   1   882033  GG
rs3748597   1   888659  CT
rs13303106  1   891945  AA
rs28415373  1   893981  CC
rs13303010  1   894573  GG
rs6696281   1   903104  CT
rs28391282  1   904165  GG
rs2340592   1   910935  GG

The raw text file has hundreds of thousands of these rows, but I only need specific ones, about 10,000 of them. I have a list of rsids, and I just need the genotype from each matching line. So I loop through the rsid list and use preg_match to find each line I need:

    $rawData = file_get_contents('genome_file.txt');
    $rsids = $this->get_snps();

    while ($row = $rsids->fetch_assoc()) {

        $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";

        if (preg_match($searchPattern,$rawData,$matchedGene)) {

            $genotype = $matchedGene[3];

            // Do something with genotype

        }   

    }   

NOTE: I stripped out a lot of code to show just the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Here's the code with the database work included:

    $rawData = file_get_contents('genome_file.txt');
    $rsids = $this->get_snps();

    $query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
    $stmt = $ngdb->prepare($query);

    $stmt->bind_param("iissi", $file_id,$snp_id,$genotype,$reputation,$zygosity);

    $ngdb->query("START TRANSACTION");

    while ($row = $rsids->fetch_assoc()) {

        $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";

        if (preg_match($searchPattern,$rawData,$matchedGene)) {

            $genotype = $matchedGene[3];

            $stmt->execute();
            $insert++;

        }   


    }   

    $stmt->close();
    $ngdb->query("COMMIT");

    $rsids->free();
    $ngdb->close();


Unfortunately, my script runs very slowly: 50 iterations take 17 seconds, so you can imagine how long 18,000 iterations will take. I'm looking into ways to optimise this.

Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines and use preg_grep()? Would that be any faster?
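Something like this is what I have in mind (untested sketch; it still scans the line array once per rsid, so I'm not sure it would actually help):

    // Untested sketch of the preg_grep() idea: split the file into lines once,
    // then filter the array with one anchored pattern per rsid.
    $lines = explode("\n", $rawData);

    while ($row = $rsids->fetch_assoc()) {
        $matched = preg_grep("~^rs{$row['rsid']}\t~", $lines);
        foreach ($matched as $line) {
            $fields = explode("\t", $line);
            $genotype = $fields[3];   // genotype is the fourth column
            // Do something with genotype
        }
    }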

Something I tried is combining all 18,000 rsids into a single expression (i.e. `(rs123|rs124|rs125)`), like this:

    $rsids = get_rsids(); 

    $rsid_group = implode('|',$rsids);
    $pattern = "~({$rsid_group})\t(.*?)\t(.*?)\t(.*?)\n~i";
    preg_match($pattern,$rawData,$matches);

But unfortunately it gave me an error message about exceeding the PCRE pattern-size limit; the needle was way too big. Another thing I tried is adding the S modifier to the expression. I read that it analyses the pattern in order to increase performance, but it didn't speed things up at all. Maybe my pattern isn't compatible with it?
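I suppose I could split the list into chunks so that each pattern stays under the limit. Something like this, maybe (untested sketch, and the chunk size is a guess):

    // Untested sketch: break the 18,000 rsids into chunks so each
    // alternation stays under PCRE's pattern-size limit.
    foreach (array_chunk($rsids, 1000) as $chunk) {
        $pattern = '~(' . implode('|', $chunk) . ')\t(.*?)\t(.*?)\t(.*?)\n~i';
        if (preg_match_all($pattern, $rawData, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                $genotype = $m[4];   // group 1 is the rsid, group 4 the genotype
            }
        }
    }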

So then the second thing I need to try and optimise is the database inserts. I added a transaction hoping that would speed things up, but it didn't help at all. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once rather than inserting them individually.
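Roughly what I have in mind (untested sketch; `insert_batch` is a hypothetical helper, not code I already have, and the batch size is a guess):

    // Untested sketch: collect matched rows and flush them as one
    // multi-row INSERT per batch, instead of one execute() per row.
    // insert_batch() is a hypothetical helper, not from the code above.
    function insert_batch(mysqli $db, array $rows)
    {
        $placeholders = implode(',', array_fill(0, count($rows), '(?,?,?,?,?)'));
        $stmt = $db->prepare("INSERT INTO wp_genomics_results
            (file_id,snp_id,genotype,reputation,zygosity) VALUES $placeholders");
        $params = array_merge(...$rows);
        $stmt->bind_param(str_repeat('iissi', count($rows)), ...$params);
        $stmt->execute();
        $stmt->close();
    }

    $batchSize = 500;   // a guess; worth tuning
    $rows = [];

    while ($row = $rsids->fetch_assoc()) {
        // ... regex match as before, producing $snp_id and $genotype ...
        $rows[] = [$file_id, $snp_id, $genotype, $reputation, $zygosity];
        if (count($rows) >= $batchSize) {
            insert_batch($ngdb, $rows);
            $rows = [];
        }
    }
    if ($rows) insert_batch($ngdb, $rows);   // flush the remainder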

Then another idea is something I read about: using LOAD DATA INFILE to load rows from a text file. In that case I just need to generate a text file first. I wonder whether generating a text file would work out faster in this case.
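Generating the file seems simple enough. Roughly this (untested sketch; it assumes the matching step has already produced $file_id, $snp_id and $genotype for each row):

    // Untested sketch: write the matched rows to a temporary tab-separated
    // file, then bulk-load it with LOAD DATA LOCAL INFILE.
    $tmp = tempnam(sys_get_temp_dir(), 'geno');
    $out = fopen($tmp, 'w');

    while ($row = $rsids->fetch_assoc()) {
        // ... matching as before ...
        fwrite($out, "$file_id\t$snp_id\t$genotype\n");
    }
    fclose($out);

    $tmpEscaped = $ngdb->real_escape_string($tmp);
    $ngdb->query("LOAD DATA LOCAL INFILE '$tmpEscaped'
                  INTO TABLE wp_genomics_results
                  FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
                  (file_id, snp_id, genotype)");
    unlink($tmp);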

EDIT: It seems like what's taking up most of the time is the regular expressions. Running that part of the program by itself takes a really long time: 10 rows take 4 seconds.

Dominic Fagan
  • What about just selecting the `rsids` you need? `select genotype from table where rsid in('rs4477212', 'etc')` put all 10000 ids in that `in`. – chris85 Oct 08 '16 at 16:11
  • `LOAD DATA INFILE` would be much faster and it's configurable. You could then `DELETE FROM table WHERE rsid NOT IN (1,2,3)` etc... – AbraCadaver Oct 08 '16 at 16:19
  • Can you show some examples of your `$rawData` value? Please [edit] your question to show that. The secret to improving speed for your program will be eliminating, or drastically reducing, the use of regular expressions. – O. Jones Oct 08 '16 at 16:21
  • The $rawData variable contains the raw genome file. I just use file_get_contents($genome_file) to load the whole file into a variable. – Dominic Fagan Oct 08 '16 at 16:24
  • Also, I do select only the rsids I need. But the genotype is going to be unique for each file (every person has unique DNA), so I need to load the genotypes from every person's genome file into the database. – Dominic Fagan Oct 08 '16 at 16:26
  • Read the file, then find. Simple and faster: `exec` with awk, and load only the matched data. – Alex Muravyov Oct 08 '16 at 16:35
  • Ahhh. You molecular geneticists! You think every problem is a DNA annealing problem :-) ... just dump a bunch of radiotagged `rsxxxx` fragments into your raw data and see what sticks. That's what regexes do. But annealing is a massively parallel chemical process, and you're using a serialized computer language. :-) :-) – O. Jones Oct 08 '16 at 16:49

3 Answers


This is slow because you're searching a vast array of data over and over again.

It looks like you have a text file, not a dbms table, containing lines like these:

rs4477212   1   82154   AA
rs3094315   1   752566  AG
rs3131972   1   752721  AG
rs12124819  1   776546  AA

It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.

I think you want exact matches for the rsxxxx values, not prefix or partial matches.

I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.

So, here's what you do, in pseudocode. Don't load the whole raw data file into memory; rather, process it line by line.

  1. Read your rows of rsid values from the dbms, just once, and store them in an associative array.
  2. for each file of raw data....
    1. for each line of data in the file...
      1. split the line of data to obtain the rsid. In php, $array = explode("\t", $line, 2); will yield your rsid in $array[0], and do it fast (the file is tab-separated).
      2. Look in your array of rsid values for this value. In php, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
      3. If the key does exist, you have a match.
      4. extract the last column from the raw text line (`GC` or whatever)
      5. write it to your dbms.

Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits php's associative array feature to do the matching. All that will be much faster than your method.

To speed the process of inserting tens of thousands of rows into a table, read this: Optimizing InnoDB Insert Queries.

O. Jones
  • +1 The only other idea I had, if it were important to avoid keeping a hash table of rsids in memory, is to ensure that both the rsid query and the input genome file are pre-sorted in rsid order. Then one could write code to read both row by row, and increment one or the other if the rsid doesn't match (see the sketch after these comments). – Bill Karwin Oct 08 '16 at 16:59
  • Brilliant! It reduced the time taken to get genotypes by about 10X. – Dominic Fagan Oct 08 '16 at 17:51
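
A minimal sketch of that sorted-merge idea (illustrative only, not code from the thread; it assumes both the rsid query result and the genome file are sorted by rsid under the same string ordering):

    // Illustrative sketch: walk two rsid-sorted streams in lockstep,
    // advancing whichever side is behind, so neither is held in memory.
    $row = $rsids->fetch_assoc();              // rsid-sorted query result
    $fh = fopen('genome_file.txt', 'r');       // rsid-sorted genome file
    $line = fgets($fh);

    while ($row && $line !== false) {
        $fields = explode("\t", rtrim($line, "\r\n"));
        $cmp = strcmp('rs' . $row['rsid'], $fields[0]);
        if ($cmp === 0) {
            $genotype = $fields[3];            // matched: use the genotype
            $row = $rsids->fetch_assoc();
            $line = fgets($fh);
        } elseif ($cmp < 0) {
            $row = $rsids->fetch_assoc();      // the rsid list is behind
        } else {
            $line = fgets($fh);                // the file is behind
        }
    }
    fclose($fh);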

+1 to @Ollie Jones' answer. He posted while I was working on my answer. So here's some code to get you started.

$rsids = $this->get_snps();

// Build a hash of the rsids we want, for O(1) lookups.
$rsidHash = [];
while ($row = $rsids->fetch_assoc()) {
    $key = 'rs' . $row['rsid'];
    $rsidHash[$key] = true;
}

$rawDataFd = fopen('genome_file.txt', 'r');
while ($rawData = fgetcsv($rawDataFd, 80, "\t")) {
    if (array_key_exists($rawData[0], $rsidHash)) {
        $genotype = $rawData[3];
        // do something with genotype
    }
}
fclose($rawDataFd);
Bill Karwin

I wanted to give the LOAD DATA INFILE approach a try to see how well it works, so I came up with what I thought was a nice, elegant approach. Here's the code:

    $file = 'C:/wamp/www/nutri/wp-content/plugins/genomics/genome/test';

    $data_query = "
    LOAD DATA LOCAL INFILE '$file' 
    INTO TABLE wp_genomics_results 
    FIELDS TERMINATED BY '\t' 
    LINES TERMINATED BY '\n' 
    IGNORE 18 ROWS 
    (@rsid,@chromosome,@locus,@genotype) 
    SET file_id = '$file_id', 
        snp_id = (SELECT id FROM wp_pods_snp WHERE rsid = SUBSTR(@rsid,2)), 
        genotype = @genotype
    ";

    $ngdb->query($data_query);

I put a foreign key constraint on the snp_id column (that's the ID for my table of rsids) so that it only enters genotypes for the rsids I need. Unfortunately this foreign key constraint caused some kind of error which locked the tables. Ah well. It might not have been a good approach anyhow, since there are on average 200,000 rows in each of these genome files. I'll go with Ollie Jones's approach, since that seems to be the most effective and viable approach I've come across.

Dominic Fagan
  • That was worth trying. As it happens, MySQL's foreign key constraint system isn't optimized to deal with large volumes of failed constraint checks. But it's hard to know that without trying things. There are very few things in modern programming languages more efficient than in-memory hash tables (aka associative arrays, aka dictionaries). Language runtime developers compete with each other to eke out two- and three-machine-cycle improvements in this stuff. So the rest of us may as well grab the benefits. – O. Jones Oct 10 '16 at 15:56