
I have a file with accession numbers. Those numbers need to be mapped against IDs in another file, and with that information plus complementary MySQL database information I write a third file. I have a simple program that reads the file (145 GB), extracts the accession number and then uses the grep command to find the corresponding ID in the mapping file (10 GB). So for each accession number I'm performing one grep:

$ grep -m1 myAccession myMappFile

This operation is performed several times. Since I'm accessing the same file over and over again, I'd like to know if there is a simple way to create an index, or some sort of bash magic, that improves performance, because I have to process around 45 million accessions. I've processed 250k accessions in ~3 h, so processing all 45M would take around ~540 h (22 days!), which is not affordable... I'm aware that I can get some improvement by sending one grep with multiple accessions:

$ grep  'accession1\|accession2\|accession3' -m3 myMappFile

However, this is not enough.

Maybe something like:

$ grep  'accession1\|accession2\|accession3' -m3 myIndexedMappFile

Note: the database side is already optimized and I've drastically reduced database access by using a hash map, so the bottleneck is definitely the grep.

Any ideas?

Update:

*File with accessions:*
>Accession_A other text
other line
...
...
>Accession_B more text
more lines
...

*File with mappings*
 Col1  Accession_A   ID-X  Col4
 Col1  Accession_B   ID-Y  Col4
 ...
 ...

So the program reads the accession file (line by line), extracts Accession_N, then greps for that accession in the mapping file. From the resulting row I extract the ID value, and with that ID I search for more data in a database, so at the end I have a file with:

Accession_A ID-X DB-DATA

Accession_B ID-Y DB-DATA

Neither file is sorted. I put the {ID, DB-DATA} values into a hash map to avoid DB overhead.

The program is coded in Java and uses Process to exec the grep command. To reduce the overhead of the Runtime.exec calls I've tried running grep with multiple accessions at once, but it's almost the same...
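For reference, the per-accession lookup described above boils down to something like this (a sketch on hypothetical miniature files; the real inputs are 145 GB and 10 GB):

```shell
# Hypothetical stand-ins for the real accession list and mapping file
printf 'Accession_A\nAccession_B\n' > accessions.txt
printf 'Col1 Accession_A ID-X Col4\nCol1 Accession_B ID-Y Col4\n' > myMappFile

# One full scan of the mapping file per accession: O(accessions x mapping size),
# which is why 45M lookups against a 10 GB file cannot finish in reasonable time.
while read -r acc; do
  grep -m1 "$acc" myMappFile
done < accessions.txt > perAccession.out
```

Each iteration pays the full cost of scanning the mapping file from the top, so the total work grows with the product of both file sizes.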

  • not sure, but this might help: https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash – Sundeep Aug 16 '17 at 08:27
  • Could you provide a more complete [example](https://stackoverflow.com/help/mcve) – Thor Aug 16 '17 at 08:39
  • thanks @Sundeep but different use case. – jcoder8 Aug 16 '17 at 09:01
  • Sure @Thor I'm working on it – jcoder8 Aug 16 '17 at 09:04
  • 1
    @Sundeep The link that you posted definitely helps! instead of have running all the steps on the same program, first I cleaned the accession file, then mapped it against the mapping file with $grep -F -f cleaned.file map.file > match.file and now the java program only reads a flat file and complements the info with the DB, which is more faster...thanks! – jcoder8 Aug 16 '17 at 11:34
  • @jcoder8: Either post a detailed answer or delete the question please – Thor Aug 17 '17 at 06:12

1 Answer


I've worked from @Sundeep's suggestion and found a solution in terms of processing time; however, I still think there should be a better way to handle use cases where the user needs to perform several greps over the same file. What I did was:

First extract all the accession numbers from the first file:

grep -e "^>" myBigFile.fa | cut -d">" -f2 | cut -d" " -f1   > all_accession.txt

Then use grep with a pattern file:

grep -F -f all_accession.txt myBigMappingFile > matchFile.txt

Finally, use the Java program to process matchFile.txt in order to read the ID and create the target file (by "process" I mean just reading the ID and looking up the complementary information in the DB).

Those three steps run in 3.5 h, which is more acceptable. However, the solution is not complete, since running everything together (as I'd been trying from the beginning) also generates other output files; the most important one is a file with the accessions that don't have a corresponding ID in the mapping file. So I tried the following command to generate that file:

grep -F -v -f all_accession.txt myBigMappingFile > noMatchFile.txt

grep with the -v flag inverts the selection, but that command gives the records in myBigMappingFile that don't have a match in all_accession.txt, which is not the desired output...
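One way around this (a sketch on hypothetical miniature files, assuming the accession sits in column 2 of the match file, as in the mapping layout shown in the question) is to collect the accessions that did match and then let awk print the ones that never appeared:

```shell
# Tiny stand-ins for the real match file and accession list (hypothetical data)
printf 'Col1 Accession_A ID-X Col4\n' > matchFile.txt
printf 'Accession_A\nAccession_B\n' > all_accession.txt

# Pass 1 (NR==FNR, i.e. while reading matchFile.txt): remember the accession
# column of every matched row.
# Pass 2: print accessions from all_accession.txt that were never seen.
awk 'NR==FNR { seen[$2]; next } !($1 in seen)' matchFile.txt all_accession.txt > missing_accessions.txt
```

Note that the `NR==FNR` trick relies on matchFile.txt being non-empty; if no accession matched at all, the first condition would also hold while reading the second file.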
