I have a file with accession numbers. Those numbers need to be mapped to IDs in another file, and with that information plus complementary data from a MySQL database I write a third file. I have a simple program that reads the file (145 GB), extracts each accession number, and then uses the grep command to find the corresponding ID in the mapping file (10 GB). So for each accession number I'm performing one grep:
$ grep -m1 myAccession myMappFile
This operation is performed many times. Since I'm accessing the same file over and over again, I'd like to know if there is a simple way to create an index, or some sort of bash magic, that improves the performance, since I have to process around 45 million accessions. I've processed 250k accessions in ~3 h, so processing all 45M will take ~540 h (22 days!!), which is not affordable... I'm aware that I can get some improvement by sending one grep with multiple accessions:
$ grep 'accession1\|accession2\|accession3' -m3 myMappFile
However this is not enough.
Maybe something like:
$ grep 'accession1\|accession2\|accession3' -m3 myIndexedMappFile
Note: the database step is already optimized. I've drastically reduced database access by using a hash map, so the bottleneck is definitely the grep.
Any ideas?
Update:
*File with accessions:*
>Accession_A other text
other line
...
...
>Accession_B more text
more lines
...
*File with mappings:*
Col1 Accession_A ID-X Col4
Col1 Accession_B ID-Y Col4
...
...
So the program reads the accession file line by line, extracts Accession_N, then greps for that accession in the mapping file. From the resulting row I extract the ID value, and with that ID I look up more data in a database, so in the end I have a file with:
Accession_A ID-X DB-DATA
Accession_B ID-Y DB-DATA
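To make the two extraction steps concrete, here is a minimal Java sketch of them. It is not the actual program: the class and method names are hypothetical, and it assumes that header lines start with `>` and that the mapping columns are whitespace-separated with the ID in the third column, as in the samples above.

```java
import java.util.Optional;

// Hypothetical helper class; names are for illustration only.
final class AccessionParsing {

    // Returns the accession from a header line such as ">Accession_A other text",
    // or empty if the line is not a header line.
    static Optional<String> extractAccession(String line) {
        if (line == null || !line.startsWith(">")) {
            return Optional.empty();
        }
        String rest = line.substring(1).trim();
        if (rest.isEmpty()) {
            return Optional.empty();
        }
        // The accession is the first whitespace-delimited token after '>'.
        return Optional.of(rest.split("\\s+", 2)[0]);
    }

    // Returns the ID from a mapping row such as "Col1 Accession_A ID-X Col4"
    // (assumed: whitespace-separated columns, ID in the third one).
    static Optional<String> extractId(String mappingRow) {
        if (mappingRow == null) {
            return Optional.empty();
        }
        String[] cols = mappingRow.trim().split("\\s+");
        return cols.length >= 3 ? Optional.of(cols[2]) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractAccession(">Accession_A other text").orElse("(none)"));
        System.out.println(extractId("Col1 Accession_A ID-X Col4").orElse("(none)"));
    }
}
```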
Neither file is sorted. I put the values {ID, DB-DATA} into a hash map to avoid DB overhead.
The program is written in Java and uses Process to exec the grep command. To reduce the overhead of the Runtime.exec calls I've tried running grep with multiple accessions at once, but the performance is almost the same...
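For reference, the batched grep call looks roughly like the sketch below. This is a simplified illustration rather than the real code: `BatchGrep`, `buildCommand`, and `run` are hypothetical names, and it assumes GNU grep is on the PATH (for the `-m` option and the `\|` alternation shown earlier).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are for illustration only.
final class BatchGrep {

    // Builds the equivalent of: grep -m3 'accession1\|accession2\|accession3' myMappFile
    // No shell is involved, so the pattern needs no quoting.
    static List<String> buildCommand(List<String> accessions, String mappingFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("grep");
        cmd.add("-m" + accessions.size());        // stop after this many matching lines
        cmd.add(String.join("\\|", accessions));  // basic-regex alternation
        cmd.add(mappingFile);
        return cmd;
    }

    // Runs one grep for the whole batch and returns the matching rows.
    static List<String> run(List<String> accessions, String mappingFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(accessions, mappingFile))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep grep errors visible
                .start();
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line);
            }
        }
        process.waitFor();
        return rows;
    }
}
```

Even with batching, every call still scans the 10 GB mapping file from the start, which would explain why the timing barely changes.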