
I am a beginner with bash. I need some help in making this job more efficient.

while read -r line
do
    echo "$line"
    file="Species.$line"
    grep -A 1 "$line" /project/ag-grossart/ionescu/DB/rRNADB/SILVA_123.1_SSURef_one_line.fasta > "$file"
done < species1

The species file (species1 above) contains about 100,000 species names. The file in which I am searching is a 24 GB FASTA (text) file.

The format of the large file is:

Domain;Phylum;Class;Order;Family;Genus;Species

AGCT----AGCT (50,000 characters per line)

Here is a sample of the species file (no empty lines in between)

Alkanindiges_illinoisensis
Alkanindiges_sp._JJ005
Alligator_sinensis
Allisonella_histaminiformans
'Allium_cepa'
Alloactinosynnema_album
Alloactinosynnema_sp._Chem10
Alloactinosynnema_sp._CNBC1
Alloactinosynnema_sp._CNBC2
Alloactinosynnema_sp._FMA
Alloactinosynnema_sp._MN08-A0205
Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum

Here is the first line of the large file:

HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina.............................................................................-UC-U-G--G-U---------------------------
(this goes on for 50,000 characters per line)

Here are some more headers:

>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55

I need for each species a file containing all the matching sequences.

The code above works, but in 16 hours it managed to process fewer than 2,000 species.

I would like to run this in parallel to speed this up. Any other tips on making this search more efficient are welcome as well.

Thanks

anishsane
Danny
  • You should specify the format of the 24GB file's lines; it will help to better answer your question – Aaron Aug 24 '16 at 09:05
  • Also, splitting the 24GB file could help with performance, and will certainly help with testing the answers you'll get here. Use `split -l 10000 filename` to split it into files of 10,000 lines, for example – Aaron Aug 24 '16 at 09:17
  • Hi Aaron, I edited the post to include the format of the big file. Splitting may not be so good, since the order in the two files is not the same and the search may come up with nothing. I can create a smaller version of the files for testing purposes. – Danny Aug 24 '16 at 09:49
  • So is the output 100000 files as "file `species` contains about 100,000 species names" and "need for each species a file"? – James Brown Aug 24 '16 at 09:54
  • Once split, the final process should grep into every partial file, and while testing you could try it on a file which contains (parts of) one or two species you know. On the topic of order, doing an initial sort could make the following operations hugely quicker. However, sorting a 24GB file would itself take a long time... – Aaron Aug 24 '16 at 09:55
  • Do the species names contain spaces, and is the gene string space-delimited from the species name as above, not semicolon-delimited like the rest of the columns? – James Brown Aug 24 '16 at 10:08
  • I'd appreciate sample lines from both the species and the fasta file to test my solution below – pakistanprogrammerclub Aug 24 '16 at 10:08
  • What @pakistanprogrammerclub says but you can leave the sequence string short. – James Brown Aug 24 '16 at 10:11
  • I edited the post with sample lines from both files and some extra sample headers. The output is 100,000 files, each with one or more sequences in it. There are no spaces in the species names. – Danny Aug 24 '16 at 10:49
  • The "format of the large file" has 7 semicolon separated columns, the "first line of large file" has 12. Which one is the case? – James Brown Aug 24 '16 at 10:49
  • The database contains Bacteria, Archaea and Eukarya. The latter has a longer taxonomy. My species file is a subset of these, using only the Bacteria. I generated it by grepping all Bacteria headers from the large file and cutting the 7th field. So the matches should be one to one. – Danny Aug 24 '16 at 10:53
  • If you only want the first match, you can make a massive speedup by using `grep -m 1 ....` then you stop parsing the 24GB file as soon as you find the first match. – Mark Setchell Aug 24 '16 at 19:47
  • How long does `grep -Fwf species large.file` take? – hek2mgl Aug 26 '16 at 13:00

3 Answers


A little trickier than I first thought, since matched lines need to go to separate files. Please post performance numbers if you get the chance. This solution can be used in parallel too: the species list and/or the fasta file can be chunked and fed to parallel runs of the script (see the sketch just below).
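
For example (my sketch, not part of the original answer: it assumes GNU parallel is installed and that the script further down is saved as match_species.sh, a name I made up), the species list could be split into chunks and each chunk given its own run of the script:

split -l 10000 species1 species.chunk.      # ~10 chunks of 10,000 names each
parallel -j 4 './match_species.sh {} < SILVA_123.1_SSURef_one_line.fasta' ::: species.chunk.*

Because any given species sits in only one chunk, the jobs write disjoint sets of fastaout.* files and cannot clobber each other; the trade-off is that every job streams the whole 24 GB fasta, so disk throughput rather than CPU is likely to be the limit.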

This takes about 1 minute on an Intel Xeon E5 with a 6GB fake data file checked for 10,000 species. Increasing the species list to 100,000, even in chunks of 10,000, was problematic: I ran into disk issues with that many files being created and appended to in one directory. The problems began when the species list crossed 50,000 (this number will differ on other systems). I modified the script to create 100 subdirectories and limited each directory to 1,000 files; this worked well, and all 100,000 files were generated without having to chunk the species list or the 6GB data file.
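
That 100-subdirectory variant is not shown in the answer; a minimal sketch of the idea (my guess, not the author's actual modification) is to derive a bucket number from each species name and prefix the output path with it:

# my sketch: spread output files over 100 directories keyed on a cheap
# checksum of the species name (this demo only prints the assignment)
mkdir -p bucket.{00..99}
awk '{
    n = 0
    for (i = 1; i <= length($0); i++)
        n += i * index("abcdefghijklmnopqrstuvwxyz_.0123456789", tolower(substr($0, i, 1)))
    printf "%s -> bucket.%02d\n", $0, n % 100
}' species1

Inside the real script, the same bucket string would simply be prepended to the fastaout.* filename.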

Also to give you an idea of how fast grep is - it took 6 seconds to match 100,000 species in the 6GB file

specieslist=$1
nspecies=$(wc -l $specieslist|cut -f1 -d' ')
echo -e "grep $nspecies species from $specieslist\n"
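# note: grep is given no file operand, so the fasta is read from standard input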
grep -A1 -F -f $specieslist|
awk '
# skip context marker
/^--$/{next}
# process pair of lines
# first line is matching species header line
# species is semicolon-delimited field 7 of first line
# second line is sequence - both lines are written to a file with sanitized species name
{
  split($0, flds, ";")
  species=flds[7]
  filekey=gensub(/\W/,".","g",species)
  file="fastaout." filekey
  if(!(filekey in outfiles))  {
    outfiles[filekey]=file
    printf("species \"%s\" outfile \"%s\" first match line %d: \"%s\"\n", species, file, NR, $0)
    print >file
  }
  getline; print >>file
  # close may be needed on systems where awk cannot juggle too many open files
  close(file)
}
'
outfiles=(fastaout.*)
noutfiles=${#outfiles[*]}
echo -e "\ncreated $noutfiles fastaout.* files"
head -5 fastaout*
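
To make the invocation explicit (my addition; match_species.sh is just a placeholder name for the script above): since the grep has no file argument, the fasta is piped in on standard input, e.g.

./match_species.sh species1 < /project/ag-grossart/ionescu/DB/rRNADB/SILVA_123.1_SSURef_one_line.fasta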

Output and slightly modified test inputs follow. The species list has some actual matches, and each fasta sequence line is prefixed with the lowercased species name to verify correctness and avoid matching the species name again.

output

$ head out.*
==> out.Brachyspira_innocens <==
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
brachyspira_innocens.2:-UC-U-G--G-U---------------------------

==> out.Methanohalophilus_portucalensis_FDF-1 <==
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------

==> out.Pucciniomycotina <==
pucciniomycotina:-UC-U-G--G-U---------------------------

species list

Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum
Methanohalophilus_portucalensis_FDF-1
Brachyspira_innocens
Pucciniomycotina

fasta file

HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina;.............................................................................
pucciniomycotina:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
plagiochila_adiantoides:-UC-U-G--G-U---------------------------
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
citeromyces_siamensis:-UC-U-G--G-U---------------------------
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
brachyspira_pulli:-UC-U-G--G-U---------------------------
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
lampropedia_hyalina:-UC-U-G--G-U---------------------------
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
streptomyces_sp._qls01:-UC-U-G--G-U---------------------------
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
streptomyces_sp._qls04:-UC-U-G--G-U---------------------------
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
trichosporon_debeurmannianum:-UC-U-G--G-U---------------------------
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
streptomyces_sp._qls11:-UC-U-G--G-U---------------------------
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
clostridium_sp._pml3-1:-UC-U-G--G-U---------------------------
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
Marinobacter_sp.:-UC-U-G--G-U---------------------------
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
methylobacterium_sp._cbmb38:-UC-U-G--G-U---------------------------
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
pseudomonas_sp.:-UC-U-G--G-U---------------------------
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
pantoea_sp._cbmb55:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.2:-UC-U-G--G-U---------------------------
  • Thanks for the code. Before trying it I want to make sure it will do what I hope. The grep command will take all the lines (and the one after) that match any of the lines in the species1 file. The awk command will then write each pair of lines into a separate file. But if a species has more than one match (same species, different strain) these will not end up in the same file. Why would this be better/faster than my original code? – Danny Aug 24 '16 at 11:07
  • @Danny Your code has to create and execute `grep` for each and every line of your input file - that is 100,000 processes. – Mark Setchell Aug 24 '16 at 11:20
  • @MarkSetchell, I am aware of this; I was hoping to split this up into several parallel jobs – that was the initial question. – Danny Aug 24 '16 at 12:02
  • generally unix tools are designed for streaming input - this means it is often better to use them in a pipeline rather than a loop - also gnu grep has special needle in haystack optimizations that make it very fast on matching a few patterns in a very large file - together these points should make my solution orders of magnitude faster than iterating grep over one pattern at a time – pakistanprogrammerclub Aug 24 '16 at 12:30
  • please provide sample inputs for the multiple sequences per species case - I have no clue about species, sequences or strains - I'm just going by your original grep -A1 which suggests every sequence is preceded by the species header line - if that's indeed the case my solution works fine as can be seen in the test output with multiple entries in out.Brachyspira_innocens – pakistanprogrammerclub Aug 24 '16 at 13:05
  • Regarding sorting - it is often a useful pre- or post-processing step in many solutions - it's actually not that costly - gnu sort can sort 1GB of data per minute on a reasonable production machine - but many problems can be efficiently solved in a single pass in linear time without sort - with large datasets that can mean minutes versus hours or days – pakistanprogrammerclub Aug 24 '16 at 13:13
  • @pakistanprogrammerclub The first time I ran the code a day ago the system blocked. I reduced the species file to 30000. Your code runs fast but it generates only 407 files, each containing exactly one line, which is the sequence. The fasta header line (">name;name;name...") is not in the output file. Also I think something goes wrong with the pattern search, as it tries to create an outfile based on a pattern that is not in my species file: awk: cmd. line:9: (FILENAME=- FNR=1540) fatal: can't redirect to `out.Pseudomonas_sp._'11/20CMC_control'' (No such file or directory) – Danny Aug 26 '16 at 07:55
  • Thanks for trying it out - I've fixed the problem of missing sequences with an append >> instead of overwrite > - the header is now printed before getline - the outfile name is simply the species name - that's a problem if the species name has invalid filename characters like "/" - I've sanitized the filename using "." for any nonkeyword or dot character – pakistanprogrammerclub Aug 26 '16 at 12:30
  • If you still see filename errors it has to be field 7 in some header line is not the species - look at the debug prints that show the filename constructed for each species and the header line in grep output – pakistanprogrammerclub Aug 26 '16 at 13:11
  • @pakistanprogrammerclub Now it works (with some issues). I think in the above the grep command misses the "large" file in which to search. I added that and then it works. The second issue is that it seems to ignore my species list. I downsized it to ~30000 species but the output (still running) is ~100,000 files. This seems to take all unique values from the 7th field and separate them into files, including a fastaout. (nothing after the dot) that contains all taxa without a species name, i.e. empty field 7. But I can filter these files out easily. Thanks! – Danny Aug 29 '16 at 09:40
  • @pakistanprogrammerclub Another comment - now I noticed that it takes the name only from the first sequence. So it generates a file with one name and many sequences after (in the case of multiple sequences per species). – Danny Aug 29 '16 at 12:24
  • Thanks for trying it - the grep is searching stdin - the idea is to run the script as `./myscr.sh specieslistfile` – pakistanprogrammerclub
  • you're right about the single file with multiple sequences - did you want separate files per sequence like fastaout.species.0 fastaout.species.1 ... – pakistanprogrammerclub Aug 29 '16 at 12:27

I would probably reach for something other than shell + grep for this, but certainly parallelizing it would be a big first step. Here's a bash4 + awk solution:

# read all 100,000 species names into a shell array
mapfile -t species <species1  

# turn the names into a single big regular expression 
regex=${species[0]}$(printf '|%s' "${species[@]:1}")

# use awk to print each matching header line, plus the sequence line that
# follows it, into the file for that species (field 7 of the header)
awk -F';' '
$7 ~ /^('"$regex"')$/ {
    out = "Species." $7
    print > out                              # the header line
    if ((getline seq) > 0) print seq > out   # the sequence line below it
}' bigfile.txt
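
One caveat, added by me rather than taken from the answer: several of the species names contain regex metacharacters, most obviously the dot in names like Alloactinosynnema_sp._CNBC1. Since the names end up inside a regular expression, it may be safer to escape such characters before building the alternation, for example:

# escape ERE metacharacters in the names before joining them (sketch only;
# extend the bracket expression if other special characters turn up)
mapfile -t species < <(sed 's/[][\.|$(){}?+*^]/\\&/g' species1)
regex=${species[0]}$(printf '|%s' "${species[@]:1}")

With roughly 100,000 alternatives the pattern also becomes very long, so it is worth checking that your awk accepts it; splitting the species list into a few chunks and running the awk once per chunk would be a fallback.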
Mark Reed

I haven't tried AWK against such big data but I'm kind of curious:

$ cat > spec.awk
NR==FNR {       # the species file
    species[$0] # read to an array "species"
    next
} 
# below, if the beginning of the last column (until first space) is found from
# the species array, write the whole row ($0) to a file named by the species.
match($NF,/^[^ ]+/) && (beginningof=substr($NF,RSTART,RLENGTH)) && (beginningof in species) {
    print $0 > beginningof
}

$ awk -F';' -f spec.awk spec large

It reads all the species into an array and then matches the beginning of the last column (the part before the first space) in the large file. If it finds a match, it writes the whole row to a file named after the species (print $0; if you want just the last column, replace $0 with $NF), i.e. this may produce 100,000 files in one directory. The fine test file below produced three files, 1, 2 and 3:

$ cat large
foo;1 asd
bar;2 asd
foobar;3 asd
foo;1 asd
bar;2 asd
foo;bar;3 asd
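
The spec species file for this toy run is not shown; for it to produce the files 1, 2 and 3 it would presumably just list those keys, something like:

$ cat spec
1
2
3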

Disclaimer: Kids, never cut, paste and execute code from the internet if you don't know how it works.

James Brown