Parse file and use some of the fields as variables using the header as name in bash

Question

I have a file whose first line contains a series of tab-separated (\t) field names. I'm trying to walk through the lines and use some of the fields as variables for a program. The code I have so far is the following:

    {
    A=$(head -1 id_table.txt)
    read;
    while IFS='\t' read $A; 
    do
        echo 'downloading '$SRA_Sample_s
        echo $tissue_s
    #out_dir=`echo $tissue_s | sed 's/ /./g'` #Replacing spaces by dots
    #/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir $out_dir --ncbi_error_report $SRA_Sample_s 
    done 
    } <./id_table.txt

Output (Wrong):

downloading _s Inser

downloading  provided> <no

downloading  provided> <no

downloading  provided> <no

It fails because it's not getting the fields correctly. Perhaps the <> characters are creating confusion? Different files have the columns in a different order, and some columns are missing in some files. I'm stuck here.

The file looks like this:

BioSample_s MBases_l    MBytes_l    Run_s   SRA_Sample_s    Sample_Name_s   age_s   breed_s sex_s   Assay_Type_s    AssemblyName_s  BioProject_s    BioSampleModel_s    Center_Name_s   Consent_s   InsertSize_l    Library_Name_s  Platform_s  SRA_Study_s biomaterial_provider_s  g1k_analysis_group_s    g1k_pop_code_s  source_s    tissue_s
SAMN02777951    4698    3249    SRR1287653  SRS607026   SL01    19  SL01    female  RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood
SAMN02777952    4451    3063    SRR1287654  SRS607028   XB01    12  XB01    male    RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood
SAMN02777953    4553    3139    SRR1287655  SRS607025   XB02    6   XB02    female  RNA-Seq <not provided>  PRJNA247712 Model organism or animal    SICHUAN UNIVERSITY  public  200 <not provided>  ILLUMINA    SRP041998    Chengdu Research Base of Giant Panda Breeding  <not provided>  <not provided>  <not provided>  blood

`IFS='\t'` hasn't worked the way you wanted. That's delimiting by `t`. Use `IFS=$'\t'` to use tabs. This is why you are getting `_s Inser`, etc. — Etan Reisner, Dec 01 '14 at 12:28
I see! that worked! Thanks. If you post it as answer I will accept it. — biojl, Dec 01 '14 at 13:06
biojl: You have to write [@username](http://stackoverflow.com/help/privileges/comment) in order to notify a user that you're addressing them (e.g.: @Etan). — whoan, Dec 01 '14 at 13:13

score 3 · Answer 1 · answered Dec 01 '14 at 15:15

`IFS='\t'` hasn't worked the way you wanted. Inside single quotes, `\t` is the two literal characters `\` and `t`, so `read` is delimiting by `t`. Use `IFS=$'\t'` to use tabs.

This is why you are getting `_s Inser`, etc. (notice it starts and cuts off at the letter t).
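
The difference can be seen in a minimal sketch. The helper name `first_field` and the two-column header are made up for illustration; only the two `IFS` values come from the discussion above.

```bash
#!/usr/bin/env bash
# first_field is a hypothetical helper: it splits its second argument using
# the IFS value given as its first argument and prints only the first field.
first_field() {
    local first rest
    IFS=$1 read -r first rest <<< "$2"
    printf '%s\n' "$first"
}

header=$'InsertSize_l\tLibrary_Name_s'   # two tab-separated column names

first_field '\t'  "$header"   # IFS is backslash + "t"; prints: Inser
first_field $'\t' "$header"   # IFS is a real tab;      prints: InsertSize_l
```

The first call reproduces the `Inser` fragment from the question's output, because the name is cut at its first `t`.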

That being said, I fully agree with EdMorton that using awk for this is likely a better idea. Still, I believe that with careful quoting, and the assertion that tabs will not appear inside the fields of the input file, you can likely do this safely with just the shell (but Ed has shown me the error of my initial thoughts on more than one occasion, so he may very well be thinking of things I am not).
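
One way such a carefully quoted loop might look is sketched below. It is not the asker's final code: the function name `print_samples` is hypothetical, and it assumes bash 4 or later (associative arrays), no tabs inside fields, and no empty fields, since `read` treats a run of consecutive tabs as a single delimiter. Columns are looked up by header name, so column order does not matter.

```bash
#!/usr/bin/env bash
# print_samples is a hypothetical helper: print_samples <file>
# Prints "<SRA_Sample_s><TAB><tissue_s with spaces turned into dots>" per row.
print_samples() {
    local -a names fields
    local -A col
    local i sample tissue
    {
        # Header line: map each column name to its position.
        IFS=$'\t' read -r -a names || return 1
        for i in "${!names[@]}"; do
            col[${names[i]}]=$i
        done
        # Some files lack columns, so check before using them.
        [[ -n ${col[SRA_Sample_s]+set} && -n ${col[tissue_s]+set} ]] || {
            echo "missing SRA_Sample_s or tissue_s column" >&2
            return 1
        }
        while IFS=$'\t' read -r -a fields; do
            sample=${fields[${col[SRA_Sample_s]}]}
            tissue=${fields[${col[tissue_s]}]}
            printf '%s\t%s\n' "$sample" "${tissue// /.}"
        done
    } < "$1"
}

# Usage, assuming the question's file name:
# print_samples id_table.txt
```

Every expansion that carries file data is quoted, so values containing spaces or glob characters are passed through unchanged.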

Etan Reisner

I agree, he'd probably be OK with a carefully-written (i.e. not the one he started with!) shell loop for this particular case. I was actually thinking `awk | xargs..` would be the best approach but I couldn't quite nail the xargs syntax for separating the 2 args per line! – Ed Morton Dec 01 '14 at 15:58
@EdMorton I think you'd need to assert that spaces weren't meaningful anywhere to use `xargs` for this and then you'd get to use something like `xargs bash -c '/path/to/cmd "$1" "${*:2}"' -` or something like that. Alternatively you might be able to have awk use `ORS='\0'` and use `xargs -n 2 -x bash -c '/path/to/cmd "$1" "$2"' -` but I'd have to try it and it would miss any non-paired line at the end (but I'm not sure that's an issue); it might also mis-pair any other lines that missed a field. – Etan Reisner Dec 01 '14 at 16:04
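
A rough sketch of the NUL-separated idea from the comment above follows. It deliberately swaps awk's `ORS='\0'` for a `tr '\n' '\0'` step, since NUL handling differs between awk implementations, and it assumes an `xargs` that supports `-0` (GNU and BSD do) and fields without embedded newlines. The helper name `run_pairs` is hypothetical.

```bash
#!/usr/bin/env bash
# run_pairs is a hypothetical helper: run_pairs <file> <command> [args...]
# It runs <command> once per data row with two extra arguments:
# the tissue_s value, then the SRA_Sample_s value.
run_pairs() {
    local file=$1
    shift
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i; next }
        { print $(f["tissue_s"]); print $(f["SRA_Sample_s"]) }
    ' "$file" |
    tr '\n' '\0' |        # one NUL-terminated argument per field
    xargs -0 -n 2 "$@"    # two arguments per command invocation
}

# Usage, with printf standing in for the real download command:
# run_pairs id_table.txt printf 'outdir=%s sample=%s\n'
```

As the comment warns, a row that is missing one of the two fields would shift the pairing for every row after it, and this sketch does not check that both columns exist.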

Ed Morton · Accepted Answer · 2014-12-01T14:20:46.937

You may find an awk script more robust and less cumbersome to use than a shell loop:

$ cat tst.awk
BEGIN { FS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
    print "downloading", $(f["SRA_Sample_s"])
    out_dir = $(f["tissue_s"])
    gsub(/ /,".",out_dir)
    cmd = sprintf( "/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir %s --ncbi_error_report %s", out_dir, $(f["SRA_Sample_s"]) )
    print cmd
    #system(cmd); close(cmd)
}

.

$ awk -f tst.awk file
downloading SRR1287653
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287653
downloading SRR1287654
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287654
downloading SRR1287655
/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir blood --ncbi_error_report SRR1287655

If you weren't calling an external command, and so doing more than just text processing, I'd say you should DEFINITELY avoid the shell loop.

Alternatively, consider using awk for the text processing and then piping to a shell loop for the external command execution:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 { for (i=1; i<=NF; i++) f[$i]=i; next }
{
    gsub(/ /,".",$(f["tissue_s"]))
    print $(f["tissue_s"]), $(f["SRA_Sample_s"])
}

.

$ awk -f tst.awk file |
while IFS=$'\t' read -r out_dir SRA_Sample_s
do
    printf 'downloading %s\n' "$SRA_Sample_s"
    #/soft/bio/sequence/sratoolkit-2.3.4-2/bin/fastq-dump.2.3.4 --split-3 --outdir "$out_dir" --ncbi_error_report "$SRA_Sample_s"
done
downloading SRR1287653
downloading SRR1287654
downloading SRR1287655

It wasn't me. The error found by @EtanReisner made my code work. I'm indeed calling an external command :) — biojl, Dec 01 '14 at 14:26
The problem with your approach though is that it leaves you exposed to several shell behaviors that are undesirable for text processing like globbing, word splitting, file name expansion, etc. so you could get highly unexpected and undesirable results based on the contents of your input file. That's why I suggested you use awk for the text processing part as it has none of those issues. If your input file is guaranteed to ALWAYS just contain letters and numbers though (ie no RE or globbing metacharacters like . * ? etc.) then you'll probably be OK. — Ed Morton, Dec 01 '14 at 14:33
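
The word splitting and globbing mentioned in the comment above can be reproduced with a small sketch. The helper names and the scratch directory are made up for illustration; the values stand in for fields read from the input file.

```bash
#!/usr/bin/env bash
# Hypothetical helpers showing what an unquoted expansion does to field data.
count_args()    { echo "$#"; }
show_unquoted() { ( cd "$1" && echo $2 ); }              # $2 is left unquoted on purpose
show_quoted()   { ( cd "$1" && printf '%s\n' "$2" ); }

tissue='whole blood'
count_args $tissue      # word splitting; prints: 2
count_args "$tissue"    # prints: 1

scratch=$(mktemp -d)
touch "$scratch/a.txt" "$scratch/b.txt"
show_unquoted "$scratch" '*'   # glob expands; prints: a.txt b.txt
show_quoted   "$scratch" '*'   # prints: *
rm -rf "$scratch"
```

Quoting every expansion removes both effects, which is why the quoted forms print the value exactly as it was read.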

score 1 · Answer 3 · answered Dec 01 '14 at 14:05

try (based on your style of development)

cat id_table.txt \
 | {
   read Header

   while eval "read ${Header}"
    do
      echo "Downloading ${SRA_Sample_s}"
      echo "${tissue_s}"
    done
   }

NeronLeVelu

Here I see an unnecessary use of `cat`, `eval`, and braces (in the variable names). I agree with you in the way you're getting the header. – whoan Dec 01 '14 at 14:10
It doesn't work as you wrote it. I fixed it adding IFS=$'\t' between while and eval in your code. I like the way you retrieve header, better than mine. – biojl Dec 01 '14 at 14:33
It is very sensitive to the separator (tab or space). I had a lot of issues when testing with a copy of your sample, because copy/paste mixed tabs and spaces. – NeronLeVelu Dec 01 '14 at 15:06
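
For readers who like the header-as-variable-names style but want to avoid `eval`, a possible variant is sketched below. The function name `process_table` is hypothetical, and the sketch assumes tab-separated input with no empty fields. Header names are checked to be plain identifiers before they are handed to `read`; a column named like a variable the script itself relies on (for example `PATH` or `IFS`) would still shadow it inside the function.

```bash
#!/usr/bin/env bash
# process_table is a hypothetical helper: process_table <file>
process_table() {
    local -a names
    local n
    {
        # Header line: one variable name per column.
        IFS=$'\t' read -r -a names || return 1
        for n in "${names[@]}"; do
            # Refuse anything that is not a plain identifier.
            [[ $n =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || {
                echo "bad column name: $n" >&2
                return 1
            }
        done
        local "${names[@]}"   # keep the per-column variables inside the function
        while IFS=$'\t' read -r "${names[@]}"; do
            printf 'Downloading %s\n' "$SRA_Sample_s"
            printf '%s\n' "$tissue_s"
        done
    } < "$1"
}

# Usage, assuming the question's file name:
# process_table id_table.txt
```

Passing the names as separate quoted arguments to `read` gives the same effect as the `eval` version with `IFS=$'\t'`, without re-parsing the header as shell code.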
