1

I'm trying to extract a particular column (4th column) from multiple (over thousands) CSV files and write it new file, in the same manner, the file is sequenced in a folder and first CSV file would provide all four column from the file. Now I'm wondering how I can make the output file's column names match the names of the input CSV files.

CSV files names are like

> EE85723.R.csv
> EE85727.R.csv
> EE87894.R.csv
> .......
> .......
>

> head EE85723.R.csv  
chr,start,end,copy 
chr1,1,10000,0
chr1,10001,20000,3.09651371393489 
chr1,20001,30000,1.91248096145222
chr1,30001,40000,1.96470746277162 
> 
> head EE85727.R.csv 
chr,start,end,copy  
chr1,1,10000,0
chr1,10001,20000,3.02441583128188 
chr1,20001,30000,1.87088110683025
chr1,30001,40000,1.94510909384639 
> 
> head EE87894.R.csv 
chr,start,end,copy 
chr1,1,10000,0
chr1,10001,20000,3.06941544044942 
chr1,20001,30000,1.83912070977027
chr1,30001,40000,1.90006068018602

#!/bin/bash
rm -f out.csv
set -- *R.csv

cut -d , -f 1,2,3 -- "$1" >out.csv

for file do
    cut -d , -f 4 -- "$file" | paste -d , out.csv - >out.tmp &&
    mv out.tmp out.csv
done

The output file I obtained had every column named "copy," making it difficult to determine which column is from which file.

> chr,start,end,copy,copy,copy
> chr1,1,10000,0,0,0
> chr1,10001,20000,3.02441583128188,3.06941544044942,3.09651371393489
> chr1,20001,30000,1.87088110683025,1.83912070977027,1.91248096145222
> chr1,30001,40000,1.94510909384639,1.90006068018602,1.96470746277162
> chr1,40001,50000,0.576139127131562,0.588528490660998,0.635347605084456
> chr1,50001,60000,1.51250200836185,1.50849932321034,1.52994133230921
> chr1,60001,70000,0.681365714967938,0.676156428892953,0.699545565388925
> chr1,70001,80000,0.436354857763045,0.449640001550081,0.497235183366175
> chr1,80001,90000,1.05269567207548,1.04655014589231,1.06732707247313
> 

The Expected outcome would be like this where every column header name should represent the file name without the extension.

> chr,start,end,EE85723.R,EE85727.R,EE87894.R

It will be an immense help if someone kindly suggests a potential way to proceed to overcome this issue in Bash.

  • That's a tricky one. But for starters, why boter with `>out.tmp`. Skip the `mv out.tmp out.csv` and just use `>> out.csv` .Then I would add another loop before the data exraction that creates the header you need. `printf "chr,start,end" > out.csv; for file ; do printf ",${file%.*} >> out.csv; done; printf "\n">> out.csv` and then let your code starting at the `for file` take over. Don't have data to test this with so Good luck. – shellter May 04 '23 at 01:09

1 Answers1

2

Here is a potential solution using awk.

Example files:

tail EE*.R.csv
==> EE85723.R.csv <==
chr,start,end,value
chr1,1,10000,0
chr1,10001,20000,3.02441583128188
chr1,20001,30000,1.87088110683025
chr1,30001,40000,1.94510909384639
chr1,40001,50000,0.576139127131562
chr1,50001,60000,1.51250200836185
chr1,60001,70000,0.681365714967938
chr1,70001,80000,0.436354857763045
chr1,80001,90000,1.05269567207548

==> EE85727.R.csv <==
chr,start,end,value
chr1,1,10000,0
chr1,10001,20000,3.06941544044942
chr1,20001,30000,1.83912070977027
chr1,30001,40000,1.90006068018602
chr1,40001,50000,0.588528490660998
chr1,50001,60000,1.50849932321034
chr1,60001,70000,0.676156428892953
chr1,70001,80000,0.449640001550081
chr1,80001,90000,1.04655014589231

==> EE87894.R.csv <==
chr,start,end,value
chr1,1,10000,0
chr1,10001,20000,3.09651371393489
chr1,20001,30000,1.91248096145222
chr1,30001,40000,1.96470746277162
chr1,40001,50000,0.635347605084456
chr1,50001,60000,1.52994133230921
chr1,60001,70000,0.699545565388925
chr1,70001,80000,0.497235183366175
chr1,80001,90000,1.06732707247313

The script:

cat test.sh
BEGIN {
    FS = OFS = ","
}

NR == FNR && FNR == 1 {
    sub(".csv", "", FILENAME)
    a[FNR] = $1 OFS $2 OFS $3 OFS FILENAME
}

NR == FNR && FNR > 1 {
    a[FNR] = $0
}

NR > FNR && FNR == 1 {
    sub(".csv", "", FILENAME)
    a[FNR] = a[FNR] OFS FILENAME
}

NR > FNR && FNR > 1 {
    a[FNR] = a[FNR] OFS $4
}

END {
    for (i = 1; i <= FNR; i++) {
        print a[i]
    }
}

Running the script:

awk -f test.sh EE*.R.csv > output.csv
cat output.csv
chr,start,end,EE85723.R,EE85727.R,EE87894.R
chr1,1,10000,0,0,0
chr1,10001,20000,3.02441583128188,3.06941544044942,3.09651371393489
chr1,20001,30000,1.87088110683025,1.83912070977027,1.91248096145222
chr1,30001,40000,1.94510909384639,1.90006068018602,1.96470746277162
chr1,40001,50000,0.576139127131562,0.588528490660998,0.635347605084456
chr1,50001,60000,1.51250200836185,1.50849932321034,1.52994133230921
chr1,60001,70000,0.681365714967938,0.676156428892953,0.699545565388925
chr1,70001,80000,0.436354857763045,0.449640001550081,0.497235183366175
chr1,80001,90000,1.05269567207548,1.04655014589231,1.06732707247313
jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
  • Hi, This awk script works well, but I found that the sorting of the output file was messed up. – Debajyoti Kabiraj May 06 '23 at 22:32
  • Oh sorry @DebajyotiKabiraj - I think this more-complicated approach isn't worth the trouble? I'll edit my answer with a simpler approach that shouldn't have a problem with values getting 'out of order'. – jared_mamrot May 07 '23 at 23:56
  • Hi @jared_mamrot, Thank you very much, now it is working well. However, when I'm trying to run it through the command line using input and output
    starting from
     {
        FS = OFS = ","
    } 
    
     >awk -f combine4.sh *.R.csv > Out.Test3.csv 
     chr,start,end,copy,,,EE87924.R,EE87926.R,EE88055.R
       
    One "copy" term comes before all the entries. Can you please suggest how to correct it.
    – Debajyoti Kabiraj May 08 '23 at 15:10
  • Interesting; I'm not able to reproduce the issue with the three example files shown in my answer (i.e. `awk -f test.sh *.R.csv` doesn't print copy in the header). Can you please edit your question and include a minimal example to reproduce the problem (e.g. show example input files, your script, the way you run the script (`awk -f script files`) and the expected output)? – jared_mamrot May 08 '23 at 23:27