
I am trying to write a script that works as described below, but I cannot find the right syntax.

I have folders named like S_12_O_319_K4me1.

Each folder contains a file named like S_12_O_319_K4me1_S12816.sorted.bam.

I want the script to loop over the folders, go into each one, find the *.bam file inside and run the operation on it, but I cannot get the regex right. This is what I tried:

#!/bin/bash
#$ -S /bin/bash

spp_run=/path/phantompeakqualtools/run_spp.R
bam_loc=/path/ChIP-Seq/output

samples="S_12_O_319_K27me3
S_12_O_319_K4me1
S_12_O_319_K4me3
S_12_O_319_K27ac"

for s in $samples; do

    echo "Running SPP on $s ..."
    Rscript $spp_run -c=$bam_loc/$s/${s}_S[[0-9]+\.sorted.bam -savp -out=$bam_loc/$s/${s}".run_spp.out"
done

The regex above does not match the digits.

Where am I going wrong?

Edit: I tried the version below and it still does not work. The problem is with the argument parsing in the Rscript, but why would this be a problem?

#!/bin/bash
#$ -S /bin/bash

spp_run=/path/tools/phantompeakqualtools/run_spp.R
bam_loc=/path/ChIP-Seq/output

samples="S_12_O_319_K27me3
S_12_O_319_K4me1
S_12_O_319_K4me3"

for s in $samples; do
    echo "Running SPP on $s ..."
    echo $bam_loc/$s/${s}_S*.sorted.bam
    inbam=$bam_loc/$s/${s}_S*.sorted.bam
    echo $inbam
    Rscript $spp_run -c=$inbam -savp -out=$bam_loc/$s/${s}".run_spp.out"
done
echo "done"

Error

Error in parse.arguments(args) :
  ChIP File:/path/ChIP-Seq/output/S_12_O_319_K27me3/S_12_O_319_K27me3_S*.sorted.bam does not exist
Execution halted

The Rscript does not recognize the file, even though echo $inbam prints /path/ChIP-Seq/output/S_12_O_319_K27me3/S_12_O_319_K27me3_S12815.sorted.bam

ivivek_ngs
  • What are you expecting to interpret a regular expression at that location in the command? (Also, you appear to be missing a closing `]` in your regex attempt.) – Etan Reisner May 12 '16 at 17:13
  • Are you just trying to glob the `${s}_S*sorted.bam` file? – Etan Reisner May 12 '16 at 17:14
  • I am trying to make the Rscript pick up the bam file inside the directory `$s`, for example `S_12_O_319_K4me1_S12816.sorted.bam`, where the regex should match the alphanumeric part `S12816`, which varies for each bam file inside the folders. – ivivek_ngs May 12 '16 at 23:16
  • Why do you think regular expressions are involved here at all? What would be interpreting them? You have a file in the `$bam_loc/$s` directory that starts with `$s` and ends with `.sorted.bam` and has some amount of other characters in between? That's `$bam_loc/$s/${s}_S*.sorted.bam`. Though that will match **multiple** files if more than one exists. – Etan Reisner May 13 '16 at 02:02
  • I expected the same, but it did not work. I tried that first and it failed, which is why I was trying to use a regex here. – ivivek_ngs May 13 '16 at 09:24
  • `Error in parse.arguments(args) : ChIP File:/path/ChIP-Seq/output/S_14_O_06_K27ac/S_14_O_06_K27ac_S*.sorted.bam does not exist`. This is what is happening, so I was trying to use a regex. Any suggestions, @EtanReisner? – ivivek_ngs May 13 '16 at 09:27
  • I tried to debug it using different forms of regex and some echo statements, and I realized the problem is the parsing in the R script. Now I am not sure how to work around it. – ivivek_ngs May 13 '16 at 10:25
  • That means the glob didn't find the file you think it should. Check that the files exist where you think they do and are named the way you think they are. (A glob that fails to match is left unexpanded.) What does `ls /path/ChIP-Seq/output/S_14_O_06_K27ac/*.sorted.bam` output? – Etan Reisner May 13 '16 at 12:54
  • It does give the entire filename with the path, and the filename comes out as `S_14_O_06_K27ac_S12828.sorted.bam`. So the glob works in bash, but not when it is passed as input to the Rscript. In the comments below I provide the link to the Rscript. – ivivek_ngs May 13 '16 at 12:58
  • Oh! The glob is attached to the argument itself so the shell can't glob it. If you can use `-c $inbam` that should work. If not you'll need to use `inbam=($bam_loc/.....*.sorted.bam)` to have it globbed into an array variable and then you can use `-c="${inbam[0]}"` or `-c="$inbam"` (since `$arrayvar` is identical to `${arrayvar[0]}`). – Etan Reisner May 13 '16 at 13:01
  • Yes, I initialized `inbam` as you mentioned and it is still not recognized by the Rscript. The Rscript needs a fixed string, so I now set `` infile=`echo $inbam` `` and pass it as `-c=$infile`. That way it gets a fixed string with the full path and full file name, and that works. Not an elegant way, but it works. – ivivek_ngs May 13 '16 at 13:25
  • Doing the array thing is identical to that but less awful (and without a sub-shell). – Etan Reisner May 13 '16 at 15:23
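
The behaviour discussed in the comments above can be reproduced in a scratch directory. The following is a minimal sketch, not the original job script: the temporary directory from `mktemp -d` and the variable names `expanded`, `attached`, `bams` and `fixed` are assumptions made for illustration, and `printf` stands in for `Rscript` so that the exact argument is visible.

```shell
#!/bin/bash
# Scratch directory standing in for $bam_loc (hypothetical layout, illustration only).
bam_loc=$(mktemp -d)
s=S_12_O_319_K4me1
mkdir -p "$bam_loc/$s"
touch "$bam_loc/$s/${s}_S12816.sorted.bam"

# A plain assignment performs no pathname expansion: the variable keeps the literal '*'.
inbam=$bam_loc/$s/${s}_S*.sorted.bam

# Unquoted and on its own, the word is just the pattern, so the shell expands it.
expanded=$(echo $inbam)

# Here the word is "-c=<pattern>". No file path begins with "-c=", so nothing matches
# and the word is passed on unchanged, '*' included.
attached=$(printf '%s\n' -c=$inbam)

# An array assignment is expanded at assignment time; element 0 is a real file name.
bams=( "$bam_loc/$s/${s}"_S*.sorted.bam )
fixed=$(printf '%s\n' -c="${bams[0]}")

printf 'plain:    %s\nattached: %s\nfixed:    %s\n' "$expanded" "$attached" "$fixed"
```

With default shell options (neither `nullglob` nor `failglob` set), the `attached` line still contains the `*`, while `plain` and `fixed` show the real file name. This is consistent with the error message quoted in the question.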

2 Answers


You can use a regex in a `find` command:

export spp_run=/path/phantompeakqualtools/run_spp.R
export bam_loc=/path/ChIP-Seq/output
export dir

samples=(S_12_O_319_K27me3 S_12_O_319_K4me1 S_12_O_319_K4me3 S_12_O_319_K27ac)

for dir in "${samples[@]}"; do
  find . -type f -regex ".*/*${dir}_S[0-9]+\.sorted\.bam" \
    -exec bash -c 'echo Rscript $spp_run -c=$bam_loc/${dir}/${1##*/} -savp -out=$bam_loc/${dir}/${dir}".run_spp.out"' _ {} \;
done

Note: remove the `echo` before `Rscript` once the output meets your needs.
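
If the digits really need to be checked, one option that stays inside bash is to let a glob list the candidates and filter them with the `[[ ... =~ ... ]]` regex operator. This is a minimal sketch under assumptions: the scratch directory, the extra file ending in `_Sdraft.sorted.bam` and the names `re` and `matches` are hypothetical and only serve the illustration.

```shell
#!/bin/bash
# Hypothetical scratch layout, created here only so the sketch runs on its own.
bam_loc=$(mktemp -d)
dir=S_12_O_319_K4me1
mkdir -p "$bam_loc/$dir"
touch "$bam_loc/$dir/${dir}_S12816.sorted.bam" "$bam_loc/$dir/${dir}_Sdraft.sorted.bam"

# Regex kept in a variable; the single-quoted part protects the backslashes and '$'.
re="/${dir}"'_S[0-9]+\.sorted\.bam$'

matches=()
for f in "$bam_loc/$dir"/*.sorted.bam; do
    # The glob lists candidates; =~ applies the stricter "S followed by digits only" test.
    if [[ $f =~ $re ]]; then
        matches+=( "$f" )
    fi
done

printf '%s\n' "${matches[@]}"
```

Only the file whose suffix is `S` followed by digits ends up in `matches`; the `_Sdraft` file is listed by the glob but rejected by the regex.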

SLePort
  • No, this does not work. It should be able to run inside a bash script submitted with qsub, right? My array has 100 such directories, so I need to run them in a bash script through qsub. – ivivek_ngs May 12 '16 at 23:38
  • No, I just realized the problem is the Rscript: it does not accept the name the way bash does. – ivivek_ngs May 13 '16 at 11:53
  • @vchris_ngs You can freely modify the `Rscript` output for your needs. – SLePort May 13 '16 at 12:04
  • It is not the output, it is the name parsing. As my post above shows, the directory listing and file name expansion work in bash, but the same format does not work in the Rscript. It only accepts the full name, so I am wondering how I can modify that. The Rscript is available online: https://github.com/vd4mmind/phantompeakqualtools/blob/master/run_spp.R – ivivek_ngs May 13 '16 at 12:08
  • @vchris_ngs Why not simply change the `$bam_loc` value to full path ? – SLePort May 13 '16 at 12:36
  • `$bam_loc` is a fixed path. Inside it I have folders named after the samples, as mentioned, but the `.bam` files inside them have some extra alphanumeric characters in addition to the name from `$samples`. Ideally bash should handle this, but the `Rscript` has some hard-coded parsing, so I am now trying to rename the bam files to match the `$samples` names, which would be easier. I have more than 1000 files, so I do not think fixing `$bam_loc` would let me loop; I would have to write the same Rscript command 1000 times. – ivivek_ngs May 13 '16 at 12:48

I found an answer to my question; the code is below. I realized that the Rscript requires the full path and full file name, so I captured the output of the echo command in a variable and passed that to the Rscript as the input file argument. It now receives a full path with the full file name and recognizes the input file.

It is not elegant, but it works for me.

#!/bin/bash
#$ -S /bin/bash

spp_run=/path/tools/phantompeakqualtools/run_spp.R
bam_loc=/path/ChIP-Seq/output

samples="S_12_O_319_K27me3
S_12_O_319_K4me1
S_12_O_319_K4me3"

for s in $samples; do
    echo "Running SPP on $s ..."
    echo $bam_loc/$s/${s}_S*.sorted.bam
    inbam=$bam_loc/$s/${s}_S*.sorted.bam
    echo $inbam
    infile=`echo $inbam`
    Rscript $spp_run -c=$infile -savp -out=$bam_loc/$s/${s}".run_spp.out"
done
echo "done"

Thanks, everyone, for the suggestions and comments. I am posting the answer here because it works.
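
A variant of the loop that expands the glob into an array and checks the number of matches might look like the sketch below. It is a dry run under assumptions: the scratch directory replaces the real `$bam_loc`, the `commands` array is a hypothetical helper, and the `Rscript` call is only printed, not executed.

```shell
#!/bin/bash
# Scratch stand-ins for the real locations (hypothetical, so the sketch runs anywhere).
bam_loc=$(mktemp -d)
spp_run=/path/tools/phantompeakqualtools/run_spp.R
samples=(S_12_O_319_K27me3 S_12_O_319_K4me1 S_12_O_319_K4me3)
mkdir -p "$bam_loc/S_12_O_319_K27me3" "$bam_loc/S_12_O_319_K4me1"
touch "$bam_loc/S_12_O_319_K27me3/S_12_O_319_K27me3_S12815.sorted.bam" \
      "$bam_loc/S_12_O_319_K4me1/S_12_O_319_K4me1_S12816.sorted.bam"

shopt -s nullglob   # an unmatched pattern expands to nothing instead of staying literal

commands=()
for s in "${samples[@]}"; do
    # Array assignment: the glob is expanded here, before any "-c=" prefix is attached.
    bams=( "$bam_loc/$s/${s}"_S*.sorted.bam )
    if (( ${#bams[@]} != 1 )); then
        echo "Skipping $s: expected 1 BAM file, found ${#bams[@]}" >&2
        continue
    fi
    # Dry run: the command line is only recorded; run it for real once it looks right.
    commands+=( "Rscript $spp_run -c=${bams[0]} -savp -out=$bam_loc/$s/$s.run_spp.out" )
done

printf '%s\n' "${commands[@]}"
```

In this sketch the third sample has no directory, so it is reported on stderr and skipped, which makes a missing or duplicated BAM file visible before a job is submitted.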

ivivek_ngs
  • There's no need to use `echo` to set `infile`; `infile=$inbam` works as well. You can also simply use `inbam` as it is, without setting `infile` at all. – chepner May 16 '16 at 11:59
  • No, it will not work in this case, since the `Rscript` is designed to accept only the hard-coded full path and full name of the input `.bam` file. So I had to use `` infile=`echo $inbam` ``; otherwise the previous code would also have worked. It is not a problem with bash but with how the argument parsing for the input `.bam` files is done in the `Rscript`. – ivivek_ngs May 16 '16 at 13:13
  • First, you should be quoting `$inbam`: `infile=$(echo "$inbam")`. After that, the *only* way `infile` and `inbam` can have different values is if `$inbam` contained one or more trailing newlines, which is not the case here. Rscript has nothing to do with this. – chepner May 16 '16 at 14:15