0

I'm trying to make pileup files using samtools from two files, File1 and File2.

I have split up File1 and File2 by chromosome, resulting in having 44 files named following the format:

chr${c}.${TISSUE}_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY

where ${c} is a number between 1 and 22, and $TISSUE is either colon or muscle--22 chromosomes for colon, and 22 for muscle. I.e.; chr1.colon_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY chr2.colon_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY

.
.
.

chr22.colon_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY
chr1.muscle_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY
.
.
.

These files consist of two columns, the first just shows the chromosome number, and the second column is a position on that chromosome. I.e.;

head chr2.colon_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY 
chr2 103977
chr2 112051
chr2 126199
chr2 146288
chr2 147797
chr2 147822
chr2 148548
chr2 148525
chr2 158189
chr2 158188

For each row in the file (for example, "chr2.colon_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY"), I need to take the position, call it 'x', from column 2, and use it to get a range of a-b, where a=x-5 and b=x+5. I'll then plug those values into the following script:

samtools mpileup -f [REFERENCE GENOME] File1 File2 -r chr${c}:a-b

For example, suppose I'm looking at chromosome 2, position 103977 (row 1 above). Then my script would be

samtools mpileup -f [REFERENCE GENOME] File1 File2 -r chr2:103972-103982

So basically, it's a loop within a loop within a loop. Something like,

for t in $(colon, muscle)
do
  for c in $seq (1 22)
  do
    for item (or maybe row?) in 
      chr${c}.${t}_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY
    do
      awk '{print $2}' | something something something 
      x= position in col 2, a=x-5 b=x+5
      samtools mpileup -f [REFERENCE GENOME] File1 File2 -r chr${c}:a-b
    done 
  done
done
...

Thanks in advance. I'm brand new to working with Linux and I have essentially no computer science training.

tripleee
  • 175,061
  • 34
  • 275
  • 318
Emm Gee
  • 165
  • 7
  • 2
    Hi, please use edit and use the Code(curly brackets) feature on the post editor for readability. The question is unreadable, please organize it. I recommend you to look on other questions to learn how to write a proper question. GL :) – Blacky Dec 31 '17 at 20:57

2 Answers2

1

Awk processes a line at a time, so I'd go for something like

for t in colon muscle; do
    for c in $(seq 1 22); do
        awk '{ print $2-5 "-" $2+5 }' chr${c}.${t}_run1_en2hic_PE1.bam.sorted.bam.breaks_COL1_and_COL2_ONLY |
        while read -r range; do
            samtools mpileup -f [REFERENCE GENOME] File1 File2 -r chr${c}:$range
        done 
    done
done

In other words, Awk processes the whole file and feeds one line of output at a time to the final while read -r range loop.

I don't understand how you split these files in the first place, or what a pileup is, but I suspect this could be simplified considerably if you just worked directly on File1 and File2 instead.

You could probably also avoid the outer loops and just run Awk on all the *_ONLY files directly. You can get the current file name from Awk's internal variable FILENAME but in this case you can apparently just use the first field.

awk '{ print $1 ":" $2-5 "-" $2+5 }' *_ONLY |
while read -r chrrange; do
    samtools mpileup -f [REFERENCE GENOME] File1 File2 -r "$chrrange"
done

If you can't use $1 directly, try split(FILENAME, f, /\./) and print f[1] to get the chromosome identifier part from the filename.

tripleee
  • 175,061
  • 34
  • 275
  • 318
0

This is what ended up working for me:

module load SAMtools

awk '{print $1, $2-5 "-" $2+5}' FILE PATH |\
while read chrom range
do

    samtools mpileup -f /REFERENCE GENOME\
            /${chrom}.COLON BAM FILE\
            /${chrom}.MUSCLE BAM FILE\
            -r $chrom:$range -o ${chrom}.colon.${range}.pileup

done

Thank you for your help!

Emm Gee
  • 165
  • 7