0

At first, this thread might look related to genetics, but the problem is actually shell scripting and programming based. I am new to coding, so I was suggested to find a help in SO.

I try to intersect NCBI GTF files with PLINK tped files with purpose to switch chromosomal location to gene location in tped files (files with identificator, positions and nucleotides)

So, I did following steps:

1) Got gtf from 

 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

(as the file is large and complex, the best way to look at file structure is by downloading it)

2) unzipped and extracted only genes annotation with 

    awk -F"\t" '$3=="gene"' GCF_000001405.39_GRCh38.p13_genomic.gtf > aaa.gtf

3) converted NC chromosome symbols from NC_xxx to X with 

    sed -r 's/^NC_[0]*//;s/\.[0-9]\+//;/^N[WT]_|12920/d' aaa.gtf > bbb.gtf

4) converted to bed format (file with positions)

gtf2bed < bbb.gtf > ccc.bed
5) intersected with sample bed file (obtained from Plink tped file) 
format of sample bed file:

chr1    183189  183189
chr1    609407  609407

bedtools intersect -wb -a ccc.bed -b sample.bed >  ddd.bed

5) removed unwanted symbols (" and ;) from final intersected file with

    cat ddd.bed | sed 's/"//g' | sed 's/;//g' > eee.bed  


6) printed out only the needed columns from intersected file

    cat eee.bed  | awk '{print $4,$33,$35,$34,$36,$37}'  > fff.bed   

In the end I have file with different content of columns:

KLHL17 0 C C  
KLHL17 0 A A   
PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976215
PERM1 "protein_coding" "C1orf170" gene_synonym 1 976669

However the desired structure should be like this:

PLEKHN1 966179 0 966179 A A
PLEKHN1 966227 0 966227 G G
PLEKHN1 966272 0 966272 G G

Is it because of inconsistency in NCBI GTF files? It should be succesfully intersected with my sample file, so I can switch NC position to gene name and position in sample file and save tped file structure.

Thank you!

  • Can you make your question title more specific to the problem itself instead of the domain in which you encountered that problem? The goal is for someone else who has the same problem to know _if it really is_ the same problem (and thus if the answers you got will help them) just by reading the title. One can have a lot of different problems processing a given data format type, so the current title isn't very specific. – Charles Duffy May 18 '21 at 17:10
  • BTW, not an answer to your question, but it'll help your scripting in general if you avoid using `cat` except when you really need it. Sometimes it's nearly harmless, but other times it makes your script much, much slower to run; for example, `wc -c – Charles Duffy May 18 '21 at 17:14
  • Another thing -- your step 6 awk script completely ignores the `fintest` file passed on its stdin; it's unclear as a reader what the purpose of `fintest` is there. – Charles Duffy May 18 '21 at 17:16
  • Step 5 isn't valid either -- `cat test sed` concatenates together the contents of a file named `test` and a file named `sed`; it doesn't run the `sed` command at all. – Charles Duffy May 18 '21 at 17:19
  • Each Stack Overflow question should be about exactly one narrow, specific problem; can you build a [mre] for a specific step that has unexpected output, so we don't have multiple problems merged together here? – Charles Duffy May 18 '21 at 17:20
  • OK, I corrected previous issues; will split question into separate pieces – user15480777 May 18 '21 at 17:21
  • Hmm. Another note -- I don't see any decompression of the `.gz` file. Surely you aren't running it through awk still-compressed? – Charles Duffy May 20 '21 at 13:07
  • Yes, I unzipped it before. Edited! – user15480777 May 20 '21 at 13:15
  • Where does `ccc.bed-b` come from? I'm trying to follow your steps, but don't see anything that generates it. Same for `sample.bed`. – Charles Duffy May 20 '21 at 13:17
  • BTW, consider piping your steps together instead of generating so many separate output files. `a | b > final` runs both a and b in parallel, whereas `a >out1; b final` doesn't let `b` start until after `a` finished. Even if you want to inspect the intermediate outputs for debugging, `a | tee a.out | b >final` gives you that ability while still parallelizing. – Charles Duffy May 20 '21 at 13:20
  • ...which is to say, a lot of your process (the parts I've run myself so far) can be compressed down to `gunzip -c ccc.bed` – Charles Duffy May 20 '21 at 13:21
  • ccc.bed here is output, it is the usage of gtf2bed command (BedOps package) sample bed is just edited txt file – user15480777 May 20 '21 at 13:21
  • So `ccc.bed-b` and `ccc.bed` are supposed to be the same file? Why two different names? – Charles Duffy May 20 '21 at 13:21
  • BTW, you can also run `sed -e 's/"//g' -e 's/;//g'` to run both replacements in a single copy of `sed`. (Or you could tell `awk` to do those replacements, as with `gsub()`, and not need `sed` at all). – Charles Duffy May 20 '21 at 13:23
  • -a ccc.bed -b sample.bed - (a typo) – user15480777 May 20 '21 at 13:24
  • I think I've exhausted what I can add as someone closely familiar with shell scripting but not familiar with genomics tools. If you had a question about why a specific (non-genomics-domain-specific) tool wasn't doing what it's supposed to, I'd be glad to help, but I'm not sure that's the case. – Charles Duffy May 20 '21 at 13:26

0 Answers0