I am trying to visualize whole-genome sequencing (WGS) data from my parasite using the Circos software.

One of the elements I would like to depict is the regions of the reference genome for which I have no sequencing data from my parasite.

In order to do this, I used Samtools to create an mpileup file, from which I extracted the positions where the sequencing depth is 0. I therefore have a file that looks like this:

$chromosome_name $chromosome_position $depth
chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0

This means that there are 3 positions on chromosome 1 with no sequence data (depth = 0), namely positions 1, 2 and 3. For chromosome 2, the positions with no data are 67, 68, 1099, 1100 and 1101.

Because my files are enormous (up to 3 million lines) and many of the unsequenced positions occur in consecutive runs, I would like to create an interval file from the above data. Circos also requires such an interval file in order to create tiles. I therefore need to create a new file from the above that looks like this:

$chromosome_name $start_pos $end_pos
chr_1 1 3
chr_2 67 68
chr_2 1099 1101

I have searched quite a bit, but I have only found questions about grouping data by predefined intervals (e.g. grouping purchases over a period of 6 months, or patients by age).

So if anybody can help me out, I will be extremely happy! Sidsel

Vince
Sidsel

1 Answer

Consider using bedtools, specifically the bedtools merge subcommand:

http://bedtools.readthedocs.io/en/latest/content/tools/merge.html

From this page, it would seem to do what you want:

bedtools merge combines overlapping or “book-ended” features in an interval file into a single feature which spans all of the combined features.
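One caveat worth noting: bedtools merge expects BED-style interval input (chromosome, 0-based start, exclusive end), sorted by chromosome and then start, whereas the file in the question has one 1-based position per line. A minimal Python sketch of that conversion is below. The function name `positions_to_bed` and the sample list are made up for illustration, and the sketch assumes whitespace-separated columns in the order shown in the question.

```python
def positions_to_bed(lines):
    """Yield BED lines from 'chrom pos depth' lines.

    Assumes 1-based positions in the input; each position p becomes the
    single-base BED interval [p - 1, p) (0-based start, exclusive end).
    """
    for line in lines:
        fields = line.split()
        # Skip blank lines and any header line whose second column is not a number.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        chrom, pos = fields[0], int(fields[1])
        yield f"{chrom}\t{pos - 1}\t{pos}"


# Hypothetical sample taken from the data in the question.
sample = ["chr_1 1 0", "chr_1 2 0", "chr_2 67 0"]
bed_lines = list(positions_to_bed(sample))
print("\n".join(bed_lines))
```

Writing these lines to a file (say, a hypothetical zero_depth.bed) and running bedtools merge -i zero_depth.bed should then collapse adjacent positions into intervals. Since the merged output keeps BED coordinates, adding 1 to each start gives back the 1-based inclusive coordinates used in the question.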

Moreover, you can use the -d option to specify the maximum distance between features to merge:

-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features are merged.
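If installing bedtools is not an option, the same idea can be sketched in plain Python. This is not how bedtools is implemented, just a single-pass illustration under stated assumptions: the input is already sorted by chromosome and position (as mpileup-derived output normally is), positions are 1-based, and the output intervals are inclusive, as in the question. The names `merge_positions`, `parse_depth_lines` and `max_gap` are invented here; `max_gap` is intended to play a role similar to -d, counting the bases allowed between two positions.

```python
def parse_depth_lines(lines):
    """Yield (chrom, pos) from whitespace-separated 'chrom pos depth' lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and any header line whose second column is not a number.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        yield fields[0], int(fields[1])


def merge_positions(rows, max_gap=0):
    """Collapse sorted (chrom, pos) pairs into (chrom, start, end) intervals.

    Two positions on the same chromosome join the same interval when at
    most `max_gap` bases lie between them (0 means strictly consecutive).
    """
    intervals = []
    cur_chrom = start = end = None
    for chrom, pos in rows:
        if chrom == cur_chrom and pos - end - 1 <= max_gap:
            end = max(end, pos)  # extend the current interval
        else:
            if cur_chrom is not None:
                intervals.append((cur_chrom, start, end))
            cur_chrom, start, end = chrom, pos, pos  # open a new interval
    if cur_chrom is not None:
        intervals.append((cur_chrom, start, end))
    return intervals


# The example data from the question.
example = """chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0
"""

result = merge_positions(parse_depth_lines(example.splitlines()))
for chrom, start, end in result:
    print(chrom, start, end)
# prints:
# chr_1 1 3
# chr_2 67 68
# chr_2 1099 1101
```

Because the function consumes a generator, a file of a few million lines can be streamed by passing an open file object to `parse_depth_lines` instead of `example.splitlines()`.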

Vince