8

I'm struggling to read my tables in Variant Call Format (VCF) with R. Each file has some comment lines starting with ##, and then the header starting with #.

## contig=<ID=OTU1431,length=253>
## contig=<ID=OTU915,length=253>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  /home/sega/data/bwa/reads/0015.2142.fastq.q10sorted.bam
Eubacterium_ruminantium_AB008552    56  .   C   T   228 .   DP=212;AD=0,212;VDB=0;SGB=-0.693147;MQ0F=0;AC=2;AN=2;DP4=0,0,0,212;MQ=59    GT:PL   1/1:255,255,0

How can I read such table without missing a header? Using read.table() with comment.char = "##" returns an error: "invalid 'comment.char' argument"

zx8754
  • 52,746
  • 12
  • 114
  • 209
Slavskii Sergei
  • 138
  • 1
  • 9

1 Answers1

9

If you want to read VCF, you can also just try to use readVcf from VariantAnnotation in Bioconductor. https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html

Otherwise, I can highly recommend fread function in data.table package. It allows you to use the skip argument to allow it to start importing when a substring has been found.

e.g.

fread("test.vcf", skip = "CHROM")

should work.

Henrik
  • 65,555
  • 14
  • 143
  • 159
wligtenberg
  • 7,695
  • 3
  • 22
  • 28
  • 5
    **fread** function with _skip_ argumernt was really helpful. Thanks. – Slavskii Sergei Feb 21 '17 at 15:10
  • Sometimes VariantAnnotation fails and I think this second suggestion really works quite well for corner case VCFs. One can do whatever is needed to clean the data and then add a %>% Granges() at the end to do overlap analysis, annotation, and merging (after making sure that chr, start, and end come first with a dplyr::select(chr,start,end,dplyr::everything()). – James Dalgleish Jul 17 '23 at 08:45