-2

I am learning to process VCF (variant call files) to produce plots and reports. Here is the R code, which crashes for unknown to me reasons. Please advise how to fix it and tell appropriate tutorials.

library(VariantAnnotation)  
library(SNPRelate)

vcf<-readVcf("test.vcf","hg19") # load your VCF file from a set dir

snpgdsVCF2GDS("test.vcf", "my.gds")

snpgdsSummary("my.gds")

genofile <- openfn.gds("my.gds")

#dendogram

dissMatrix  <-  snpgdsDiss(genofile , sample.id=NULL, snp.id=NULL,
autosome.only=TRUE,remove.monosnp=TRUE, maf=NaN, missing.rate=NaN,
num.thread=10, verbose=TRUE)

snpHCluster <-  snpgdsHCluster(dist, sample.id=NULL, need.mat=TRUE,
hang=0.25)

cutTree <- snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5,
n.perm = 5000, samp.group=NULL,col.outlier="red", col.list=NULL,
pch.outlier=4, pch.list=NULL,label.H=FALSE, label.Z=TRUE,
verbose=TRUE)

#pca

sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))

pop_code <- read.gdsn(index.gdsn(genofile, "sample.id")

pca <- snpgdsPCA(genofile)

tab <- data.framesample.id = pca$sample.id,pop =
factor(pop_code)[match(pca$sample.id, sample.id)],EV1 =
pca$eigenvect[,1],EV2 = pca$eigenvect[,2],stringsAsFactors = FALSE)

plot(tab$EV2, tab$EV1, col=as.integer(tab$pop),xlab="eigenvector 2",
ylab="eigenvector 1") legend("topleft", legend=levels(tab$pop),
pch="o", col=1:nlevels(tab$pop))
Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Vassia Alk
  • 140
  • 4
  • 1
    I think folks could give you more help if you included some example of test.vcf or pointed to a location where a small example file could be downloaded. Also what errors are you seeing, at which part of that script. There is a lot of code above; exactly what isn't working. – jraab Feb 03 '15 at 13:35

1 Answers1

1

Your code has several issues:

-the snpgdsHCluster step should be run on dissMatrix, not dist:

snpHCluster <-  snpgdsHCluster(dissMatrix, sample.id=NULL, need.mat=TRUE,
    hang=0.25)

-you need a paren after dataframe in the tab line:

tab <- data.frame(sample.id = pca$sample.id,pop =
   factor(pop_code)[match(pca$sample.id, sample.id)],EV1 =
   pca$eigenvect[,1],EV2 = pca$eigenvect[,2],stringsAsFactors = FALSE)

-legend is a separate command from plot:

plot(tab$EV2, tab$EV1, col=as.integer(tab$pop),xlab="eigenvector 2",
    ylab="eigenvector 1") 

legend("topleft", legend=levels(tab$pop),
    pch="o", col=1:nlevels(tab$pop))

I think otherwise it should work for you.

heathobrien
  • 1,027
  • 7
  • 11