2

I have the following data (small part of it) named "short2_pre_snp_tumor.txt"

rs987435        C       G       1       1       1       0       2
rs345783        C       G       0       0       1       0       0
rs955894        G       T       1       1       2       2       1
rs6088791       A       G       1       2       0       0       1
rs11180435      C       T       1       0       1       1       1
rs17571465      A       T       1       2       2       2       2
rs17011450      C       T       2       2       2       2       2
rs6919430       A       C       2       1       2       2       2
rs2342723       C       T       0       2       0       0       0
rs11992567      C       T       2       2       2       2       2

and I need to get the PED and MAP file using Python, as R is really slow in case of large dataset.

I have the following code in R:

 tm <- proc.time()
    d<-read.table("short2_pre_snp_tumor.txt")
    n<-nrow(d)  #237196
    nrs<-ncol(d)-3 #1116
    dd<- data.frame(matrix(NA, nrow= ncol(d)-3, ncol=2*nrow(d)), stringsAsFactors=TRUE)

    for (j in 1:nrs) {
    for (i in 1:n)  { 
    if (d[i, j+3]==0) {
    dd[j, 2*i-1]<-as.character(d[i,2])
    dd[j, 2*i]<-as.character(d[i,2])
    } else if (d[i, j+3]==1) {
    dd[j, 2*i-1]<-as.character(d[i,2])
    dd[j, 2*i]<-as.character(d[i,3])
    } else if (d[i, j+3]==2) {
    dd[j, 2*i-1]<-as.character(d[i,3])
    dd[j, 2*i]<-as.character(d[i,3])
    }
    }
    }


 ped6front<-data.frame(FID = 1: nrow(dd), IID= 1: nrow(dd), PID=0, MID=0, SEX= sample(1:2, nrow(dd), replace=T), PHENOTYPE=2)
    BRCA_tumorfromR.ped <- cbind(ped6front,dd)
   write.table(BRCA_tumorfromR.ped, “BRCA_tumor.ped”, append=FALSE, quote=FALSE, col.names=FALSE)

    proc.time() #ptm
zx8754
  • 52,746
  • 12
  • 114
  • 209
mms
  • 365
  • 1
  • 3
  • 12
  • plink is not what you think. Edited out. And could you be more specific? because we don't know your trade. What do you want to extract (columns? rows?) and show us your code, promise I won't downvote even if it is bad :) – Jean-François Fabre Nov 29 '16 at 23:29
  • @Jean-François Fabre I want to perform an Association analysis using Plink.. so I have the code in R to get the PED and the MAP file but it takes long time to get the results – mms Nov 29 '16 at 23:38
  • still, I don't know what a PED or a MAP file. Can you transcribe it in simple column name, etc... generic language not bioinformatics lingo. – Jean-François Fabre Nov 30 '16 at 05:23
  • @Jean-FrançoisFabre I added links to plink's PED MAP formats. – zx8754 Nov 30 '16 at 07:39
  • R would be pretty quick, too, if we manage to vectorise the solution. With your current script, we are looping through every row, this is the problem. – zx8754 Nov 30 '16 at 07:42

1 Answers1

2

Here is using R:

# raw data
myRaw <- read.table(text = "
rs987435        C       G       1       1       1       0       2
rs345783        C       G       0       0       1       0       0
rs955894        G       T       1       1       2       2       1
rs6088791       A       G       1       2       0       0       1
rs11180435      C       T       1       0       1       1       1
rs17571465      A       T       1       2       2       2       2
rs17011450      C       T       2       2       2       2       2
rs6919430       A       C       2       1       2       2       2
rs2342723       C       T       0       2       0       0       0
rs11992567      C       T       2       2       2       2       2")

nIndividuals <- ncol(myRaw) - 3
nSNPs <- nrow(myRaw)

# make map, easy
MAP <- data.frame(
  CHR = 1,
  SNP = myRaw$V1,
  CM = 0,
  BP = seq(nSNPs))

# get first 6 columns of PED, easy
PED6 <- data.frame(
  FID = seq(nIndividuals),
  IID = seq(nIndividuals),
  FatherID = 0,
  MotherID = 0,
  Sex = 1,
  Phenotype = 1)

# convert 0,1,2 to genotypes, a bit tricky
# make helper dataframe for matching alleles
myAlleles <- data.frame(
  AA = paste(myRaw$V2, myRaw$V2),
  AB = paste(myRaw$V2, myRaw$V3),
  BB = paste(myRaw$V3, myRaw$V3))

# make index to match with alleles
PEDsnps <- myRaw[, 4:ncol(myRaw)] + 1

# convert
PEDsnpsAB <- 
  sapply(seq(nSNPs), function(snp)
    sapply(PEDsnps[snp, ], function(ind) myAlleles[snp, ind]))

# column bind first 6 cols with genotypes
PED <- cbind(PED6, PEDsnpsAB)

#output PED and MAP
write.table(PED, "gwas.ped", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")
write.table(MAP, "gwas.map", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")

# test plink
# plink --file gwas
# PLINK v1.90b3c 64-bit (2 Feb 2015)         https://www.cog-genomics.org/plink2
# (C) 2005-2015 Shaun Purcell, Christopher Chang   GNU General Public License v3
# Logging to plink.log.
# 258273 MB RAM detected; reserving 129136 MB for main workspace.
# .ped scan complete (for binary autoconversion).
# Performing single-pass .bed write (10 variants, 5 people).
# --file: plink.bed + plink.bim + plink.fam written.
zx8754
  • 52,746
  • 12
  • 114
  • 209
  • I have a question why phenotype=1 ? – mms Nov 30 '16 at 18:16
  • @MarwahSoliman It can be any phenotype you have, I used dummy 1, meaning everyone is control. We can also skip this column, but then we need to use "--no-pheno" flag. – zx8754 Nov 30 '16 at 18:22
  • actually this data is for people who has tumor then the phenotype=2 , right? – mms Nov 30 '16 at 18:51
  • if I need to generalize the code for the whole data set not just the 10 rows but they have the same number of columns what should be modified because I got NA beyond that. – mms Dec 01 '16 at 00:41
  • @MarwahSoliman In my code the number of columns and rows are calculated from the input data. If your "real" data is the same as the example you provided, then above code should work fine. Please ensure the genotypes are always 0, 1, or 2. – zx8754 Dec 01 '16 at 07:00
  • I got the following output for the .ped file > PED { FID IID FatherID MotherID Sex Phenotype 1 2 3 4 5 6 7 8 9 10 V4 1 1 0 0 1 1 C G C C G T A G C T A T T T C C C C T T V5 2 2 0 0 1 1 C G C C G T G G C C T T T T A C T T T T V6 3 3 0 0 1 1 C G C G T T A A C T T T T T C C C C T T V7 4 4 0 0 1 1 C C C C T T A A C T T T T T C C C C T T V8 5 5 0 0 1 1 G G C C G T A G C T T T T T C C C C T T} – mms Dec 03 '16 at 15:58
  • it give me warnings when I run the PEDsnps <- myRaw[, 4:ncol(myRaw)] + 1, it says In Ops.factor(left, right) : ‘+’ not meaningful for factors – mms Dec 03 '16 at 16:45
  • I used the following, PEDsnps <- as.numeric(as.character( myRaw[ , 4:ncol(myRaw)]))+1, is this correct – mms Dec 03 '16 at 17:34
  • I don't know how to fix the error I tried so many things it doesn't work, could you help please?? – mms Dec 03 '16 at 19:01
  • @MarwahSoliman solution works for the data you provided, I can't help, if you don't supply representative data. – zx8754 Dec 03 '16 at 19:24
  • the data is 905460 x 1119, so as in the example I provided it has 1119 columns instead of 8 columns , the 1st to 3rd columns are the same and starting from the 4th column it is just 0,1 or 2 – mms Dec 03 '16 at 20:34
  • @MarwahSoliman by representative, I meant, I need to have the data that would fail my solution. If I cannot reproduce the problem I cannot help. – zx8754 Dec 03 '16 at 20:49
  • I found the solution, Thank you, again I appreciate your help :) – mms Dec 05 '16 at 18:13
  • Please share your solution. – zx8754 Dec 05 '16 at 19:03
  • sorry it took me time to answer, the problem was that there was a character "-" that made 0,1,2 be read as a factor so I reclined the data and it works but it took long time , R is not good for huge data set – mms Jan 17 '17 at 00:13