0

I am analyzing a genetic sequence in R. The columns of the dataframe are the SNPs, and the rows are individuals. The genotype for each individual in the sample for that SNP is recorded as a character, like "CC", "AC", "AA". Since there are only three possible genotypes for each SNP, R reads each column as a factor variable.

I want to get the correlation between each pair of columns, but in order to do that, I need a numeric dataframe. I have been able to read the data in as characters instead of factors and convert the data to either 0, 1, or 2 (as characters) depending on the genotype.

But when I am trying to convert these characters to numeric, R is coercing the '0's to NA. Why is this happening and how can I prevent that? I am not sure how to show my data here, otherwise I would like to show a small sample of it. Any help is much appreciated!

Edit: The name of my dataset is 'hgdpakt'.

This is the code I used to convert the character data from "CC" to "1", for eg:

genowt1 = allele.names(genotype(hgdpakt[,1],sep = "", reorder = "freq"))

This gives me the first and second characters of the genotype as a list, ordered by the frequency of that allele. Next,

A = paste(genowt1[1],genowt1[1],sep = "")
B = paste(genowt1[2],genowt1[2],sep = "")
C = paste(genowt1[1],genowt1[2],sep = "")
D = paste(genowt1[2],genowt1[1],sep = "")

After this assignment, I used the following code to assign each genotype '0','1' or '2' depending on how many minor alleles that genotype carried:

for(j in 1:length(hgdpakt[,1])){
if (hgdpakt[j,1] == A & (!is.na(hgdpakt[j,1]))){
  hgdpakt[j,1] == 0
}else if (hgdpakt[j,1] == B & (!is.na(hgdpakt[j,1]))){
  hgdpakt[j,1] = 2
}else if 
  (hgdpakt[j,1] == C || hgdpakt[j,1] == D || (is.na(hgdpakt[j,1])= TRUE)){
  hgdpakt[j,1] = 1
}

}

After this, I used 'as.numeric' to convert to numeric:

hgdpakt[,1] = as.numeric(hgdpakt[,1])

Hope this helps.

KVemuri
  • 194
  • 1
  • 16
  • Please add some test code. It will be easier to determine what's happening that way. – R_User Apr 10 '15 at 23:07
  • I added some code, but the actual code I used to convert the character data from, for eg, "CC" to "0", is about 20 lines. I am struggling to format that here, but I will try. Thanks for replying! – KVemuri Apr 10 '15 at 23:41
  • Keep in mind that most users here are *not* bioinformaticians, and most of the Information and code in your question will be opaque to them. It’s also mostly irrelevant to the problem at hand. – Konrad Rudolph Apr 11 '15 at 00:09
  • I realize that, but I was trying to give some background to my problem, especially because I was unable to upload any code or sample data. But point taken. – KVemuri Apr 11 '15 at 01:02

1 Answers1

1

How about this? There are probably more efficient ways to find the minor allele, but I'm going to go ahead and use the method based on the genetics package that you suggested above:

library("genetics")
set.seed(101)
genotypes <- c("CC", "AC", "AA")
dd <- as.data.frame(replicate(6,sample(genotypes,10,replace=TRUE)))
count_minor_alleles <- function(x) {
   minor <- allele.names(genotype(x,sep="",reorder="freq"))[1]
   sapply(strsplit(as.character(x),""),
          function(z) sum(z==minor))
}
dd[] <- lapply(dd,count_minor_alleles)
cor(dd)

dd[] <- ... is a minor hack to replace the contents of the data frame without changing the overall formatting; as.data.frame(lapply(dd,count_minor_alleles)) would also work (the basic issue is that lapply() returns a list, which needs to be converted back into a data frame in some way)

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thanks for the reply! I tried this first thing this morning: hgdpakt=lapply(hgdpakt,as.νmeric)−1hgdpakt[] = lapply(hgdpakt,as.numeric)-1 , to get a 0,1,2 coding, but this codes my data depending on what the levels are, and not based on how many minor alleles each genotype has. In the example you used above, if A was the minor allele, I should get a value of 1 for AC, but following your code, I get a value of 2. That's why i had to do everything separately. – KVemuri Apr 11 '15 at 00:15
  • Sorry for the jumbled code above. This is the code I used: `hgdpakt = lapply(hgdpakt,as.numeric)-1` – KVemuri Apr 11 '15 at 00:22
  • Thank you so much for that! I was only looking for some pointers, but your code worked perfectly. I have a follow up question, related to the code in the solution. How does the '[]' work? That is new to me. As far as I gather, it applies the function over all the columns of the dataframe, but I could be wrong. – KVemuri Apr 11 '15 at 03:22