I am analyzing a genetic sequence in R. The columns of the dataframe are the SNPs, and the rows are individuals. The genotype for each individual in the sample for that SNP is recorded as a character, like "CC", "AC", "AA". Since there are only three possible genotypes for each SNP, R reads each column as a factor variable.
I want to get the correlation between each pair of columns, but in order to do that, I need a numeric dataframe. I have been able to read the data in as characters instead of factors and convert the data to either 0, 1, or 2 (as characters) depending on the genotype.
But when I try to convert these characters to numeric, R coerces the '0's to NA. Why is this happening, and how can I prevent it? I am not sure how to show my data here; otherwise I would include a small sample of it. Any help is much appreciated!
Edit: The name of my dataset is 'hgdpakt'.
This is the code I used to convert the character data, for example from "CC" to "1":
genowt1 = allele.names(genotype(hgdpakt[,1],sep = "", reorder = "freq"))
This gives me the first and second characters of the genotype as a list, ordered by the frequency of that allele. Next,
A = paste(genowt1[1],genowt1[1],sep = "")
B = paste(genowt1[2],genowt1[2],sep = "")
C = paste(genowt1[1],genowt1[2],sep = "")
D = paste(genowt1[2],genowt1[1],sep = "")
After this assignment, I used the following code to assign each genotype '0','1' or '2' depending on how many minor alleles that genotype carried:
for (j in 1:length(hgdpakt[,1])) {
    if (hgdpakt[j,1] == A & (!is.na(hgdpakt[j,1]))) {
        hgdpakt[j,1] == 0
    } else if (hgdpakt[j,1] == B & (!is.na(hgdpakt[j,1]))) {
        hgdpakt[j,1] = 2
    } else if (hgdpakt[j,1] == C || hgdpakt[j,1] == D || (is.na(hgdpakt[j,1]) = TRUE)) {
        hgdpakt[j,1] = 1
    }
}
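For comparison, the recoding described above (0, 1 or 2 minor alleles per genotype) can also be sketched without a loop, using a named lookup vector. This is only an illustration on made-up data: `alleles` and `geno` are hypothetical stand-ins for `genowt1` and `hgdpakt[,1]`, and in this sketch missing genotypes are simply left as NA, which may differ from what the loop above does.

```r
# Hypothetical stand-ins: `alleles` plays the role of genowt1 (major allele
# first), `geno` plays the role of one SNP column read in as character.
alleles <- c("C", "A")
geno <- c("CC", "AC", "AA", "CA", NA, "CC")

major <- alleles[1]
minor <- alleles[2]

# Lookup from genotype string to the number of minor alleles it carries.
lookup <- c(0, 2, 1, 1)
names(lookup) <- c(paste0(major, major),   # homozygous major -> 0
                   paste0(minor, minor),   # homozygous minor -> 2
                   paste0(major, minor),   # heterozygous     -> 1
                   paste0(minor, major))   # heterozygous     -> 1

# Index the lookup by genotype; NA genotypes stay NA.
coded <- unname(lookup[geno])
print(coded)
```

Because `lookup` is already numeric, the result needs no further `as.numeric` step.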
After this, I used 'as.numeric' to convert to numeric:
hgdpakt[,1] = as.numeric(hgdpakt[,1])
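As a point of reference for this step, the snippet below shows how `as.numeric` treats a character vector. The vector `x` is made up for illustration and is not taken from 'hgdpakt'; it mixes recoded values with one genotype string that was never recoded.

```r
# Made-up character vector: recoded values plus an unrecoded genotype and an NA.
x <- c("0", "1", "2", "CC", NA)

# as.numeric() parses digit strings, but any string that is not a number
# becomes NA (R normally warns "NAs introduced by coercion").
num <- suppressWarnings(as.numeric(x))
print(num)

# Tabulating the distinct values before converting shows what is really stored.
print(table(x, useNA = "ifany"))
```

Running a similar `table` call on the real column just before the conversion would show whether any entries are still genotype strings rather than "0", "1" or "2".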
Hope this helps.