3

I'm dealing with text files and vectors.

I have a space separated text file with the following format:

id1 AA 44 AG 20 GG 36
id2 CC 30 CT 22 TT 48
id3 CT 60 CC 30 TT 10
...

And I need a code that loops through each line and put the id in a variable and the rest of the values in a vector. Example of a vector corresponding to the first line:

x <- id1
y <- c(AA=40,AG=20,GG=36)

Edit: I need to use HWChisq function from HardyWeinberg package to exclude SNPs that have p-value < 0.001. Function requires named vector of counts for each allele.

zx8754
  • 52,746
  • 12
  • 114
  • 209
NST
  • 115
  • 10
  • Looks like genotypes data, and numbers are percentages of alleles (sum is 100)? Could you tell us why do you need this data in that format, what is the next step? – zx8754 Mar 21 '20 at 20:41
  • 1
    @zx8754 Yes you are right! But the numbers are the count of alleles not the percentage. I need to use HWChisq function from HardyWeinberg package to exclude SNPs that have p-value < 0.001 – NST Mar 21 '20 at 20:44
  • 1
    @NST if you can show a `dput` of modified example, it would be great – akrun Mar 21 '20 at 20:52

2 Answers2

2

If we have alternate columns (assuming we have an object created in R by reading the .csv file with read.csv/read.table), then split by row with asplit excluding the first column 'id' column, and create a named vector with setNames

lst1 <- Map(setNames, asplit(df1[-1][c(FALSE, TRUE)], 1), 
         asplit(df1[-1][c(TRUE, FALSE)], 1))
names(lst1) <- df1[[1]]
lst1$id1
# AA AG GG 
# 44 20 36 

data

df1 <- structure(list(id = c("id1", "id2", "id3"), v1 = c("AA", "CC", 
"AA"), v2 = c(44L, 30L, 60L), v3 = c("AG", "CT", "AG"), v4 = c(20L, 
22L, 30L), v5 = c("GG", "TT", "GG"), v6 = c(36L, 48L, 10L)), 
class = "data.frame", row.names = c(NA, 
-3L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you akurn. The columns are not alternate, it's just an example. I will modify the question. – NST Mar 21 '20 at 20:41
1

Loop through row by row, then apply the HWE function:

library("HardyWeinberg")

# data
df1 <- read.table(text = "
id1 AA 44 AG 20 GG 36
id2 CC 30 CT 22 TT 48
id3 CT 60 CC 30 TT 10", header = FALSE, stringsAsFactors = FALSE)

out <- apply(df1[, c(3, 5, 7)], 1, function(i){
  x <- HWChisq(setNames(i, c("AA", "AB", "BB")), verbose = FALSE)
  x$pval
})

# [1] 5.774374e-09 1.182236e-07 7.434226e-02

Pretty output:

cbind(df1, HWE = out)
#    V1 V2 V3 V4 V5 V6 V7          HWE
# 1 id1 AA 44 AG 20 GG 36 5.774374e-09
# 2 id2 CC 30 CT 22 TT 48 1.182236e-07
# 3 id3 CT 60 CC 30 TT 10 7.434226e-02

To calculate the HWE for X-chromosome see vignette:

4. X-chromosomal tests for Hardy-Weinberg equilibrium

Recently, Graffelman and Weir (2016) have proposed specific tests for HWE for bi-allelic markers on the X-chromosome. These tests take both males and females into account. The X-chromosomal tests can be carried out by the same functions mentioned in the previous Section (HWChisq, HWLratio, HWExact, HWPerm) and adding the argument x.linked=TRUE to the function call.

zx8754
  • 52,746
  • 12
  • 114
  • 209
  • I have many SNPs that has two genotypes only. How should use HWChisq test for them? your help is highly appreciated. – NST Mar 22 '20 at 13:53
  • @NST See the edit, it is explained in the vignette, please read the manuals. – zx8754 Mar 23 '20 at 07:32