8

I have a .txt file that looks something like this:

rs1 NC AB NC     
rs2 AB NC AA  
rs3 NC NC NC  
...  

For each row, I would like to count the frequencies of "NC", so that my output will be something like below:

rs1 2  
rs2 1  
rs3 3  
...

Can someone tell me how to do this in R or in Linux? Many thanks!

David Arenburg
Renee

4 Answers

10
df$count <- rowSums(df[-1] == "NC")
#    V1 V2 V3 V4 count
# 1 rs1 NC AB NC     2
# 2 rs2 AB NC AA     1
# 3 rs3 NC NC NC     3

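The expression df[-1] == "NC" drops the first (ID) column and returns a logical matrix that is TRUE wherever a cell equals "NC". rowSums then counts the TRUE values in each row, which gives the number of "NC" entries per row.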
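Since the question also asks about Linux, the same per-row count can be sketched with awk. This is a minimal sketch, assuming the file is whitespace-separated with the ID in the first field; the temporary file and the variable names (tmpfile, result) are hypothetical and only there to make the example self-contained.

```shell
# Hypothetical sample file reproducing the three rows from the question.
tmpfile=$(mktemp)
printf 'rs1 NC AB NC\nrs2 AB NC AA\nrs3 NC NC NC\n' > "$tmpfile"

# For each line, compare fields 2..NF with "NC" and print the ID with the count.
result=$(awk '{ n = 0; for (i = 2; i <= NF; i++) if ($i == "NC") n++; print $1, n }' "$tmpfile")
echo "$result"

# Clean up the temporary file.
rm -f "$tmpfile"
```

On a real file, the awk command can be pointed at that file directly instead of the temporary one.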

Pierre L
  • 1
    Thanks! : ) Of course not. It also does not work if `df` is an igraph mapping, it also doesn't work if `df` is an orange slice or flat screen tv. Not sure what your point is. – Pierre L Oct 01 '17 at 17:26
7
dat <- read.table(text="rs1 NC AB NC rs2 AB NC AA rs3 NC NC NC")
dat <- rbind(dat, dat, dat, dat)

You can use a row-wise table to get the frequencies per row. In this case, the frequencies for rows 1 to 4 are identical because I copied the data:

freq <- apply(dat, 1, table)
    1 2 3 4 # row-number
AA  1 1 1 1
AB  2 2 2 2
NC  6 6 6 6
rs1 1 1 1 1
rs2 1 1 1 1
rs3 1 1 1 1

If you want the frequencies aggregated over all rows, use:

rowSums(freq)
AA  AB  NC rs1 rs2 rs3 
 4   8  24   4   4   4 
Rentrop
0

With a newer version of dplyr (>= 1.0), you can use rowwise() and c_across() to count the "NC" values across columns within each row.

dat <- read.table(text="
SNP G1 G2 G3
rs1 NC AB NC
rs2 AB NC AA
rs3 NC NC NC", header=TRUE)

library(dplyr)
dat %>% 
  rowwise() %>% 
  mutate(Total = sum(c_across(G1:G3)=="NC"))
#   SNP   G1    G2    G3    Total
#   <chr> <chr> <chr> <chr> <int>
# 1 rs1   NC    AB    NC        2
# 2 rs2   AB    NC    AA        1
# 3 rs3   NC    NC    NC        3
MrFlick
0

This is what worked for me. I had a set of 24 variables that used a "1 or missing" system (checked or unchecked), and I needed to count them. The final score was the count of checked variables, i.e. 24 minus the number of missing values in each row.

data <- data %>% dplyr::mutate(final_score = 24 - rowSums(across(c( Var1, Var2, Var3, Var4, Var5, Var6, Var7, Var8, Var9, Var10, Var11, Var12, Var13, Var14, Var15, Var16, Var17, Var18, Var19, Var20, Var21, Var22, Var23, Var24), is.na)))

Notes: The variables were not actually named XXX#; they each had different names, so selecting them by a name pattern such as var1-var24 was not possible.

I could have used a range such as var1:var24, but I don't like trusting the order of variables in my data set.

Barry DeCicco