28

I can't imagine I'm the first person with this question, but I haven't found a solution yet (here or elsewhere).

I have a few columns, which I want to average in R. The only minimally tricky aspect is that some columns contain NAs.

For example:

Trait Col1 Col2 Col3
DF    23   NA   23
DG    2    2    2
DH    NA   9    9

I want to create a Col4 that averages the entries in the first 3 columns, ignoring the NAs. So:

 Trait Col1 Col2 Col3 Col4
 DF    23   NA   23   23
 DG    2    2    2    2
 DH    NA   9    9    9 

Ideally something like this would work:

data$Col4 <- mean(data$Col1, data$Col2, data$Col3, na.rm=TRUE)

but it doesn't.

Edward Ruchevits
mfk534

2 Answers

35

You want rowMeans(), and note that it has an na.rm argument that you need to set to TRUE. E.g.:

> mat <- matrix(c(23,2,NA,NA,2,9,23,2,9), ncol = 3)
> mat
     [,1] [,2] [,3]
[1,]   23   NA   23
[2,]    2    2    2
[3,]   NA    9    9
> rowMeans(mat)
[1] NA  2 NA
> rowMeans(mat, na.rm = TRUE)
[1] 23  2  9

To match your example:

> dat <- data.frame(Trait = c("DF","DG","DH"), mat)
> names(dat) <- c("Trait", paste0("Col", 1:3))
> dat
  Trait Col1 Col2 Col3
1    DF   23   NA   23
2    DG    2    2    2
3    DH   NA    9    9
> dat <- transform(dat, Col4 = rowMeans(dat[,-1], na.rm = TRUE))
> dat
  Trait Col1 Col2 Col3 Col4
1    DF   23   NA   23   23
2    DG    2    2    2    2
3    DH   NA    9    9    9
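
If the data frame holds other non-numeric columns, selecting the columns to average by name can be safer than dat[,-1]. The following is a minimal sketch that rebuilds the example data so it runs on its own; the cols vector and the all_na data frame are illustrative additions, not part of the original question.

```r
# Rebuild the example data so this snippet is self-contained.
dat <- data.frame(
  Trait = c("DF", "DG", "DH"),
  Col1  = c(23, 2, NA),
  Col2  = c(NA, 2, 9),
  Col3  = c(23, 2, 9)
)

# Pick the columns to average by name instead of dropping column 1.
cols <- c("Col1", "Col2", "Col3")
dat$Col4 <- rowMeans(dat[, cols], na.rm = TRUE)
dat

# Edge case: a row where every selected value is NA gives NaN, not NA.
all_na <- data.frame(Col1 = NA_real_, Col2 = NA_real_, Col3 = NA_real_)
rowMeans(all_na, na.rm = TRUE)
```

The NaN result for an all-NA row is worth checking for if downstream code tests with is.na() versus is.nan().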
Gavin Simpson
  • rowMeans it is! Thanks for your time :] – mfk534 Sep 18 '12 at 23:29
  • This is (correct but) not generic enough to be worth remembering. I added another answer. – Azim Oct 27 '19 at 14:52
  • @azim There are over 300,000 questions in the [tag:r] tag here. Quite a few of those give `apply()` and co as a potential answer. The question is about row means; you don’t want to do that using `apply()` routinely as it is very slow in comparison. You’re answering a question that wasn’t asked here but has been asked many times elsewhere here. – Gavin Simpson Oct 27 '19 at 16:31
  • @GavinSimpson The question does not say anything about the execution time or the fact that the dataset is large. The main issue was how to get rid of NA's while computing the `mean`. Besides, you can always flip your dataframe and operate on columns instead of rows. Moreover, you can run things in parallel and run faster than any "fast" but sequential operation. Something that is quite easy with family of `apply` functions. My main issue with this answer is its limitation. I don't think it is realistic to have one row-function for every function that exists. – Azim Oct 28 '19 at 17:13
  • @Azim And the question doesn't say anything about doing anything but taking the means over rows while accounting for `NA`s. But that didn't stop you coming in here & telling people that the correct answer isn't "worth remembering". I care about execution time & given the choice one should use `rowMeans()` & `colMeans()` in code you write, esp if that is going to be used by other people. You might not think that it's important to have row- or col-wise special functions but others disagree; the *matrixStats* package specifically adds many common functions that are easily sped up like this. – Gavin Simpson Oct 28 '19 at 17:28
  • @GavinSimpson Please just let a second option be available to users. That's all. My answer was added ~7 years after yours was accepted. So just leave a comment about your concern and then let it go. You don't need to make it personal. I believe the `R` language, despite its strengths in many aspects, has some fundamental issues, one being that it comes up with a workaround for every single problem; `rowMeans` is just one good example of this kind. – Azim Oct 29 '19 at 14:00
7

Why NOT the accepted answer? The accepted answer is correct; however, it is specific to this particular task and does not generalize. What if we need, instead of the mean, other statistics such as var or skewness, or even a custom function?

A more flexible solution:

row_means <- apply(X=data[,-1], MARGIN=1, FUN=mean, na.rm=TRUE)  # drop the non-numeric Trait column first

More details on apply:

Generally, to apply any function (custom or built-in) to an entire dataset, column-wise or row-wise, use apply or one of its variations (sapply, lapply, ...). Its signature is:

apply(X, MARGIN, FUN, ...)

where:

  • X: The data, as a matrix or a data frame (a data frame is coerced to a matrix first).
  • MARGIN: The dimension over which the aggregation takes place. Use 1 for a row-wise operation and 2 for a column-wise operation.
  • FUN: The operation to be called on the data. Any pre-defined R function or any user-defined function can be used here.
  • ...: Further arguments passed on to FUN. na.rm is not an argument of apply itself; writing na.rm=TRUE forwards it to FUN (here mean), which then removes the NA values before computing.
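
To illustrate how the same call pattern extends beyond the mean, here is a minimal sketch using the numbers from the question. The helper n_obs is a hypothetical name introduced for illustration. Note that apply coerces a data frame to a matrix, so a character column such as Trait would turn every value into text; the sketch therefore works on a numeric matrix only.

```r
# Numeric part of the example data (rows: DF, DG, DH).
m <- matrix(c(23, 2, NA, NA, 2, 9, 23, 2, 9), ncol = 3)

# na.rm is forwarded to FUN through `...`; apply() itself has no such argument.
row_means <- apply(m, MARGIN = 1, FUN = mean, na.rm = TRUE)
row_vars  <- apply(m, MARGIN = 1, FUN = var,  na.rm = TRUE)

# A user-defined function (hypothetical helper) plugs in the same way:
# count how many non-missing values each row contributes.
n_obs <- function(x) sum(!is.na(x))
row_counts <- apply(m, MARGIN = 1, FUN = n_obs)

row_means
row_vars
row_counts
```

For plain row means, rowMeans(m, na.rm = TRUE) returns the same values and, as noted in the comments on the other answer, is typically faster.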

Why should I use apply?

For many reasons, including but not limited to:

  1. Any function can be easily plugged in to apply.
  2. For different preferences such as the input or output data types, other variations can be used (e.g., lapply for operations on lists).
  3. (Most importantly) It facilitates scalability, since there are versions of this function that allow parallel execution (e.g. mclapply from the {parallel} package).
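
As a rough sketch of the parallel variant mentioned in point 3, the row-wise mean can be written with mclapply by iterating over row indices. The n_cores value is an assumption chosen so the snippet also runs on Windows, where forking is unavailable; on Unix-alikes it could be raised. For an operation as cheap as a mean, the overhead of parallel dispatch may well outweigh any gain.

```r
library(parallel)  # ships with base R

m <- matrix(c(23, 2, NA, NA, 2, 9, 23, 2, 9), ncol = 3)

# Assumption: a single worker, which behaves like lapply() and works everywhere.
n_cores <- 1L

# mclapply() works on lists/vectors, so iterate over the row indices.
row_means <- unlist(mclapply(
  seq_len(nrow(m)),
  function(i) mean(m[i, ], na.rm = TRUE),
  mc.cores = n_cores
))

row_means
```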
Gavin Simpson
Azim