
Suppose a data frame df has a column speed. What is the difference between accessing the column like so:

df["speed"]

or like so:

df$speed

The following calculates the mean value correctly:

lapply(df["speed"], mean) 

But this prints all values under the column speed:

lapply(df$speed, mean)
Claus Wilke
  • 4
    `df["speed"] ` returns you a dataframe whereas `df$speed` returns a numeric vector. Check `class(df["speed"] )` and `class(df$speed)` and most probably you just need `mean(df$speed)`, no need of `lapply`. – Ronak Shah Dec 27 '17 at 05:29
  • 2
    not sure why is this question downvoted – Hardik Gupta Dec 27 '17 at 05:44
  • 1
    @RichScriven So is [this question](https://stackoverflow.com/questions/1169248/r-function-for-testing-if-a-vector-contains-a-given-element). – Vincent Dec 27 '17 at 07:18
  • 1
    Pooja, it might be informative to read [`help('[.data.frame')`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.data.frame.html). Note that `df["speed"]` is similar to `df[,"speed",drop=FALSE]`, and `df$speed` is similar to `df[,"speed"]` and `df[["speed"]]`. Confusing, perhaps, but documented. – r2evans Dec 27 '17 at 07:32
  • 1
    I think it's a fair question. It's clear and answerable, and it's one of those quirks of the R language. Unless somebody argues that it's been asked before already I vote to leave open. – Claus Wilke Dec 27 '17 at 19:05

2 Answers

There are two elements to the question in the OP. The first element was addressed in the comments: df["speed"] is an object of class data.frame (a data frame with a single column), whereas df$speed is a numeric vector. We can see this via the str() function.

We'll illustrate this with Ezekiel's 1930 analysis of speed and stopping distance, the cars data set from the datasets package.

> library(datasets)
> data(cars)
> 
> str(cars["speed"])
'data.frame':   50 obs. of  1 variable:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
> str(cars$speed)
 num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
> 
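To make the difference between the access styles concrete, here is a small sketch using only base R and the cars data. The variable names (as_df, as_vec and so on) are mine, chosen for illustration; the equivalences follow the ones mentioned in the comments on the question.

```r
data(cars)

as_df   <- cars["speed"]                  # single bracket: keeps the data frame
as_df2  <- cars[, "speed", drop = FALSE]  # matrix-style indexing, dropping disabled
as_vec  <- cars$speed                     # dollar: extracts the column itself
as_vec2 <- cars[["speed"]]                # double bracket: same extraction

class(as_df)                  # "data.frame"
class(as_vec)                 # "numeric"
all.equal(as_df, as_df2)      # TRUE
identical(as_vec, as_vec2)    # TRUE
```

In short, the single bracket preserves the container, while $ and [[ ]] reach inside it.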

The second element that was not addressed in the comments is that lapply() behaves differently when passed a vector versus a list().

With a vector, lapply() processes each element in the vector independently, producing unexpected results for a function such as mean().

> unlist(lapply(cars$speed,mean))
 [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
[26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

What happened?

Since each element of cars$speed is processed by mean() independently, lapply() returns a list of 50 means of 1 number each: the original elements in the cars$speed vector.
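A quick way to convince yourself of this is sketched below, again with base R only. The name per_element is just an illustrative choice.

```r
data(cars)

per_element <- lapply(cars$speed, mean)  # one call to mean() per value
length(per_element)                      # 50, one list element per row

# Each "mean" of a single number is that number, so nothing has changed.
all(unlist(per_element) == cars$speed)   # TRUE

# For a single column, calling mean() on the vector directly is enough.
mean(cars$speed)                         # 15.4
```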

Processing a list with lapply()

With a list, each element of the list is processed independently. We can calculate how many items will be processed by lapply() with the length() function.

> length(cars["speed"])
[1] 1
>

Since a data frame is also a list, and cars["speed"] contains exactly one element (the numeric vector speed), the length() function returns the value 1. Therefore, when the data frame is processed by lapply(), mean() is called once on the whole column, and a single mean is calculated rather than one per row of the speed column.

> lapply(cars["speed"],mean)
$speed
[1] 15.4

> 

If we pass the entire cars data frame as the input object for lapply(), we obtain one mean per column in the data frame, since both variables in the data frame are numeric.

> lapply(cars,mean)
$speed
[1] 15.4

$dist
[1] 42.98

> 
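If a list is not the shape you want, the same per-column calculation can be sketched with sapply(), vapply() or colMeans(), all of which are in base R. The name col_means is only for illustration, and this is one possible approach rather than the only one.

```r
data(cars)

# sapply() simplifies the list returned by lapply() to a named numeric vector.
col_means <- sapply(cars, mean)
col_means                                  # speed 15.40, dist 42.98

# vapply() does the same but checks that every result is a single number.
vapply(cars, mean, FUN.VALUE = numeric(1))

# colMeans() is a dedicated shortcut for all-numeric data frames.
colMeans(cars)
```

vapply() is the stricter choice in scripts, because it fails loudly if mean() ever returns something other than one number per column.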

A theoretical perspective

The differing behaviors of lapply() are explained by the fact that R is an object-oriented language. In fact, John Chambers, creator of the S language on which R is based, once said:

In R, two slogans are helpful.

-- Everything that exists is an object, and
-- Everything that happens is a function call.

John Chambers, quoted in Advanced R, p. 79.

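The fact that lapply() works differently on a data frame than on a vector can be seen as an illustration of the object-oriented feature of polymorphism, where the same operation is implemented in different ways for different types of objects: lapply() iterates over the elements of its input, and the elements of an atomic vector are its individual values, whereas the elements of a data frame are its columns.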
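One way to see what lapply() will treat as "an element" is to look at each input as a list. The sketch below assumes only base R; my understanding is that the result of lapply() matches what you would get by first converting the input with as.list().

```r
data(cars)

is.list(cars$speed)              # FALSE: an atomic (numeric) vector
is.list(cars["speed"])           # TRUE: a data frame is a list of columns

# Converting each input to a list shows what lapply() iterates over.
length(as.list(cars$speed))      # 50: one element per value
length(as.list(cars["speed"]))   # 1: one element per column
names(as.list(cars["speed"]))    # "speed"
```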

Len Greski
  • 1
    THX for supporting new R and SO users to learn R by posting a tutorial-alike answer. I think we should care about new R users by guiding them to right documentation ("learning to learn" :-) – R Yoda Dec 27 '17 at 10:30
  • 1
    @RYoda -- I agree we should guide people to the right documentation, but some of the R documentation is very cryptic, e.g. [help for the Extract Operator](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.html). Another challenge for new R users is that they often don't know what questions to ask or how to ask them, as evidenced by all the comments referencing [How to Create a Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help) on questions posted by people with <100 reputation. Tutorial style answers help people see solutions through working software. – Len Greski Dec 27 '17 at 10:49

While this looks like a beginner's question, I think it's worth answering, since many beginners could have a similar question and a guide to the corresponding documentation is helpful IMHO.

No up-votes please - I am just collecting the comment fragments from the question that contribute to the answer - feel free to edit this answer...

  1. A data.frame is a list of vectors with the same length (number of elements). Please read the help in the R console (by typing ?data.frame)

  2. The $ operator returns one column as a vector (?"$.data.frame")

  3. lapply applies a function to each element of a list (see ?lapply). If the first param X is an atomic vector (integer, double...) with multiple elements, each element of the vector is treated as one separate list element (as if X had been converted with as.list(), e.g. as.list(1:26))

Examples:

x <- data.frame(a = LETTERS, b = 1:26, stringsAsFactors = FALSE)
b.vector <- x$b
b.data.frame <- x["b"]
class(b.vector)       # integer
class(b.data.frame)   # data.frame

lapply(b.vector, mean)
# returns a result list with 26 list elements, the same as `lapply(1:26, mean)`
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 2
# ... up to list element 26

lapply(b.data.frame, mean)
# returns a list with one element per column of the data frame;
# here there is a single column, so mean() is called once on all 26 values
# $b
# [1] 13.5

So IMHO your original question can be reduced to: Why does lapply behave differently if the first parameter is an atomic vector instead of a list?
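A small sketch that may help with that reduced question: wrapping the vector in a list makes lapply see it as a single element again. The names per_value and wrapped are mine, used only for illustration.

```r
b.vector <- 1:26

# Passed directly, the vector contributes 26 elements, one per value.
per_value <- lapply(b.vector, mean)
length(per_value)                  # 26

# Wrapped in a list, the whole vector is one element, so mean() runs once.
wrapped <- lapply(list(b = b.vector), mean)
length(wrapped)                    # 1
wrapped$b                          # 13.5
```

This is effectively what x["b"] does for you, since a one-column data frame is a list holding that one vector.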

R Yoda