
I'm trying to create a subset of a data frame and when I do so, R switches the formatting of the date column. Any idea why or how to fix this?

> head(spyPr2)
        Date   Open   High    Low  Close    Volume Adj.Close
1 12/30/2011 126.02 126.33 125.50 125.50  95599000    125.50
2 12/29/2011 125.24 126.25 124.86 126.12 123507200    126.12
3 12/28/2011 126.51 126.53 124.73 124.83 119107100    124.83
4 12/27/2011 126.17 126.82 126.06 126.49  86075700    126.49
5 12/23/2011 125.67 126.43 125.41 126.39  92187200    126.39
6 12/22/2011 124.63 125.40 124.23 125.27 119465400    125.27
> spyPr2$Date <- as.Date(spyPr2$Date, format = "%m/%d/%Y")
> head(spyPr2)
        Date   Open   High    Low  Close    Volume Adj.Close
1 2011-12-30 126.02 126.33 125.50 125.50  95599000    125.50
2 2011-12-29 125.24 126.25 124.86 126.12 123507200    126.12
3 2011-12-28 126.51 126.53 124.73 124.83 119107100    124.83
4 2011-12-27 126.17 126.82 126.06 126.49  86075700    126.49
5 2011-12-23 125.67 126.43 125.41 126.39  92187200    126.39
6 2011-12-22 124.63 125.40 124.23 125.27 119465400    125.27
> spyPr2 <- data.frame(cbind(spyPr2$Date, spyPr2$Close, spyPr2$Adj.Close))
> str(spyPr2)
'data.frame':   1638 obs. of  3 variables:
 $ X1: num  15338 15337 15336 15335 15331 ...
 $ X2: num  126 126 125 126 126 ...
 $ X3: num  126 126 125 126 126 ...
> head(spyPr2)
     X1     X2     X3
1 15338 125.50 125.50
2 15337 126.12 126.12
3 15336 124.83 124.83
4 15335 126.49 126.49
5 15331 126.39 126.39
6 15330 125.27 125.27

UPDATE:

> spyPr2 <- data.frame(cbind(spyPr2["Date"], spyPr2$Close, spyPr2$Adj.Close))
Error in `[.data.frame`(spyPr2, "Date") : undefined columns selected
> spyPr2 <- data.frame(cbind(spyPr2[,"Date"], spyPr2$Close, spyPr2$Adj.Close))
Error in `[.data.frame`(spyPr2, , "Date") : undefined columns selected

UPDATE 2:

structure(list(Date = structure(c(15338, 15337, 15336, 15335, 
15331, 15330), class = "Date"), Open = c(126.02, 125.24, 126.51, 
126.17, 125.67, 124.63), High = c(126.33, 126.25, 126.53, 126.82, 
126.43, 125.4), Low = c(125.5, 124.86, 124.73, 126.06, 125.41, 
124.23), Close = c(125.5, 126.12, 124.83, 126.49, 126.39, 125.27
), Volume = c(95599000L, 123507200L, 119107100L, 86075700L, 92187200L, 
119465400L), Adj.Close = c(125.5, 126.12, 124.83, 126.49, 126.39, 
125.27)), .Names = c("Date", "Open", "High", "Low", "Close", 
"Volume", "Adj.Close"), row.names = c(NA, -6L), class = "data.frame")
Gavin Simpson
screechOwl
  • Have you tried using `[` selection instead of `$`? eg `spyPr2["Date"]` – James Jan 20 '12 at 16:29
  • Show us the results of `dput(head(spyPr2))` so we don't have to go to trouble of creating our own data to see what is happening. I suspect the default `cbind()` method is the problem here but would like to run code on my own machine to check. – Gavin Simpson Jan 20 '12 at 16:38
  • No, sorry, I meant the output from `dput()` *before* you process it. I.e. give us your input data (but we only need the 6 lines you show). – Gavin Simpson Jan 20 '12 at 16:45

2 Answers

Obvious answer is don't do subsetting like that! Use the appropriate tools. What is wrong with

spyPr2.new <- spyPr2[, c("Date", "Close", "Adj.Close")]

?

To explain the behaviour you are seeing, you need to understand what $ returns and how cbind() works. cbind() is one of those oddities in R wherein method dispatch is not done via the usual mechanism but is instead handled by special code buried in the internals of R. This is all the R code behind cbind():

> cbind
function (..., deparse.level = 1) 
.Internal(cbind(deparse.level, ...))
<bytecode: 0x24fa0c0>
<environment: namespace:base>

Not much help, eh? There are methods for data frames and "ts" objects however:

> methods(cbind)
[1] cbind.data.frame cbind.ts*       

   Non-visible functions are asterisked

Before I do the reveal, also note what $ returns (dat2 is your 6 lines of data after converting Date to a "Date" object):

> str(dat2$Date)
 Date[1:6], format: "2011-12-30" "2011-12-29" "2011-12-28" "2011-12-27" ...

This is a "Date" object, which is a special vector really.

> class(dat2$Date)
[1] "Date"

The key thing is that it is not a data frame. So when you use cbind(), the internal code sees three vectors and creates a matrix.

> (c1 <- cbind(dat2$Date, dat2$Close, dat2$Adj.Close))
      [,1]   [,2]   [,3]
[1,] 15338 125.50 125.50
[2,] 15337 126.12 126.12
[3,] 15336 124.83 124.83
[4,] 15335 126.49 126.49
[5,] 15331 126.39 126.39
[6,] 15330 125.27 125.27
> class(c1)
[1] "matrix"

A matrix in R has one storage type for all of its cells and no per-column class, so the Date object is reduced to its underlying numeric vector (days since 1970-01-01):

> as.numeric(dat2$Date)
[1] 15338 15337 15336 15335 15331 15330

to allow cbind() to produce a numeric matrix.
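
As a small sketch of what that conversion means: a "Date" is stored as a count of days since 1970-01-01 plus a class attribute, so numbers like 15338 can be turned back into dates if the class has already been lost. The variable names here are made up for illustration.

```r
# A Date is a number of days since 1970-01-01 with a class attribute on top.
d <- as.Date("12/30/2011", format = "%m/%d/%Y")
n <- as.numeric(d)   # the bare day count that ends up in the matrix

# If only the number is left, the Date can be rebuilt from it.
restored <- as.Date(n, origin = "1970-01-01")

print(n)
print(restored)
```

This only repairs the symptom; keeping the column in a data frame in the first place avoids the round trip.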

You can force the use of the data frame method by calling it explicitly and it does know how to handle "Date" objects and so doesn't do any conversion:

> cbind.data.frame(dat2$Date, dat2$Close, dat2$Adj.Close)
   dat2$Date dat2$Close dat2$Adj.Close
1 2011-12-30     125.50         125.50
2 2011-12-29     126.12         126.12
3 2011-12-28     124.83         124.83
4 2011-12-27     126.49         126.49
5 2011-12-23     126.39         126.39
6 2011-12-22     125.27         125.27
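
The column names that come out of that call are awkward. One possible way around it, sketched here with a small stand-in for dat2 built from the first rows of the question's data, is to name the arguments, or to skip cbind() and call data.frame() directly. Both should keep the "Date" class.

```r
# Stand-in for dat2 (first three rows of the question's data).
dat2 <- data.frame(
  Date = as.Date(c("2011-12-30", "2011-12-29", "2011-12-28")),
  Close = c(125.50, 126.12, 124.83),
  Adj.Close = c(125.50, 126.12, 124.83)
)

# Named arguments become the column names.
a <- cbind.data.frame(Date = dat2$Date, Close = dat2$Close)

# data.frame() builds the same columns without going through cbind().
b <- data.frame(Date = dat2$Date, Close = dat2$Close)

str(a)
class(b$Date)
```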

However, all the explanation aside, you are trying to do the subsetting in a very complex manner. [ as a subset function works just fine:

> dat2[, c("Date", "Close", "Adj.Close")]
        Date  Close Adj.Close
1 2011-12-30 125.50    125.50
2 2011-12-29 126.12    126.12
3 2011-12-28 124.83    124.83
4 2011-12-27 126.49    126.49
5 2011-12-23 126.39    126.39
6 2011-12-22 125.27    125.27

subset() is also an option but not needed here:

> subset(dat2, select = c("Date", "Close", "Adj.Close"))
        Date  Close Adj.Close
1 2011-12-30 125.50    125.50
2 2011-12-29 126.12    126.12
3 2011-12-28 124.83    124.83
4 2011-12-27 126.49    126.49
5 2011-12-23 126.39    126.39
6 2011-12-22 125.27    125.27
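
To confirm that a subset kept its column types, checking the class of each column is usually enough. A minimal sketch, again using a made-up two-row stand-in for dat2:

```r
# Two-row stand-in for dat2.
dat2 <- data.frame(
  Date = as.Date(c("2011-12-30", "2011-12-29")),
  Open = c(126.02, 125.24),
  Close = c(125.50, 126.12),
  Adj.Close = c(125.50, 126.12)
)

keep <- c("Date", "Close", "Adj.Close")
by_bracket <- dat2[, keep]
by_subset <- subset(dat2, select = keep)

# One class per column; Date should still be "Date", not "numeric".
sapply(by_bracket, class)
sapply(by_subset, class)
```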
Gavin Simpson
  • That did it. Thanks very much. Didn't even think of doing it that way. – screechOwl Jan 20 '12 at 16:51
  • cbind.data.frame is helpful for merging date columns in other contexts as well. – etov Jul 17 '14 at 07:32
  • Awesome! And thank you for providing the `cbind.data.frame` method. I had to do a cbind on columns from different tables, hence could not use the subsetting method. – Yohan Obadia May 11 '16 at 10:38

I think I might call this a hidden instance of the drop = FALSE gotcha with data frames.

When you use cbind, it only uses the data frame method if at least one of the components is a data frame. Otherwise, everything is converted to a single type in order to construct a matrix.

Thus, calling cbind on elements like spyPr2$Date or spyPr2[,'Date'] will result in a matrix (losing the date structure), which will not be magically restored by wrapping it all in data.frame.

You can get a data frame back from cbind if you use [ to select each column, but only with drop = FALSE, which stops R from reducing a single column to a vector (a vector would land you right back where you started, with R coercing the result to a matrix):

cbind(spyPr2[,'Date',drop = FALSE],spyPr2[,'Close'])

is sufficient, since you only need one of the components to be a data frame.
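
A short sketch of the difference, using a hypothetical two-row data frame called spy in place of spyPr2:

```r
# Hypothetical stand-in for spyPr2.
spy <- data.frame(
  Date = as.Date(c("2011-12-30", "2011-12-29")),
  Close = c(125.50, 126.12)
)

# Default drop = TRUE: a single column comes back as a bare vector.
v <- spy[, "Date"]
# drop = FALSE: the result stays a one-column data frame.
df1 <- spy[, "Date", drop = FALSE]

# Only vectors: cbind() builds a numeric matrix and the Date class is lost.
m <- cbind(v, spy[, "Close"])
# One data frame among the arguments: the data frame method is used.
out <- cbind(df1, Close = spy[, "Close"])

is.matrix(m)
class(out$Date)
```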

But Gavin is right in general: you shouldn't be subsetting your data frame this way.

joran