3

I have two very similar csv files. Stock prices for 2 different stocks downloaded from the same source in the same format. However, read.csv in R is reading them differently.

> tab1=read.csv(path1)
> tab2=read.csv(path2)

> head(tab1)
        Date   Open   High    Low  Close  Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461

> head(tab2)
       Date  Open  High   Low Close  Volume Adj.Close
1 12/1/2014 73.39 75.20 71.75 72.29 1561400  71.92211
2 11/3/2014 69.28 74.92 67.88 73.74 1421600  72.97650
3 10/1/2014 66.18 74.95 63.42 69.21 1775400  68.49341
4  9/2/2014 68.34 68.57 65.49 66.32 1249200  65.63333
5  8/1/2014 67.45 68.99 65.88 68.26 1655400  67.20743
6  7/1/2014 64.07 69.50 63.09 67.46 1733600  66.41976

If I try to use colClasses in read.csv then the dates for the second table are read incorrectly.

> tab1=read.csv(path1,colClasses=c("Date",rep("numeric",6)))
> tab2=read.csv(path2,colClasses=c("Date",rep("numeric",6)))

> head(tab1)
        Date   Open   High    Low  Close  Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461

> head(tab2)
        Date  Open  High   Low Close  Volume Adj.Close
1 0012-01-20 73.39 75.20 71.75 72.29 1561400  71.92211
2 0011-03-20 69.28 74.92 67.88 73.74 1421600  72.97650
3 0010-01-20 66.18 74.95 63.42 69.21 1775400  68.49341
4 0009-02-20 68.34 68.57 65.49 66.32 1249200  65.63333
5 0008-01-20 67.45 68.99 65.88 68.26 1655400  67.20743
6 0007-01-20 64.07 69.50 63.09 67.46 1733600  66.41976

Not sure how I can make this issue reproducible without attaching the .csv files. I'm attaching snapshots of the two files. Any help will be appreciated. tab1 tab2

Thanks

Abhirup Datta
  • 103
  • 2
  • 12
  • 1
    How about providing the first few lines of the .csv files copied from a text editor? – talat Apr 28 '15 at 19:54
  • 1
    The dates in the two csv files don't appear to be in the same format. Read the date column in as character and then you can convert to date format once the file is loaded. For example `tab2$Date = as.Date(tab2$Date, format="%m/%d/%Y")`. – eipi10 Apr 28 '15 at 19:56
  • The dates in the two csv files seem to be in the same format. Not sure how to test this, but in excel I tried doing cell entry +1 from one cell from each table with date 12/1/2014 and both showed 12/2/2014. I have also attached snapshots of the two csv s in the post. – Abhirup Datta Apr 28 '15 at 20:08

1 Answers1

3

This can be solved by reading in the dates as a character vector and then calling strptime() inside transform():

transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:

readStockData <- function(path) {
    tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
    tab$Date <- as.Date(tab$Date,if (grepl('^\\d+/\\d+/\\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
    tab;
};
readStockData(path1);
##         Date   Open   High    Low  Close  Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461
readStockData(path2);
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]) can be used for the detection.

bgoldst
  • 34,190
  • 6
  • 38
  • 64
  • That works for the second table but not for the first. Unfortunately I have many such table to read and not sure how to automate the reading. `head(transform(read.csv(path1,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y'))))` ## Date Open High Low Close Volume Adj.Close ##1 158.35 162.92 157.12 157.12 2719100 156.1488 ##2 153.14 160.86 152.98 160.09 2243400 159.1004 – Abhirup Datta Apr 28 '15 at 20:14
  • Yeah the edited version works ... thanks a ton ... any idea why this weird issue comes up ? R or csv which one is to blame ? – Abhirup Datta Apr 28 '15 at 20:28
  • Since the date format in your second CSV file is ambiguous (could be `%d/%m/%Y` or `%m/%d/%Y`), there's no way for R to know how to convert it to Date, thus, at least in this particular case, R cannot be blamed. On the other hand, one could argue that the designers of R could have put more effort into doing dynamic date format detection out-of-the-box, since I believe `as.Date()` can only handle `%Y-%m-%d`, but maybe with some flexibility on the separator character. But I wouldn't blame R; importing data into any context usually requires strictness and consistency of input format. – bgoldst Apr 28 '15 at 20:33
  • 2
    Just realized that it is also important to point out that Excel does some fairly aggressive dynamic detection of date formats, and then displays all converted dates using the same format, so even though the underlying CSV files clearly have a different date format, Excel will mask that. You can use a lower-level text editor or data dump utility to view the raw CSV data to prove this. Thus, in case this wasn't clear in my above comment, the CSV data is to blame for the inconsistency. – bgoldst Apr 28 '15 at 20:55