
R 3.1.0 is out and one of the new features is the following:

type.convert() (and hence by default read.table()) returns a character vector or factor when representing a numeric input as a double would lose accuracy. Similarly for complex inputs.

To give an example:

df <- read.table(text = "num1 num2
1.1 1.1234567890123456
2.2 2.2
3.3 3.3", header = TRUE)

sapply(df, class)
#      num1      num2 
# "numeric"  "factor"

while with previous versions, read.table would have returned two numeric columns.

For those who, like me, are concerned about this change: what can be done to preserve the old behavior?

Note: I'd like a general solution that does not make assumptions on the input data, i.e. do not suggest I use colClasses = "numeric" in the example above. Thanks.

flodel
  • Open the R 3.0.3 tarball, extract the relevant code, package it as 'myread.table', ... – Dirk Eddelbuettel Apr 15 '14 at 01:15
  • @Dirk, Note that the relevant routine, `type.convert`, is written in C, not R, so that is not as straightforward as if it were written in R. – G. Grothendieck Apr 15 '14 at 03:43
  • What exactly is the issue here? Previously your values would have been truncated, and now you are at least implicitly notified when that would have occurred. If you want them in truncated form, just run `as.numeric` on the variable after loading. – Thomas Apr 15 '14 at 05:04
  • @Thomas, keep in mind that I am looking for a general solution: imagine I have a file with thousands of rows and columns and no a priori knowledge of what type of data each column holds. It was `type.convert`'s job to tell me if a column was numeric or not, and convert it. Now, when I get a factor, I have no easy way of telling if it is because the column contained characters ("apple") or long numerics (1.1234567890123456). That's a problem. – flodel Apr 15 '14 at 11:06
  • Another problem is that a lot of my code processing files might stop working or, worse, start reporting bogus data without warning. Why? Because long numbers that used to be converted to numeric are now read as factors, hence converted to integers when processed in a numeric context. That's very bad. – flodel Apr 15 '14 at 11:08
  • I agree with you on the reproducibility point, which is definitely a big issue! It potentially breaks old code, but I don't think it's bad behavior for new situations. I'd rather find out that I'm losing precision than have that precision silently discarded. – Thomas Apr 15 '14 at 11:47
  • I've also found this change is impacting numeric data returned from RODBC queries, and at this time there doesn't appear to be a colClasses option for these functions. – user338714 Apr 15 '14 at 16:35

3 Answers


In version 3.1.1, there is this change listed in the News file:

type.convert(), read.table() and similar read.*() functions get a new numerals argument, specifying how numeric input is converted when its conversion to double precision loses accuracy. The default numerals = "allow.loss" allows accuracy loss, as in R versions before 3.1.0.
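
As a quick sketch of how the new argument might be used (assuming R >= 3.1.1; "warn.loss" and "no.loss" are the other documented values):

# numerals = "allow.loss" is the default and restores the pre-3.1.0 behavior
df <- read.table(text = "num1 num2
1.1 1.1234567890123456
2.2 2.2
3.3 3.3", header = TRUE, numerals = "allow.loss")

sapply(df, class)
#      num1      num2 
# "numeric" "numeric"

# numerals = "warn.loss" does the same conversion but signals a warning
# whenever accuracy is lost in the conversion to double.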

Much of the post-release discussion about the original change, including the decision to revert the default behavior and offer an optional warning instead, can be found in a thread on the developers' email list.

For version 3.1.0, code will have to be modified to get the old behavior. Switching to 3.1.1 is another strategy.
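
One possible modification under 3.1.0 (a rough sketch rather than an established recipe; refactor_numerics is just an illustrative name, and it assumes the default stringsAsFactors = TRUE so affected columns come back as factors) is to post-process the result of read.table() and convert back any factor column whose levels all parse as numbers, accepting the accuracy loss as versions before 3.1.0 did:

refactor_numerics <- function(df) {
  fix_col <- function(x) {
    if (!is.factor(x)) return(x)
    num <- suppressWarnings(as.numeric(levels(x)))
    if (any(is.na(num))) return(x)  # genuinely non-numeric levels (e.g. "apple"): keep the factor
    num[x]                          # the usual levels-based factor-to-numeric conversion
  }
  df[] <- lapply(df, fix_col)
  df
}

df <- read.table(text = "num1 num2
1.1 1.1234567890123456
2.2 2.2
3.3 3.3", header = TRUE)

sapply(refactor_numerics(df), class)
#      num1      num2 
# "numeric" "numeric"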

The mention of this change for version 3.1.0 (from the same News file) says

type.convert() (and hence by default read.table()) returns a character vector or factor when representing a numeric input as a double would lose accuracy. Similarly for complex inputs.

If a file contains numeric data with unrepresentable numbers of decimal places that are intended to be read as numeric, specify colClasses in read.table() to be "numeric".
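
Applied to the question's example, that would look something like this (a sketch that assumes every column is known in advance to be numeric, which is exactly the assumption the question wants to avoid):

df <- read.table(text = "num1 num2
1.1 1.1234567890123456
2.2 2.2
3.3 3.3", header = TRUE, colClasses = "numeric")  # recycled over both columns

sapply(df, class)
#      num1      num2 
# "numeric" "numeric"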

Note: the original answer was written when the applicable version with the fix was R 3.1.0 patched; it has been updated now that 3.1.1 has been released.

Brian Diggs

Try using data.table's fread:

# create test data set "a.dat"
Lines <- "num1 num2\n1.1 1.1234567890123456\n2.2 2.2\n3.3 3.3\n"
cat(Lines, file = "a.dat")

#####

library(data.table)

DT <- fread("a.dat")
str(DT)
## Classes ‘data.table’ and 'data.frame':  3 obs. of  2 variables:
## $ num1: num  1.1 2.2 3.3
## $ num2: num  1.12 2.2 3.3
## - attr(*, ".internal.selfref")=<externalptr> 

class(DT)
## [1] "data.table" "data.frame"

DF <- as.data.frame(DT) 
class(DF)
## [1] "data.frame"

ADDED LATER: Since this answer was posted, the latest patched version of R 3.1.0 has come out; by default it reverts to the old behavior, with a new numerals argument to specify it differently. See type.convert and read.table.

G. Grothendieck

Since I don't have the rep to comment on Brian Diggs's answer: for future reference, the new argument is now called "numerals" (not "exact"). From http://cran.r-project.org/bin/windows/base/NEWS.R-3.1.0patched.html:

type.convert(), read.table() and similar read.*() functions get a new numerals argument, specifying how numeric input is converted when its conversion to double precision loses accuracy. The default numerals = "allow.loss" allows accuracy loss, as in R versions before 3.1.0.

tom m