I am reading data from a .csv file using data.table::fread on a Windows 10 computer. The data reads in properly through read.csv; however, when I use fread, the value in the final column of every row of the resulting data.table ends in \r, presumably a carriage return. This causes numeric fields to be read as character. (Instead of the numeric value 4.53, a row-ending cell contains the string 4.53\r.)

Why does this happen? Is there a way to resolve it directly in the call to fread?

Update

I get the following output when I set verbose = TRUE:

Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000001 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 7 columns. Longest stretch was from line 1 to line 13
Starting data input on line 1 (either column names or first row of data). First 10 characters: subjectNum
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 13 (including 1 at the end)
Count of sep: 72
nrow = MIN( nsep [72] / ncol [7] -1, neol [13] - nblank [1] ) = 12
Type codes (   first 5 rows): 1131414
Type codes: 1131414 (after applying colClasses and integer64)
Type codes: 1131414 (after applying drop or select (if supplied)
Allocating 7 column slots (7 - 0 dropped)
Read 12 rows. Exactly what was estimated and allocated up front
   0.000s (  0%) Memory map (rerun may be quicker)
   0.001s ( 33%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.002s ( 67%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 12x7 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.003s        Total
Bob
  • You can try to make a reproducible example, like `fread("a\n1\r\n2\r\n")` maybe? In this case, the end-of-line indicators are inconsistent, leading to the behavior you see. – Frank Jun 16 '16 at 22:49
  • This indeed leads to the error in R. When I read the file in Notepad++, the file has an LF only on the first line and CR LF (\r\n) on subsequent lines. Please feel free to submit an answer so I can accept your answer. – Bob Jun 16 '16 at 23:04
  • Do you know if this is a common occurrence with .csv files? – Bob Jun 16 '16 at 23:05

1 Answer

If you have a file that looks like x="a\n1\r\n2\r\n", then fread(x) will give the result described:

     a
1: 1\r
2: 2\r

This occurs because the end-of-line indicators are inconsistent across lines.
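
To check whether a given file really mixes line endings, one option is to count them from the raw bytes. The following is only a sketch in base R: `count_eol` is a hypothetical helper name, not part of data.table, and the temporary file simply reproduces the `"a\n1\r\n2\r\n"` example above.

```r
# Hypothetical helper (not a data.table function): count CRLF vs bare LF
# line endings by inspecting the raw bytes of a file.
count_eol <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  lf <- which(bytes == as.raw(0x0a))               # positions of every \n
  # A \n is part of a CRLF pair if the byte just before it is \r
  prev_is_cr <- lf > 1 & bytes[pmax(lf - 1, 1)] == as.raw(0x0d)
  c(crlf = sum(prev_is_cr), lf_only = sum(!prev_is_cr))
}

# Demo: write the example from this answer to a temporary file, byte for byte
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("a\n1\r\n2\r\n"), tmp)
counts <- count_eol(tmp)
print(counts)
```

If both counts are non-zero, the file has inconsistent end-of-line markers, which matches the situation described in the comments (LF on the header line, CR LF on the rest).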

I have heard of this happening to others, but I'm not sure where it comes from or whether there is a better way to address it than "fixing" the file, probably with a command-line tool.
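
If a command-line tool is not convenient, the same "fix the file" idea can be approximated inside R by normalising the line endings before parsing. This is a sketch under assumptions, not a built-in fread option: `normalize_eol` is a hypothetical helper, the file contents are made up for illustration, and the fread call relies on the behaviour shown in the verbose log in the question (input containing `\n` is treated as data rather than as a filename).

```r
# Hypothetical helper: read a file as one string and turn every CRLF into LF.
normalize_eol <- function(path) {
  size <- file.info(path)$size
  txt <- readChar(path, nchars = size, useBytes = TRUE)
  gsub("\r\n", "\n", txt, fixed = TRUE, useBytes = TRUE)
}

# Demo file with an LF-terminated header and CRLF-terminated data rows
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n1,4.53\r\n2,5.10\r\n"), tmp)

clean <- normalize_eol(tmp)

# Base R parse of the cleaned text, to confirm the last column is numeric
df <- read.csv(text = clean)
str(df)

# With data.table installed, the cleaned string can be passed to fread directly
if (requireNamespace("data.table", quietly = TRUE)) {
  print(data.table::fread(clean))
}
```

This reads the whole file into memory as a single string, so it is only reasonable for small files like the 12-row one in the question; for large files, converting the file on disk first is probably the better route.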

Frank
  • I think I saw someone with this problem on the mailing list or github, but can't find the link. – Frank Jun 16 '16 at 23:56
  • It happened to me when I created data with Python on Windows and hard-coded `\n` in the header line. [python os.linesep](http://stackoverflow.com/questions/1223289) – user3226167 Feb 09 '17 at 01:44