
Let's say I have this txt file:

"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2

With read.csv I can:

> read.csv("linktofile.txt", fill=T, header=F)
  V1 V2 V3 V4 V5 V6 V7
1 AA  3  3  3  3 NA NA
2 CC ad  2  2  2  2  2
3 ZZ  2 NA NA NA NA NA
4 AA  3  3  3  3 NA NA
5 CC ad  2  2  2  2  2

However, fread gives

> library(data.table)

> fread("linktofile.txt")
   V1 V2 V3 V4 V5 V6 V7
1: CC ad  2  2  2  2  2

Can I get the same result with fread?

Nicholas Post
nigmastar

2 Answers


Major update

It looks like development plans for fread changed and fread has now gained a fill argument.

Using the same sample data from the end of this answer, here's what I get:

library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
#    V1 V2 V3 V4 V5 V6 V7
# 1: AA  3  3  3  3 NA NA
# 2: CC ad  2  2  2  2  2
# 3: ZZ  2 NA NA NA NA NA
# 4: AA  3  3  3  3 NA NA
# 5: CC ad  2  2  2  2  2

Install the development version of "data.table" with:

install.packages("data.table", 
                 repos = "https://Rdatatable.github.io/data.table", 
                 type = "source")

Original answer

This doesn't answer your question about fread; that question has already been addressed by @Matt.

It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.

Unlike with fread, you will have to help these functions out a little by providing some information about the data you are trying to read.

You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.

library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
           col_types = rep("character", max(count.fields(x, ","))))
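
Since the call above reads every field as character, here is a small follow-up sketch (the res name is purely illustrative) that converts the columns back to their natural types with utils::type.convert:

# dstrsplit with all-character col_types returns a data.frame of
# character columns; convert each one back (numeric fields become numeric)
res <- input.file(x, formatter = dstrsplit, sep = ",",
                  col_types = rep("character", max(count.fields(x, ","))))
res[] <- lapply(res, type.convert, as.is = TRUE)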

Sample data

x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")

## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
A5C1D2H2I1M1N2O1R2T1

Not currently; I wasn't aware of read.csv's fill feature. What was on the plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but not padded with NA.

Could you add it to the list please? That way you'll get notified when its status changes.

Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.

UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented, rather than filled into separate columns as read.csv can do.
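
fread can't do that yet, but just to illustrate the list-column idea, here is a minimal sketch in base R plus data.table (the file name is the one from the question; the column name fields is purely illustrative):

library(data.table)

# Split each raw line on commas; every row's fields end up as a
# character vector stored in a single list cell
lines  <- readLines("linktofile.txt")
fields <- strsplit(gsub('"', "", lines, fixed = TRUE), ",", fixed = TRUE)
DT <- data.table(fields = fields)
DT$fields[[1]]
# [1] "AA" "3"  "3"  "3"  "3"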

Matt Dowle
  • All industry data flows in the UK utility sector are like that example. Millions of text files are sent between the parties, containing different kinds of records (rows to then be inserted into different tables; above, `AA`, `CC`, `ZZ` could be names of different tables). All records inside a file relate to the same industry process (so they are sent together, but also to save space), and once split you want to create primary and secondary keys in SQL, or use `roll` with the file name and line import number in `data.table` (thanks a LOT for that!) – Michele Sep 03 '13 at 23:51
  • @Michele Thanks for the great info. Are those files large; i.e., is there a speed issue with `read.csv` on them? – Matt Dowle Sep 04 '13 at 09:29
  • When I perform a data import I generally have thousands of files (with 50 to 200,000 records each) at the beginning of a new project (after that it's just importing the newly arriving files every day). `read.csv` saves me a lot of time and is way faster than our official T-SQL import scripts (I do/can use R only for prototyping). The only issue is: I put `read.csv` in a `for` loop, and in each cycle the file is stored entirely in a list item. After a few thousand files the process (I've got a progress bar with average speed) gets much slower, and I guess it's the list becoming huge... [continue] – Michele Sep 04 '13 at 09:38
  • I update the list with [this](http://stackoverflow.com/questions/9031819/add-named-vector-to-a-list/12978667#12978667) `lappend` function, so copies are made every time. Besides improving the list update, I just thought of using `fread` to improve the time of the 'pure' import from csv to whatever `R` object. – Michele Sep 04 '13 at 09:41
  • @Michele It's going to be difficult to change `fread` to cope with such files. It's optimized for a regular delimited format where each row has the same number of columns. The right number of columns is allocated up front for the right number of rows, etc. But such files could perhaps be read into `list` columns. – Matt Dowle Sep 04 '13 at 10:25
  • I understand. Essentially what `fill` does is like setting the number of columns to `max(count.fields())`, even though it uses something different and looks only at the first x rows, but I use the above to override that (see the sketch after these comments). I'll try to explore some other ways then, thanks again!! – Michele Sep 04 '13 at 10:45
  • @Michele, FYI: `fread` now has a `fill` argument in the development version. – A5C1D2H2I1M1N2O1R2T1 Mar 09 '16 at 11:01
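
A minimal sketch of the override Michele describes above (the file name is the one from the question): read.csv is forced to allocate a column for the longest record via col.names built from max(count.fields()), and the rows can then be split by record type for the kind of per-table processing mentioned in the first comment:

# Allocate enough columns for the longest line, then split the rows
# by their record type (first field): "AA", "CC", "ZZ", ...
n_cols <- max(count.fields("linktofile.txt", sep = ","))
dat <- read.csv("linktofile.txt", header = FALSE, fill = TRUE,
                col.names = paste0("V", seq_len(n_cols)))
by_type <- split(dat, dat$V1)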