
I am currently attempting to use the read_table() function from the readr package on a few large data files. I only want the second column, so I skip all the other columns with this argument in the function:

col_types = c(paste("_", "c", paste(rep("_", 20000), sep = "", collapse = ""), sep = "", collapse = ""))

EDIT: The first and third pairs of quotes in the code above should each contain an underscore ("_").

However, read_table seems to insist on reading in the entire data file (using up excessive memory and causing a crash) instead of reading only column 2.

With read.table(), I have tried a similar argument: colClasses = c("NULL", "character", rep("NULL", 20000)). This works perfectly without taking up excess memory, but I would like to use read_table since it is supposedly faster. Any ideas on why read_table takes up so much memory even though I am including an argument to keep only one column?


1 Answer


If you only want to read the second column of a large data file, you can also use the fread function from the data.table package. The fread function was also developed for (very) fast file reading.

fread has a select argument with which you can determine which columns to load. In your case it would be something like:

dt <- fread("name_of_file.csv", select=2)

This selects only the second column. You can also give it a vector of columns:

dt <- fread("name_of_file.csv", select=c(2,5,10))

or a vector of column names:

dt <- fread("name_of_file.csv", select=c("id","time"))
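
To try the select argument without a large file at hand, here is a minimal, self-contained sketch. The sample file and its column names (id, name, value) are made up for illustration and are not taken from the question; it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical sample file, written to a temporary location for illustration only.
tmp <- tempfile(fileext = ".csv")
fwrite(
  data.table(id = 1:3, name = c("a", "b", "c"), value = c(1.5, 2.5, 3.5)),
  tmp
)

# Load only the second column; the other columns are not kept in the result.
dt <- fread(tmp, select = 2)

print(names(dt))  # name of the single column that was loaded
print(dim(dt))    # number of rows and columns in the result
```

The result is a data.table with a single column, so the same call on a wide file should keep only the column you asked for.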
Jaap
  • I would definitely prefer the fread() function but unfortunately this function results in an error message for me: embedded nul in string 'àáõ“Ô\003\0IÏøá4ÔZM2Ì' – user2205537 Aug 21 '15 at 08:54
  • @user2205537 could you give the full error message (including the command you are using)? – Jaap Aug 21 '15 at 09:48
  • Never mind! Problem solved! The .gz compressed file format was incompatible with fread(). Thanks for your help! – user2205537 Aug 21 '15 at 16:48
  • @user2205537 That's indeed not possible yet. However, such a [feature will be incorporated in the future](https://github.com/Rdatatable/data.table/issues/717). – Jaap Aug 21 '15 at 18:51