
I'm attempting to read a .gz file using data.table's fread function. I have tried the syntax suggested here:

dt = fread("gunzip -c myfile.gz")

but I get a verbose error message:

Error in fread("gunzip -c myfile.gz") : 
  File is empty: C:\Users\MARK~1.MUR\AppData\Local\Temp\RtmpIBawPA\file498c1c4114ef
In addition: Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c (gunzip -c myfile.gz) > C:\Users\MARK~1.MUR\AppData\Local\Temp\RtmpIBawPA\file498c1c4114ef' had status 1 
2: In shell(paste("(", input, ") > ", tt, sep = "")) :
  '(gunzip -c 180227.2101.2017.MRE.csv.gz) > C:\Users\MARK~1.MUR\AppData\Local\Temp\RtmpIBawPA\file498c1c4114ef' execution failed with error code 1

My guess is that access to the temporary file is being denied by my IT masters (?). If that is the case, how do I set the temporary file path to, say, the current directory for the unzip?

Jaap
Markm0705
  • As you are on a Windows machine I suspect you don't have access to command line tools, which might be the reason for this. – Jaap Feb 27 '18 at 12:13
  • When I use the code in the linked Q&A on macOS, it works; but when I use it on a Windows VM, it doesn't. Could you try with `fread(unzip('myfile.gz'))`? – Jaap Feb 27 '18 at 12:22
  • For `.gz` files you need the `gunzip` function from `R.utils`. See also the update of my answer. HTH – Jaap Feb 27 '18 at 13:11

2 Answers


As you are on a Windows PC you probably don't have access to command line tools, which might be the reason for this.

A possible solution might be to unzip first and then read with fread. The following example works on my Windows VM:

library(data.table)

# create a zipped test file
write.csv(mtcars, 'mtcars.csv')
zip('mtcars.csv.zip', 'mtcars.csv')

# extract it, then read the extracted csv
unzip('mtcars.csv.zip')
fread('mtcars.csv')

For .gz files, you can use the gunzip function from R.utils. The following example works for me:

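# create a gzipped test file
write.csv(mtcars, gzfile('mtcars2.csv.gz'))

library(R.utils)
library(data.table)
# gunzip() writes mtcars2.csv and, by default, removes the .gz file
gunzip('mtcars2.csv.gz')
fread('mtcars2.csv')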

Consequently, you might need something like this:

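library(R.utils)
library(data.table)
# by default gunzip() only strips '.gz', so name the output file explicitly
gunzip('myfile.gz', destname = 'myfile.csv')
fread('myfile.csv')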
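
If installing R.utils is not an option either, the same decompress-then-read idea can be sketched with base R connections only, which also keeps the decompressed file in the current directory rather than the temp directory. This is a minimal sketch, not part of the original answer: `gunzip_base` is a hypothetical helper name, and the file names are made up for the demonstration.

```r
library(data.table)

# Hypothetical helper (not from any package): decompress `src` into `dest`
# using only base R connections, copying the data in chunks.
gunzip_base <- function(src, dest, chunk = 1e6) {
  con_in <- gzfile(src, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(dest, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    buf <- readBin(con_in, what = raw(), n = chunk)
    if (length(buf) == 0L) break  # end of the compressed stream
    writeBin(buf, con_out)
  }
  invisible(dest)
}

# Demonstration with a small gzipped test file (assumed file names)
write.csv(mtcars, gzfile("mtcars_demo.csv.gz"), row.names = FALSE)
gunzip_base("mtcars_demo.csv.gz", "mtcars_demo.csv")
dt <- fread("mtcars_demo.csv")
```

Unlike `gunzip()` from R.utils, this sketch leaves the original .gz file in place.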
Jaap
  • Appreciated - I'm sure this works OK, as I have loaded the file successfully (83 million rows by 15 columns) with fread using the unzipped csv file as the source. However, I was curious to see if fread could read directly from the compressed file and perhaps even give a faster read (this took 48 minutes, which was about the same speed as read.table on the same laptop). Perhaps I will try read.table on the compressed file next – Markm0705 Feb 27 '18 at 20:49
  • @Markm0705 Currently, you can't read zipped files directly with `fread`. There is a [feature request on GitHub](https://github.com/Rdatatable/data.table/issues/717) about this, but it isn't implemented yet. – Jaap Feb 27 '18 at 21:07
  • @Markm0705 there is no way fread gives you results comparable to read.table. On my laptop, loading 83 million rows and 15 columns takes 2 minutes... Are you reading the file from the network or something like that? There are many ways to speed this up: using multithreading, reading only part of the file, or using a binary format with the fst package. But fread is the fastest file reader in R, for sure – statquant Feb 28 '18 at 11:56
  • Perhaps I'm missing something, but I'm pretty sure I'm getting comparable results for this file (reading from the local hard drive). I will give it the Pepsi challenge again if I get the chance and report back... – Markm0705 Mar 01 '18 at 02:25

Try read_csv() from the readr package, which handles .gz automatically:

library(readr)
library(data.table)
dt = as.data.table(read_csv("myfile.gz"))

(or another read_* function if it's not a csv)

webb