I want to read in several fixed width format txt files into R but I first need to unzip them.

Since they are very large files I want to use read_fwf from the readr package because it's very fast.

When I do:

read_fwf(unz(zipfileName, fileName), fwf_widths(colWidths, col_names = colNames))

I get this error: Error in isOpen(con) : invalid connection

However when I do:

read.table(unz(zipfileName, fileName)) without specifying widths, it reads into R just fine. Any thoughts as to why this isn't working with read_fwf?

I am having trouble making a reproducible example. Here is what I got:

df <- data.frame(
  rnorm(100),
  rnorm(100)
)

write.table(df, "data.txt", row.names = FALSE, col.names = FALSE)
zip(zipfile = "data.zip", files = "data.txt")
colWidths <- rep(2, 100)
colNames <- c("thing1","thing2")
zipfileName <- "data.zip"
fileName <- "data.txt"  # must match the name of the file inside the zip
Warner
  • I only see one column. I also do not see that you defined 'zipfileName' – IRTFM Jul 31 '16 at 16:14
  • @42- made edits to make example match the problem. – Warner Jul 31 '16 at 16:19
  • Read `?unz` more carefully. In particular: `"The 'description' is the full path to the zip file, with ‘.zip’ extension if required."` – IRTFM Jul 31 '16 at 16:23
  • In my actual code I have the full path to the zip file specified as `description` and the .txt file within the zip as `filename`. The problem is that unzipping works fine when a base R function is wrapped around the `unz` function, but when I use `read_fwf` I get an error. – Warner Jul 31 '16 at 16:34
  • Exactly. So as the doctor (me) says: "If it always hurts when you twist my arm this way, ... then stop doing that". (At least until you send a bug report to Hadley.) – IRTFM Jul 31 '16 at 16:43

1 Answer

I also had trouble getting read_fwf to read a zip file when I passed it an unz() connection, but the ?read_fwf page promises that zipped files are handled automagically, so the zip file's path can be passed directly. Your example file isn't a valid fixed-width file, since neither column has constant positions, and that is apparent in the output:

read_fwf(file="~/data.zip", fwf_widths(widths=rep(16,2) ,col_names = colNames) )
Warning: 1 parsing failure.
row    col expected actual
  3 thing2 16 chars     14
# A tibble: 100 x 2
             thing1               thing2
              <chr>                <chr>
1  1.37170820802141    -0.58354018425322
2  0.03608988699566 7 -0.402708262870141
3  1.02963272114 -1       .0644333112294
4  0.73546166509663  8 0.607941664550652
5  -1.5285547658079   -0.319983522035755
6  -1.4673290956901    0.523579231857175
7  0.24946312418273 9 -0.574046655188405
8  0.58126541455159 5 -0.406516495600345
9   1.5074477698981   -0.496512994239183
10 -2.2999905645658 8 -0.662667854341041
# ... with 90 more rows

The error you were getting came from the unz function, because it expects a full path to a file with a .zip extension as its "description" argument (and apparently won't accept an implicit working-directory location). Its second argument is the name of the compressed file inside the zip file. I think it returns a connection, but not of a type that read_fwf is able to process. Stepping through the parsing by hand, I see that the error both of us got came from this section of code in read_connection:

> readr:::read_connection
function (con) 
{
    stopifnot(is.connection(con))
    if (!isOpen(con)) {
        open(con, "rb")
        on.exit(close(con), add = TRUE)
    }
    read_connection_(con)
}
<environment: namespace:readr>

You didn't give unz a valid "description" argument, and even if you had, the attempt to open the connection with open(con, "rb") fails because of the lack of standardization in arguments across the various file-handling functions.
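
For anyone who wants to reproduce this with a file that really is fixed width, here is a minimal sketch. It assumes readr is installed and that an external zip program is available for zip(); the 10-character column width, the temporary paths, and the use of formatC for padding are my own choices for illustration, not anything from the original example.

```r
library(readr)

# Build a genuinely fixed-width file: two columns, each padded to 10 characters.
set.seed(1)
df <- data.frame(thing1 = rnorm(5), thing2 = rnorm(5))
lines <- paste0(formatC(df$thing1, width = 10, format = "f", digits = 4),
                formatC(df$thing2, width = 10, format = "f", digits = 4))

# Work in a temporary directory so the working directory is left untouched.
tmp <- tempfile("fwf_demo_")
dir.create(tmp)
txt_path <- file.path(tmp, "data.txt")
zip_path <- file.path(tmp, "data.zip")
writeLines(lines, txt_path)

# "-j" drops the directory part, so the archive holds just "data.txt".
# zip() relies on an external zip program being available on the system.
zip(zipfile = zip_path, files = txt_path, flags = "-j9X")

# Pass the path of the zip file itself; no unz() wrapper.
dat <- read_fwf(zip_path,
                fwf_widths(c(10, 10), col_names = c("thing1", "thing2")))
```

With constant column positions there should be no parsing failures, and `dat` should come back with one row per line of the text file.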

IRTFM
  • Fortunately in my case each zip file only contains the one file I want to read into R. I'm curious about how `read_fwf` would handle a zip file with several files. – Warner Jul 31 '16 at 16:49
  • When I gave it a zip file with two items it picked the first one. (I would have expanded the zip file to a full directory and then worked with that if I wanted the second or later file.) – IRTFM Jul 31 '16 at 16:50
  • Thanks for checking this. This is good to know and I think has some implications about how `read_fwf` should handle zip files. – Warner Jul 31 '16 at 16:56
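
The "expand the zip file first" approach from the comments might look roughly like the sketch below. The file names (first.txt, second.txt, both.zip), the 5-character widths, and the column names are made up for illustration, and it again assumes readr plus an external zip program for zip().

```r
library(readr)

tmp <- tempfile("multi_zip_")
dir.create(tmp)

# Two small fixed-width files (each column is 5 characters wide).
writeLines(c("    1    2", "    3    4"), file.path(tmp, "first.txt"))
writeLines(c("   10   20", "   30   40"), file.path(tmp, "second.txt"))

zip_path <- file.path(tmp, "both.zip")
zip(zipfile = zip_path,
    files = file.path(tmp, c("first.txt", "second.txt")),
    flags = "-j9X")

# List the archive contents, then extract only the member we want.
members <- unzip(zip_path, list = TRUE)$Name
out_dir <- file.path(tmp, "extracted")
extracted <- unzip(zip_path, files = "second.txt", exdir = out_dir)

# Read the extracted file as an ordinary (uncompressed) fixed-width file.
second <- read_fwf(extracted,
                   fwf_widths(c(5, 5), col_names = c("a", "b")))
```

Extracting by name avoids depending on which member read_fwf happens to pick when the archive holds more than one file.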