Is there a way to combine the use of scan() and read.big.matrix() from the bigmemory package to read in a 200 MB .csv file with mixed-type columns so that the result is a dataframe with integer, character, and numeric columns?
-
Does it have to be the bigmemory package? I find `ff` much more useful for this sort of stuff. – mdsumner Aug 07 '11 at 12:17
-
@mdsumner is on the right track. Does it even need to be file-backed? For 200 MB, I'd just read it in, work with it, then save it as 1 or more BM files (or in `ff`, if you wish). – Iterator Aug 07 '11 at 12:31
4 Answers
Try the ff package for this.
library(ff)
help(read.table.ffdf)
Function ‘read.table.ffdf’ reads separated flat files into ‘ffdf’ objects, very much like (and using) ‘read.table’. It can also work with any convenience wrappers like ‘read.csv’ and provides its own convenience wrapper (e.g. ‘read.csv.ffdf’) for R's usual wrappers.
For 200 MB it should be as simple a task as this:
x <- read.csv.ffdf(file=csvfile)
(For much bigger files it will likely require that you investigate some of the configuration options, depending on your machine and OS).
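If you do need to tune it, here is a minimal sketch of chunked reading (the file name and chunk sizes are hypothetical; `first.rows`, `next.rows`, and `VERBOSE` are arguments of `read.table.ffdf`, which `read.csv.ffdf` wraps):

library(ff)

# Read the csv in chunks; a small first chunk lets ff sniff the
# column types before committing to larger reads.
x <- read.csv.ffdf(file = "big.csv",    # hypothetical file name
                   first.rows = 1000,   # rows in the initial chunk
                   next.rows = 50000,   # rows per subsequent chunk
                   VERBOSE = TRUE)      # report progress per chunk

dim(x)  # an ffdf knows its dimensions without loading the data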

-
Thank you mdsumner. I tried the ff package and was able to read in the almost 300 MB dataset, which I stored in an object and later coerced into a dataframe with as.data.frame. However, this ate up so much memory that there was little left for analysis. It was a good start though, and a helpful suggestion. – Lourdes Aug 15 '11 at 09:48
-
The entire point is not to load it all in but to use the memory-mapped features of the ff package. There are tools to extract portions from the ff data structures. – mdsumner Aug 15 '11 at 11:37
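For example, a minimal sketch of extracting portions (assuming `x` is the ffdf from the answer above; `chunk()` is ff's helper for splitting the rows into memory-sized ranges):

head_df <- x[1:1000, ]  # only these rows come into RAM, as a data.frame

# Or walk the whole dataset in memory-sized pieces:
for (i in chunk(x)) {
  block <- x[i, ]  # one chunk as an ordinary data.frame
  # ... analyse block here, accumulate results ...
}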
Ah, there are some things that are impossible in this life, and there are some that are misunderstood and lead to unpleasant situations. @Roman is right: a matrix must be of one atomic type. It's not a dataframe.
Since a matrix must be of one type, attempting to snooker bigmemory into handling multiple types is, in itself, a bad thing. Could it be done? I'm not going there. Why? Because everything else will assume that it's getting a matrix, not a dataframe. That will lead to more questions and more sorrow.
Now, what you can do is identify the type of each column and generate a set of distinct bigmemory files, each containing the items that are of a particular type. E.g. charBM = character big matrix, intBM = integer big matrix, and so on. Then, you may be able to write a wrapper that produces a data frame out of all of this. Still, I don't recommend it: treat the different items as what they are, or coerce homogeneity if you can, rather than try to produce a big dataframe griffin.
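For what it's worth, a minimal sketch of the split-by-type idea (the data frame and its columns are hypothetical; note that bigmemory stores only numeric types, so character columns have to be encoded as integer codes first):

library(bigmemory)

df <- data.frame(id = 1:5,             # integer column
                 name = letters[1:5],  # character column
                 val = runif(5),       # numeric column
                 stringsAsFactors = FALSE)

# One big.matrix per atomic type:
intBM <- as.big.matrix(as.matrix(df[sapply(df, is.integer)]), type = "integer")
numBM <- as.big.matrix(as.matrix(df[sapply(df, is.double)]), type = "double")

# Character data cannot be stored directly: keep the levels on the
# side and store integer codes instead.
charCols <- df[sapply(df, is.character)]
charLevels <- lapply(charCols, unique)
charCodes <- mapply(function(col, lev) match(col, lev), charCols, charLevels)
charBM <- as.big.matrix(as.matrix(charCodes), type = "integer")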
@mdsumner is correct in suggesting `ff`. Another storage option is HDF5, which you can access through `ncdf4` in R. Unfortunately, these other packages are not as pleasant as `bigmemory`.

-
Thanks Iterator. You are right, the other packages are not as pleasant as bigmemory. – Lourdes Aug 15 '11 at 09:49
According to the help file, no.
Files must contain only one atomic type (all integer, for example). You, the user, should know whether your file has row and/or column names, and various combinations of options should be helpful in obtaining the desired behavior.
I'm not familiar with this package/function, but in R, matrices can have only one atomic type (unlike e.g. data.frames).
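A quick illustration of that difference:

m <- rbind(c(1L, 2L), c("a", "b"))
typeof(m)  # "character" -- the integers were silently coerced

df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
sapply(df, class)  # each column keeps its own type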

-
Thanks for your two cents. On this blog, http://joshpaulson.wordpress.com/2010/12/20/michael-kane-on-bigmemory/, someone suggested that a workaround for the limitation of matrices having only one atomic type (a characteristic inherited by big.matrix) is to use scan(). I was hoping someone could share their experiences with read.big.matrix from the bigmemory package, especially with regard to reading in mixed-type columns and whether they have used scan(). – Lourdes Aug 07 '11 at 06:56
-
Maybe you can do that in the processing stage, but I would like to be proven wrong (sensu @Iterator). – Roman Luštrik Aug 07 '11 at 09:26
The best solution is to read the file line by line and parse it as you go; that way, the reading process holds only a small, bounded piece of the file in memory at any one time.
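A minimal sketch of that approach in base R (the file name, chunk size, and comma delimiter are assumptions):

con <- file("big.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

# Read a bounded number of lines at a time, so memory use stays
# small no matter how large the file is.
while (length(lines <- readLines(con, n = 1000)) > 0) {
  fields <- strsplit(lines, ",")
  # ... parse and process this chunk, e.g. accumulate summaries ...
}
close(con)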

-
Welcome to Stack Overflow! However, this does not answer the question, which was specifically aimed at the bigmemory package. – Paul Hiemstra Mar 23 '13 at 14:17