I'm trying to load data from data frame objects of all .RData files in a specified directory into a single data table. This is how I've tried to do this:

library(data.table)

fileList <- list.files("../cache/FLOSSmole", pattern="\\.RData$", full.names=TRUE)
dataset <- rbindlist(lapply(fileList, FUN=function(file) {as.data.table(load(file))}))

However, the result is not what I expected (a single data table containing all the data): it contains only the names of the data frame objects from the source .RData files:

> str(dataset)
Classes ‘data.table’ and 'data.frame':  39 obs. of  1 variable:
 $ V1: chr  "lpdOfficialBugTags" "lpdLicenses" "lpdMilestones" "lpdSeries" ...
 - attr(*, ".internal.selfref")=<externalptr>
> head(dataset)
                        V1
1:      lpdOfficialBugTags
2:             lpdLicenses
3:           lpdMilestones
4:               lpdSeries
5:             lpdProjects
6: lpdProgrammingLanguages

What am I doing wrong? Your help is greatly appreciated!

My R environment:

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.9.2

loaded via a namespace (and not attached):
[1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  stringr_0.6.2 tools_3.1.0
Arun
Aleksandr Blekh
  • @Arun: Thank you for cleaning up the tags list. – Aleksandr Blekh Jul 08 '14 at 08:57
  • A much easier, faster and simpler way is to use `fread` instead of `.Rdata`. `rbindlist(lapply(files, fread))`... – Arun Jul 08 '14 at 09:07
  • @Arun: I have seen this code earlier when researching the topic, but my understanding is that `fread()` is a `data.table`'s alternative for `read.table()`. If I collect data from original sources and want to store it efficiently as an application's cache (`.RData` or `.rds`), I still have to use `save()` and `load()` or `saveRDS()` and `readRDS()`, correspondingly, don't I? – Aleksandr Blekh Jul 08 '14 at 10:36
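
On the `.rds` route mentioned in this comment: `readRDS()` returns the stored object itself, so unlike `load()` it can be used directly inside `lapply()`. A small sketch, assuming one data frame per `.rds` file with matching columns; the temporary directory and file names are made up for illustration.

```r
library(data.table)

# Demo cache (assumption): two .rds files in a temporary directory
cacheDir <- tempfile("rdsCache")
dir.create(cacheDir)
saveRDS(data.frame(id = 1:2, lang = c("C", "R")), file.path(cacheDir, "part1.rds"))
saveRDS(data.frame(id = 3L, lang = "Go"), file.path(cacheDir, "part2.rds"))

# readRDS() returns each data frame, so the list can go straight to rbindlist()
rdsFiles <- list.files(cacheDir, pattern = "\\.rds$", full.names = TRUE)
combined <- rbindlist(lapply(rdsFiles, readRDS))
```

`fread()` would still need delimited text files as input, so this variant only applies when the cache is already stored with `saveRDS()`.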

1 Answer

An .RData file is a saved workspace: it may contain data frames, but it is not itself a data frame. How many data frames are there in each .RData file? You can load multiple .RData files, and their objects are all added to the current workspace. Just load them all, then merge or rbind the data frames once they are in your current workspace:

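# load() returns only the *names* of the objects it restores, so collect those names.
# A for loop is used because load() inside lapply() restores the objects into the
# function's temporary environment, which disappears when the function returns.
objNames <- character(0)
for (f in fileList) {
  objNames <- c(objNames, load(f))  # restores into the current workspace
}
# Fetch the loaded objects by name instead of scanning ls(), which would also
# pick up fileList, the loop variable and my.list itself
my.list <- mget(objNames)
# Assumes all loaded data frames share the same columns
my.rbind <- do.call(rbind, my.list)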

This is one way. An easier way would be to save individual tables as delimited text files in the first place.
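
For the single data table the question asks for, another option is to load each file into its own environment, so nothing else in the workspace gets picked up. This is a minimal sketch, not the answerer's code: `loadTables` is a hypothetical helper name, the demo files are written to a temporary directory rather than `../cache/FLOSSmole`, and it assumes every stored data frame has the same columns.

```r
library(data.table)

# Hypothetical helper: return every data frame stored in one .RData file
loadTables <- function(file) {
  env <- new.env()
  objNames <- load(file, envir = env)   # load() returns object names only
  Filter(is.data.frame, mget(objNames, envir = env))
}

# Demo data (assumption): two small files with identical columns
cacheDir <- tempfile("cache")
dir.create(cacheDir)
dfA <- data.frame(id = 1:2, name = c("a", "b"))
dfB <- data.frame(id = 3:4, name = c("c", "d"))
save(dfA, file = file.path(cacheDir, "a.RData"))
save(dfB, file = file.path(cacheDir, "b.RData"))

fileList <- list.files(cacheDir, pattern = "\\.RData$", full.names = TRUE)
# Flatten the per-file lists into one list of data frames, then stack them
tables <- unlist(lapply(fileList, loadTables), recursive = FALSE)
dataset <- rbindlist(tables)
```

Because each file gets a fresh environment, this also works when a file holds several data frames, as long as their columns match.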

JeremyS
  • Thank you for the answer! Currently, each `.RData` file contains a single data frame. I left such design for this piece of code (instead of using `.rds` files for storing single objects, as I do in other places), as I plan to implement a single multi-object `.RData` file solution. I do merge data tables, but after converting data frames to data tables first. So, what's wrong? – Aleksandr Blekh Jul 08 '14 at 08:02
  • `.RData`, even if it only contains a single data.frame, can't be used as though it was a data.frame. – JeremyS Jul 08 '14 at 08:05
  • `load(.RData)` adds .RData to your current workspace. You need to call `as.data.table` on the data.frame within that workspace, not on the load command. – JeremyS Jul 08 '14 at 08:09
  • I see. I guess, I understand now... The underlying reason for my problem is that `load()` returns "a character vector of the names of objects created". But how do I get the reference to a data frame object to pass to `as.data.table()`? I'm thinking about using `get()` for this. – Aleksandr Blekh Jul 08 '14 at 08:13
  • I'm confused: when I call `load()` on an individual `.RData` file, corresponding data frame is loaded into my global workspace, but when I do the same for all files via `lapply()`, no data frames are loaded. – Aleksandr Blekh Jul 08 '14 at 08:28
  • I changed it to a for loop which is often better for modifying something in the global environment and added how to make a list of the data frames after you load them. Moral of the story: individual `.RData` files is not wise. – JeremyS Jul 08 '14 at 08:33
  • Thank you very much! So, I was right about `get()` - at least, I'm not completely lost in the ocean of `R`, but sometimes I feel that way... :-). Accepting your answer and thank you again! – Aleksandr Blekh Jul 08 '14 at 08:39
  • In regard to your note on saving individual tables, the whole point of my efforts to merge data is to make it easier to access/query data that I need for further statistical analysis. – Aleksandr Blekh Jul 08 '14 at 08:44
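
The behaviour discussed in these comments can be checked with a small experiment. This sketch uses base R only; the object name `demoDf`, the helper `loadInside` and the temporary file are made up for illustration. It suggests that `load()` returns a character vector of names, and that when called inside a function (as with `lapply()`) it restores objects into that function's environment rather than the global workspace.

```r
# Save a small data frame, then remove it from the workspace
demoDf <- data.frame(x = 1:3)
rdataFile <- tempfile(fileext = ".RData")
save(demoDf, file = rdataFile)
rm(demoDf)

# Inside a function, load() restores into the function's own frame
loadInside <- function(file) {
  objNames <- load(file)
  list(names = objNames,
       visibleInside = exists("demoDf", envir = environment(), inherits = FALSE))
}
res <- lapply(rdataFile, loadInside)[[1]]
res$names           # the object's name as a character vector, not the data
res$visibleInside   # whether the object existed inside the function
existsAfterLapply <- exists("demoDf")   # whether it reached the calling workspace

# get() turns the returned name into the object itself
env <- new.env()
restored <- get(load(rdataFile, envir = env), envir = env)
```

For files holding more than one object, `mget()` would be needed in place of `get()`, since `get()` accepts a single name.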