
I had an .rda file with a large list, that looked like this:

[[1]] NULL
[[2]] NULL
...
[[1000]] (some data)
...

The first K empty rows (999 in the example) were created by a bug in the code, so I decided to delete rows 1:K. After saving the file, it had grown hugely: before it was <1 GB; afterwards it was >16 GB. How could that be, and how can I fix it?

I can imagine that the problem is that before editing, the list had indices 1 to N, and after editing it contains only indices K+1 to N; but can that really make such a difference? If that is the problem, how do I reset the indexing?

Tim
  • this is interesting, but a (small!) reproducible example (using `save()` and `file.size()`) would be very useful. – Ben Bolker Nov 02 '14 at 18:02
  • 1
    What code did you use to remove the NULLs. Those aren't necessarily "rows" unless you had a data frame to begin with. Example of how row indexing fails `replicate(5, NULL)[3,]` – Rich Scriven Nov 02 '14 at 18:05
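To illustrate the distinction in that comment: whether a NULL disappears or stays depends on how the assignment is written. A minimal sketch (the variable names here are made up, not the asker's code):

```r
z <- list(NULL, NULL, "data")

# `[[<-` with NULL removes the element, shrinking the list:
z1 <- z
z1[[1]] <- NULL
length(z1)   # 2

# `[<-` with list(NULL) keeps the slot, still holding NULL:
z2 <- z
z2[1] <- list(NULL)
length(z2)   # 3

# Dropping every NULL element at once:
z3 <- Filter(Negate(is.null), z)
length(z3)   # 1
```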

2 Answers


I can't easily replicate this, but I offer this template: perhaps, as @RichardScriven comments above, you can tell us how you deleted the NULL values?

Make up data:

set.seed(101)
z1 <- replicate(1000,runif(1000),simplify=FALSE)
z1[1:500] <- replicate(500,NULL)

Save and check file size:

save("z1",file="tmp.rda")
file.size("tmp.rda")
## [1] 2666278

Keep only the last 500 elements:

z2 <- z1[501:1000]
save("z2",file="tmp2.rda")
file.size("tmp2.rda")
## [1] 2666249

Gets ever-so-slightly smaller.
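That is what you would expect: a NULL element costs only a few bytes in the serialized stream, so dropping 500 of them barely moves the total. A quick check with `serialize()` (the exact byte counts may vary by R version):

```r
# 500 NULL elements serialize to only a few bytes each plus a small header
n500 <- length(serialize(vector("list", 500), NULL))
n0   <- length(serialize(list(), NULL))
n500 - n0   # on the order of a couple of kilobytes, not gigabytes
```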

Replacing NULL with numeric(0) makes the result ever-so-slightly larger.

z3 <- z1
z3[1:500] <- replicate(500,numeric(0))
save("z3",file="tmp3.rda")
file.size("tmp3.rda")
## [1] 2666290
Ben Bolker

The file may need a different compression scheme after removing the NULLs. When you re-saved it, R re-compressed it under the same (default) scheme, which may no longer be a good fit now that the list is many times smaller.

From `?save`:

... a saved file can be uncompressed and re-compressed under a different compression scheme (and see resaveRdaFiles for a way to do so from within R).

So when I run `resaveRdaFiles` on the `z2` file from Ben Bolker's answer, it gets a good chunk smaller:

file.info("tmp2.rda")[,1]
# [1] 2666373
tools::resaveRdaFiles("tmp2.rda")
file.info("tmp2.rda")[,1]
# [1] 2210736
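An alternative to `resaveRdaFiles` is to pick the compression scheme yourself at save time: `save()` accepts `compress = "gzip"` (the default), `"bzip2"`, or `"xz"`. A small sketch (file names are made up; which scheme wins depends on the data):

```r
set.seed(101)
x <- replicate(500, runif(100), simplify = FALSE)

save(x, file = "gz.rda", compress = "gzip")  # the default
save(x, file = "xz.rda", compress = "xz")    # often smaller for large objects

file.size("gz.rda")
file.size("xz.rda")
```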
Rich Scriven
  • interesting, but I'm mildly skeptical. What does `resaveRdaFiles` do to `tmp.rda`? The default compression is `gzip` (see `?save`), whereas `resaveRdaFiles` tries out several different compression schemes and picks the best one. – Ben Bolker Nov 02 '14 at 18:47
  • @BenBolker - size is 2212664 on tmp.rda – Rich Scriven Nov 02 '14 at 19:05