
I had an .rda file with a large list, that looked like this:

[[1]] NULL
[[2]] NULL
...
[[1000]] (some data)
...

The first K empty rows (999 in the example) were created by a bug in the code, so I decided to delete rows 1:K. After saving the file, it had grown hugely: before it was <1 GB; afterwards it was >16 GB. How could that be, and how can I fix it?

I can imagine that the problem is that before editing, the list had indices 1 to N, and after editing it contains only indices K+1 to N; but can that really make such a difference? If that is the problem, how do I reset the indexing?

Tim
  • this is interesting, but a (small!) reproducible example (using `save()` and `file.size()`) would be very useful. – Ben Bolker Nov 02 '14 at 18:02
  • 1
    What code did you use to remove the NULLs. Those aren't necessarily "rows" unless you had a data frame to begin with. Example of how row indexing fails `replicate(5, NULL)[3,]` – Rich Scriven Nov 02 '14 at 18:05
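To illustrate the distinction in that comment: whether a NULL disappears or stays depends on how the assignment is written. A minimal sketch (the variable names here are made up, not the asker's code):

```r
z <- list(NULL, NULL, "data")

# `[[<-` with NULL removes the element, shrinking the list:
z1 <- z
z1[[1]] <- NULL
length(z1)   # 2

# `[<-` with list(NULL) keeps the slot, still holding NULL:
z2 <- z
z2[1] <- list(NULL)
length(z2)   # 3

# Dropping every NULL element at once:
z3 <- Filter(Negate(is.null), z)
length(z3)   # 1
```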

2 Answers


I can't easily replicate this, but I offer this template: perhaps, as @RichardScriven comments above, you can tell us how you deleted the NULL values?

Make up data:

set.seed(101)
z1 <- replicate(1000,runif(1000),simplify=FALSE)
z1[1:500] <- replicate(500,NULL)

Save and check file size:

save("z1",file="tmp.rda")
file.size("tmp.rda")
## [1] 2666278

Keep only the last 500 elements:

z2 <- z1[501:1000]
save("z2",file="tmp2.rda")
file.size("tmp2.rda")
## [1] 2666249

Gets ever-so-slightly smaller.
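That is what you would expect: a NULL element costs only a few bytes in the serialized stream, so dropping 500 of them barely moves the total. A quick check with `serialize()` (the exact byte counts may vary by R version):

```r
# 500 NULL elements serialize to only a few bytes each plus a small header
n500 <- length(serialize(vector("list", 500), NULL))
n0   <- length(serialize(list(), NULL))
n500 - n0   # on the order of a couple of kilobytes, not gigabytes
```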

Replacing NULL with numeric(0) makes the result ever-so-slightly larger.

z3 <- z1
z3[1:500] <- replicate(500,numeric(0))
save("z3",file="tmp3.rda")
file.size("tmp3.rda")
## [1] 2666290
Ben Bolker

The file may need a different compression scheme after removing the NULLs. When you re-saved it, R re-compressed it under the same (default) scheme, which may no longer be a good fit now that the list is many times smaller.

From `?save`:

... a saved file can be uncompressed and re-compressed under a different compression scheme (and see resaveRdaFiles for a way to do so from within R).

So when I run `resaveRdaFiles` on the `z2` file from Ben Bolker's answer, it gets a good chunk smaller:

file.info("tmp2.rda")[,1]
# [1] 2666373
tools::resaveRdaFiles("tmp2.rda")
file.info("tmp2.rda")[,1]
# [1] 2210736
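An alternative to `resaveRdaFiles` is to pick the compression scheme yourself at save time: `save()` accepts `compress = "gzip"` (the default), `"bzip2"`, or `"xz"`. A small sketch (file names are made up; which scheme wins depends on the data):

```r
set.seed(101)
x <- replicate(500, runif(100), simplify = FALSE)

save(x, file = "gz.rda", compress = "gzip")  # the default
save(x, file = "xz.rda", compress = "xz")    # often smaller for large objects

file.size("gz.rda")
file.size("xz.rda")
```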
Rich Scriven
  • interesting, but I'm mildly skeptical. What does `resaveRdaFiles` do to `tmp.rda`? The default compression is `gzip` (see `?save`), whereas `resaveRdaFiles` tries out several different compression schemes and picks the best one. – Ben Bolker Nov 02 '14 at 18:47
  • @BenBolker - size is 2212664 on tmp.rda – Rich Scriven Nov 02 '14 at 19:05