
I have a disk frame with these columns

key_a
key_b
key_c
value

Say the disk frame has 200M rows and I'd like to group it by key_b. Additionally, I want to keep the underlying disk frame intact and unchanged so that I can later join it to something else on key_c or aggregate it on key_a. I'm concerned that srckeep affects the underlying disk frame.

Will either of these work? If so, can I expect one to be faster than the other?

  df %>%
    srckeep(c("value", "key_b")) %>%
    group_by(key_b) %>%
    summarize(avg = mean(value)) %>%
    collect

  df[
    keep = c("value", "key_b"),
    .(avg = mean(value)),
    .(key_b)
    ]

How will either of these aggregations affect the underlying disk frame? I had an experience earlier where I assigned an aggregation to a variable and then ran delete(aggregation), but it deleted the entire disk frame.

Cauder
  • `disk.frame` by default uses `data.table`. So, it may be more natural to use data.table syntax as it can be faster – akrun Sep 11 '20 at 15:57
  • That's awesome. Does that mean I can change a disk frame by reference? – Cauder Sep 11 '20 at 15:58
  • I assume it is the case. Please do check the vignettes – akrun Sep 11 '20 at 15:59
  • Actually dplyr syntax is much better supported atm due to my bandwidth limitations. You can't really change the disk.frame on disk unless you overwrite it. I need to make this more clear in the docs etc. – xiaodai Sep 17 '20 at 02:26

1 Answer

When you apply an operation, it doesn't change the underlying disk.frame at all!

srckeep only affects which columns get read! It loads only the columns named in srckeep into memory when doing the processing. Again, it doesn't affect the underlying data at all.
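To illustrate the point about srckeep, here is a minimal sketch on a toy disk.frame. The data, the `toy.df` folder name, and the single-worker setup are made up for the example and stand in for the 200M-row disk.frame from the question. It runs the dplyr-style aggregation and then checks that the on-disk data still has all of its columns and rows.

```r
library(disk.frame)
library(dplyr)

setup_disk.frame(workers = 1)  # one worker is plenty for a toy example

# Hypothetical stand-in for the real data (assumption: same four columns)
toy <- data.frame(
  key_a = rep(c("a1", "a2"), each = 4),
  key_b = rep(c("b1", "b2"), times = 4),
  key_c = 1:8,
  value = c(1, 2, 3, 4, 5, 6, 7, 8)
)
df <- as.disk.frame(toy, outdir = file.path(tempdir(), "toy.df"),
                    nchunks = 2, overwrite = TRUE)

cols_before <- names(df)
rows_before <- nrow(df)

# srckeep() takes a character vector of the columns to load
agg <- df %>%
  srckeep(c("value", "key_b")) %>%
  group_by(key_b) %>%
  summarize(avg = mean(value)) %>%
  collect()

# The disk.frame on disk still has every column and every row
stopifnot(setequal(names(df), cols_before), nrow(df) == rows_before)
```

After this runs, `df` can still be joined on key_c or aggregated on key_a, because only the in-memory copies of the chunks were restricted to two columns.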

The only way to change the underlying data is to write over it explicitly, for example with write_disk.frame(some_other_diskf, "to/location_of_disk.frame.df", overwrite=TRUE), which will overwrite the old disk.frame.

The disk.frame is always on disk. You can see where it is with attr(diskf, "path")
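As a small sketch of the path attribute (the `path_demo.df` folder and the two-row data are hypothetical), the snippet below shows that a lazy, uncollected result of srckeep points at the same folder as the original disk.frame. This may also explain the delete() experience in the question: assuming delete() removes the folder found at that path, calling it on an uncollected result would remove the original files.

```r
library(disk.frame)

setup_disk.frame(workers = 1)

# Hypothetical two-row disk.frame, written to a temporary folder
df <- as.disk.frame(
  data.frame(key_b = c("b1", "b2"), value = c(1, 2)),
  outdir = file.path(tempdir(), "path_demo.df"),
  nchunks = 1, overwrite = TRUE
)

# Lazy: nothing is computed or copied at this point
lazy <- srckeep(df, c("value", "key_b"))

attr(df, "path")    # folder holding the chunk files
attr(lazy, "path")  # the same folder, not a copy

same_path <- identical(attr(df, "path"), attr(lazy, "path"))
```

Calling collect() first turns the result into an ordinary in-memory data frame, which no longer points at the files on disk.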

xiaodai