I have a disk frame with these columns:
key_a
key_b
key_c
value
Say the disk frame has 200M rows and I'd like to group it by key_b. I also want to keep the underlying disk frame intact and unchanged, so that I can later join it to something else on key_c or aggregate it on key_a. I'm concerned that srckeep affects the underlying disk frame.
Will either of these work? If so, can I expect one to be faster than the other?
df %>%
srckeep(c("value", "key_b")) %>%
group_by(key_b) %>%
summarize(avg = mean(value)) %>%
collect
df[
keep = c("value", "key_b"),
.(avg = mean(value)),
.(key_b)
]
How will either of these aggregations affect the underlying disk frame? I had an experience earlier where I assigned an aggregation to a variable and then ran delete(aggregation), but it deleted the entire disk frame.