tl;dr: I have an aggregation problem that I haven't seen in documentation before. I manage to get it done, but it is far too slow for the intended application. The data I usually work with have around 500 lines (my gut feeling tells me this isn't much for dplyr), and according to system.time
it runs for about 4 s. My dilemma is that I want to run it repeatedly inside an optimisation, and currently I am looking at hours of run time.
Do you see anything where I can shave off some time?
If need be I can also send some data I work with.
Algorithm
I have a data set:
sample_dataset <- data_frame(
  starts = c(1000, 1008, 1017, 2000, 2020, 3000),
  ends   = c(1009, 1015, 1020, 2015, 2030, 3010),
  v      = list(rep(1, 10), rep(2, 8), rep(3, 4),
                rep(4, 16), rep(5, 11), rep(6, 11))
)
so each line encodes a signal together with a start and end index. I want to aggregate all lines that are less than closeness
(e.g. 10) apart into a single line. In case it matters, starts
is ordered.
The output should be:
structure(list(inds = 1:3, starts = c(1000, 2000, 3000), ends = c(1020,
2030, 3010), v = list(c(1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 2, 2, 2,
2, 2, 2, 0, 3, 3, 3, 3), c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), c(6,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .Names = c("inds", "starts", "ends",
"v"))
So the first three lines in the original data set are aggregated into one, lines 4 and 5 are aggregated, and line 6 is unchanged. For overlaps the values should be added, for gaps zeros should be filled in. The updated starts value is the first starts, the updated ends is the last ends (I suppose I should really take the max). But by the way these are generated, ends should also be sorted. The case where one block is completely surrounded by another block cannot occur.
I achieved this with the following code:
Code
library(dplyr)

# Merge all rows of one group into a single row: the combined value vector
# spans from the first starts to the last ends, overlaps are added,
# and gaps stay zero.
join_lines <- function(dfi) {
  if (nrow(dfi) == 1) return(select(dfi, starts, ends, v))
  with(dfi, {
    start <- starts[[1]]
    end   <- ends[[length(ends)]]
    vals  <- numeric(end - start + 1)            # pre-filled with zeros
    add_val <- function(ddf)
      with(ddf, {
        vals[(starts - start + 1):(ends - start + 1)] <<-
          vals[(starts - start + 1):(ends - start + 1)] + v
      })
    dfi %>% rowwise() %>% do(tmp = add_val(.))   # add each row's signal in place
    data_frame(starts = start, ends = end, v = list(vals))
  })
}

# Assign a group index: a new group starts whenever the gap to the previous
# block is at least `closeness`, then merge each group with join_lines().
simplify_semisparse <- function(aframe, closeness = 10) {
  aframe %>%
    mutate(join_pre = lag(ends, default = 0) + closeness >= starts,
           inds = cumsum(!join_pre)) %>%
    group_by(inds) %>%
    do(join_lines(.)) %>%
    ungroup()
}
res <- simplify_semisparse(sample_dataset)
dput(res) # see above
Background
The data I am dealing with comes from mass spectrometry. It is peculiar in that a vector has around 500,000 entries, of which less than 10% are non-zero; a typical spectrum has around 500 such dense blocks. I need to quickly interpolate values along such a spectrum - my idea was to use approx
in the "dense" regions.
Comparison of suggestions
I had the chance to compare your suggestions.
The results produced by @matt-jewett's solution did not agree with my intended ones, so I excluded it.
@jeremycgs's solution was closest to my original approach, but also did not produce exactly the same results.
Most important for me is the runtime; I am comparing using production data. My original solution took 2.165 s, @tjeremy's suggestion took 0.532 s, and @uwe-block's 0.012 s.
So wow - I need to learn data.table.
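For anyone landing here later, the following is only my rough sketch of how the same grouping idiom might look in data.table (it is not @uwe-block's actual answer, and the helper name simplify_semisparse_dt is my own); on the sample data it reproduces the blocks shown above:

library(data.table)

simplify_semisparse_dt <- function(adf, closeness = 10) {
  dt <- as.data.table(adf)
  # new group whenever the gap to the previous block is at least `closeness`
  dt[, inds := cumsum(starts > shift(ends, fill = 0) + closeness)]
  dt[, {
    start <- starts[1L]
    end   <- ends[.N]
    vals  <- numeric(end - start + 1)          # zeros fill the gaps
    for (i in seq_len(.N)) {
      idx <- (starts[i] - start + 1):(ends[i] - start + 1)
      vals[idx] <- vals[idx] + v[[i]]          # overlaps are added
    }
    list(starts = start, ends = end, v = list(vals))
  }, by = inds]
}

simplify_semisparse_dt(sample_dataset)   # same three blocks as above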