I need to write a function that can smooth a vector without losing the original rank order of the vector's values. What I've come up with is the following:
#1 Sort all values of vector in ascending order
#2 for the kth value in vector s_k in the ordered list, collect the list of 2N+1 values in the window of values between [s_{k-N}, s_{k+N}]
#3 by definition, s_k is the median of the values in that window
#4 replace s_k with the mean of value in that same window for all values of k
Ideally, I'd like to be able to write a function that relies on dbplyr
as I'm working with remote data but this is not absolutely necessary as I can break the data out in chunks so base R is OK too. Similarly, this could also be all postgressql code or paritial sql partial dbplyr
and it would be just the same but there are a few requirements. I need to be able to parameterize N
and i need to be able to feed the function a list of dataframes or set of tables(if in a database) to loop through (in R this is simple, a function with a single parameter for N
inside a lapply
wrapper).
Here's what I got thus far for N=3
:
#Example Data
s <- rnorm(1000, mean=50, sd=10)
test.in <- as.data.frame(s)
test.in$id <- 1:length(s)
#Non parameterized attempt
test.out <- test.in %>%
rename(s = union_v_corporate_candidate) %>%
mutate(lag_k_3 = lag(s, 3),
lead_k_3 = lead(s, 3),
lag_k_2 = lag(s, 2),
lead_k_2 = lead(s, 2),
lag_k_1 = lag(s, 1),
lead_k_1 = lead(s, 1)) %>%
mutate(window_mean = (lag_k_3 + lead_k_3 + lag_k_2 + lead_k_2 + lag_k_1 + lead_k_1 + s)/7) %>%
select(id, s, window_mean)
The problem with the logic of above approach is that I wont be able parameterize N
as each additional N
value would require two additional mutate clauses.