Iterate a filter function over a data frame using start and end date variables with a Purrr map() function

Question

This is an example of the messy data I have to work with. Note the significant month long gaps in the date variable, and multiple sales on the same date.

    > print(sold_data, n = 36)
# A tibble: 36 × 2
       close_date close_price
       <date>           <dbl>
     1 2021-04-13      820000
     2 2021-06-18      780000
     3 2021-06-22      995000
     4 2021-09-07      905000
     5 2021-10-22     1250000
     6 2021-12-07      950000
     7 2021-12-14     1025000
     8 2021-12-21      860000
     9 2021-12-30     1040000
    10 2022-01-04     1200000
    11 2022-01-19      870800
    12 2022-01-24      985000
    13 2022-02-04     1020000
    14 2022-03-09     1504500
    15 2022-03-10     1015000
    16 2022-04-04      950000
    17 2022-05-17     1575000
    18 2022-05-18     1070000
    19 2022-05-24     1174000
    20 2022-05-25     1250000
    21 2022-06-10      900000
    22 2022-06-10      885000
    23 2022-07-01     1200000
    24 2022-07-08      850000
    25 2022-07-29     1282500
    26 2022-08-16     1200000
    27 2022-08-22      895000
    28 2022-09-02      975000
    29 2022-09-16     1330000
    30 2022-09-21     1200000
    31 2022-09-30      889000
    32 2022-11-07     1086365
    33 2022-11-22     1232000
    34 2022-12-01     1075000
    35 2023-02-14     1000000
    36 2023-03-13     1153000

This is the code I wrote for a start and end date sequence that is three months apart.

effective_date <- ymd("2023-03-25")

n_month_seq <- 1:24 * months (1)

end_24 <- effective_date - years (2) + n_month_seq

start_24 <- effective_date - years (2) - months(3) + n_month_seq

Then this is the code I wrote for the summary function

n_month_summary <- function(data) {
  summarise(data, min_price = min(close_price), max_price = max(close_price), 
            med_price = median(close_price), count = n())
}

Then this is the simple code to filter the sold_data tibble by each start and end date, then run the summary function.

sum_sold_data <- sold_data %>%
  filter(close_date > start_24[1], 
         close_date <= end_24[1]) %>%
  n_month_summary()

This code all works, except only one line at a time. I went through a purrr tutorial on datacamp, but I'm not skilled enough to know how to iterate my filter code through the start and end date lists to get this final result that I want. I am embarrassed to point out I ran the above filter code 24 separate times, then using cbind to create a data frame of min, max, med_price and count. Then using rbind and relocate functions added the start_24 and end_24 date sequences to get the following data frame that is what I want.

> print(sum_sold_data01_24)
     start_24     end_24 min_price max_price med_price count
1  2021-01-25 2021-04-25    820000    820000    820000     1
2  2021-02-25 2021-05-25    820000    820000    820000     1
3  2021-03-25 2021-06-25    780000    995000    820000     3
4  2021-04-25 2021-07-25    780000    995000    887500     2
5  2021-05-25 2021-08-25    780000    995000    887500     2
6  2021-06-25 2021-09-25    905000    905000    905000     1
7  2021-07-25 2021-10-25    905000   1250000   1077500     2
8  2021-08-25 2021-11-25    905000   1250000   1077500     2
9  2021-09-25 2021-12-25    860000   1250000    987500     4
10 2021-10-25 2022-01-25    860000   1200000    985000     7
11 2021-11-25 2022-02-25    860000   1200000   1002500     8
12 2021-12-25 2022-03-25    870800   1504500   1020000     7
13 2022-01-25 2022-04-25    950000   1504500   1017500     4
14 2022-02-25 2022-05-25    950000   1575000   1174000     7
15 2022-03-25 2022-06-25    885000   1575000   1070000     7
16 2022-04-25 2022-07-25    850000   1575000   1122000     8
17 2022-05-25 2022-08-25    850000   1282500    900000     7
18 2022-06-25 2022-09-25    850000   1330000   1200000     8
19 2022-07-25 2022-10-25    889000   1330000   1200000     7
20 2022-08-25 2022-11-25    889000   1330000   1143182     6
21 2022-09-25 2022-12-25    889000   1232000   1080682     4
22 2022-10-25 2023-01-25   1075000   1232000   1086365     3
23 2022-11-25 2023-02-25   1000000   1075000   1037500     2
24 2022-12-25 2023-03-25   1000000   1153000   1076500     2

I am looking for help to iterate the filter and summary function over the two start and end date lists/vectors to get a data frame with min, max, med_dates and count. If the map() function in purrr would also add the start and end dates as well that would be terrific but not absolutely necessary, I know how to add those after the fact.

I have tried unsuccessfully to use slider, that does get close, but the sliding window collapses at every point there is no monthly data, then begins building the three month window anew and effectively ignores the available data in the surrounding months that do have data. Again my skills with slider are limited and although I have read all the instructions I am at a loss to get what I want using slider, or runner for that matter.

I have included the data I am using, the code I have written and an example of what I need. I would respectfully appreciate any help getting from my raw sold_data to the final sum_sold_data01_24 dataframe using purrr and one of the map() functions or in any other way.

Any other solution needs to allow the gap between the start and end dates to be user controlled, accommodate missing data, duplicate data, and allow a custom summary function to be run on each window of data. Many thanks.

Think this should work: ```sum_sold_data <- purrr::map2(start_24, end_24, \(start, end) sold_data %>% filter(close_date > start, close_date <= end) %>% n_month_summary()) %>% dplyr::bind_rows()``` — Ric, Apr 04 '23 at 23:37
Ric, it does work, it is so succinct, its quite elegant. I was hoping a purrr map would do it. I tried the map2 but it didn't work. I understand almost everything about your sweet code chunk except the " \(start, end)" part of the code. Can you please explain that part to me, what it is doing, I'm guessing it is creating another name for the start _24 and end_24 arguments to the two indexes the map is iterating over and passing them to the filter function. Sooo excited, I have been literally working on this for three years trying to solve this problem. — Joseph Harvard, Apr 05 '23 at 04:19
`\(start, end) ...` is shorthand for `function(start, end) { ... }` (braces not needed because the body of the function consist of only one sentence: a chain of pipes). so map2 takes the 2 vectors you created and pass element by element to the `function(start, end)`, then you use these arguments in filter. map2 returns a list of data.frames, finally, you bind all dfs together with `bind_rows` — Ric, Apr 05 '23 at 12:02
Thank you Ric, very clever solution, I would never have figured out the function shorthand piece. Much appreciated thank you so much. — Joseph Harvard, Apr 06 '23 at 17:53

Iterate a filter function over a data frame using start and end date variables with a Purrr map() function

0 Answers0