1

I have a table of house prices and sale dates. I want to calculate the rolling median price over a time window of 365 days using the runner package. I only want one median price per date.

My problem is when I try the below code, I get more than one median price for a date if that date appears more than once. This isn't what I expected to occur. I thought there'd be one result for each day if I used group_by/summarise.

library(runner)
library(tidyverse)
library(lubridate)

startDate = as_date("2018-01-01")
endDate = as_date("2020-01-01")

# Create data
soldData <- tibble(
  price = round(rnorm(100, mean=500000, sd=100000),-3),
  date = sample(seq.Date(startDate,endDate,by="days"),100,replace=T))

# Fill in the missing dates between startDate and endDate
soldData <- bind_rows(soldData,anti_join(tibble(date=seq.Date(startDate,endDate,by="day")),soldData)) %>%
  arrange(date)

# Find the duplicated dates
duplicatedDates <- soldData[duplicated(soldData$date),]$date

# I thought using group_by/summarise would return one medianPrice per date
results <- soldData %>%
  group_by(date) %>%
  summarise(medianPrice = runner(
    price,
    k = "365 days",
    idx = date,
    f = function(x) {median(x,na.rm=T)})) 

# These are the problem rows.
duplicatedResults <- results %>%
  filter(date %in% duplicatedDates)

Any idea where I'm going wrong?

Wilze
  • 53
  • 5
  • 1
    When you `group_by` date there are no observation left to roll on. Try to group by something else or don't group as @Ronak Shah proposed. – GoGonzo Aug 19 '20 at 14:10

1 Answers1

2

From dplyr 1.0.0, you can have output that returns multiple rows from summarise.

First you need to deal with duplicate data which you already have in your data. What do you want to do of dates that have multiple occurrence? One way would be to take median/mean of them.

library(dplyr)
library(runner)

soldData %>%
  group_by(date) %>%
  summarise(price = median(price, na.rm = TRUE)) -> df

So now in df we only have one value for each date. You can now apply the runner function.

df %>%
    mutate(medianPrice = runner(price,
                                k = "365 days",
                                idx = date,
                                f = function(x) {median(x,na.rm=T)})) 

There is also zoo:rollmedianr which helps in calculating rolling median.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213