-1

I am new to R and I have a problem at hand. I basically want to create a dataframe that contains a dummy variable for every year somebody has had a house. 0 for every year he did not sell, and 1 for the year he did sell. Next to this I need the year iteslf, and the year the house was bought in, because I have a seperate dataset with the housing price index for every year. I guess this would be done best with some sort of loop, since I have 40.000 transactions

Example data I have now

Buy Sold
1620 1624
1622 1628

What I then need it to become

dummy year bought current year
0 1620 1621
0 1620 1622
0 1620 1623
1 1620 1624
0 1622 1623
0 1622 1624
0 1622 1625
0 1622 1626
0 1622 1627
1 1622 1628

Then ultimately I also need another column with the price difference between the housing price index in current year - the price index of the year the house was bought. I do have a seperate dataset with the price index for each year. I don't know how to do that but I guess that would be relatively easy once I have this data figured out. Thanks in advance!

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45

3 Answers3

0

A tidyverse option -

library(tidyverse)

df %>%
  mutate(current_year = map2(Buy + 1, Sold, seq)) %>%
  unnest(current_year) %>%
  mutate(dummy = as.integer(Sold == current_year), .before = 1) %>%
  select(-Sold, Year_bought = Buy)

#   dummy Year_bought current_year
#   <int>       <int>        <int>
# 1     0        1620         1621
# 2     0        1620         1622
# 3     0        1620         1623
# 4     1        1620         1624
# 5     0        1622         1623
# 6     0        1622         1624
# 7     0        1622         1625
# 8     0        1622         1626
# 9     0        1622         1627
#10     1        1622         1628

map2 creates a sequence between Buy and Sold, create a dummy column which is 1 when current_year is same as year sold.

data

df <- structure(list(Buy = c(1620L, 1622L), Sold = c(1624L, 1628L)),
      class = "data.frame", row.names = c(NA, -2L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

A data.table solution:

library(data.table)

data <- data.table(Buy = c(1620, 1622), Sell = c(1624, 1628))
data <- data[, .(`year bought` = Buy, `current year` = seq(Buy, Sell, by = 1)), , 
             .(grp = 1:nrow(data))][, grp := NULL]
data[, dummy := ifelse(`current year` == max(`current year`), 1, 0), by=.(`year bought`)]

I do want to add that it will become problematic if there is more than 1 house with the same buy year or if there are houses that are not sold yet. If either of these exist in your data I can modify the code somewhat.

koolmees
  • 2,725
  • 9
  • 23
0

One more approach

df <- structure(list(Buy = c(1620L, 1622L), Sold = c(1624L, 1628L)),
                class = "data.frame", row.names = c(NA, -2L))

library(tidyverse)
df %>% group_by(grp = data.table::rleid(Buy)) %>%
  uncount(Sold - Buy) %>%
  mutate(current_year = first(Buy) + row_number()) %>%
  ungroup() %>%
  mutate(Sold = +(Sold == current_year)) %>%
  select(dummy = Sold, year_bought = Buy, current_year)
#> # A tibble: 10 x 3
#>    dummy year_bought current_year
#>    <int>       <int>        <int>
#>  1     0        1620         1621
#>  2     0        1620         1622
#>  3     0        1620         1623
#>  4     1        1620         1624
#>  5     0        1622         1623
#>  6     0        1622         1624
#>  7     0        1622         1625
#>  8     0        1622         1626
#>  9     0        1622         1627
#> 10     1        1622         1628

Created on 2021-06-11 by the reprex package (v2.0.0)

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45