I'm working with a large dataset (400M observations) of cars. Each entry has a VIN number, identifying it as a specific car. I group the dataset by VIN, so each group is 1 car (usually with multiple entries). I'm currently using the deployr package (I recently learned R using tidyverse).
I am trying to remove all observations where a vehicle increases in price as the car gets older (appreciates in value over time).
Let's look at 1 car as an example, I've removed all columns besides the relevant data_year (year in which the data was recorded) and veh_price (vehicle price) below:
ex_car <- structure(list(data_year = c(2003, 2008, 2009, 2009, 2010, 2012,
2013, 2015), veh_price = c(2341, 50, 50, 100, 2600, 450, 3750,
23000)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))
The columns are already sorted by data_year (descending order). As mentioned above, I am attempting to filter out any rows where the vehicle price increases from the previous row.
data_year veh_price
2003 2341
2008 50
2009 50
2009 100
2010 2600
2012 450
2013 3750
the desired output:
data_year veh_price
2003 2341
2008 50
2009 50
I considered using mutate
and lead
to create a new row of the next value, calculating the difference, and then filtering all negative values (see below)
filtered_sample <- ex_car|>
mutate(
next_price = lead(veh_price),
diff_p = veh_price - next_price,
) |>
filter( diff_p >= 0)
However I'm concerned adding 2 columns is too inefficient to run on the large dataset. Is there a better, more efficient way of solving this problem? (I've heard data.table is optimized for larger operations, is there a way to use this package to do achieve the same goal?)
When I was first approaching this issue, I thought of iterating once through the vector, storing a temporary global lowest_price
variable (initialized as the first value in the vector). if the next price is lower or equal (<=), set it as the new lowest_price
, if it is greater (>), remove the value.
However, I couldn't figure out how to do this well in R.
Thanks for your time, please let me know if I failed to comprehensively describe the problem or if you need any more information.