
I'm working with a large dataset (400M observations) of cars. Each entry has a VIN number, identifying it as a specific car. I group the dataset by VIN, so each group is 1 car (usually with multiple entries). I'm currently using the deployr package (I recently learned R using tidyverse).

I am trying to remove all observations where a vehicle increases in price as the car gets older (appreciates in value over time).

Let's look at 1 car as an example; I've removed all columns besides the relevant data_year (year in which the data was recorded) and veh_price (vehicle price) below:

ex_car <- structure(list(data_year = c(2003, 2008, 2009, 2009, 2010, 2012, 
2013, 2015), veh_price = c(2341, 50, 50, 100, 2600, 450, 3750, 
23000)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

The rows are already sorted by data_year (ascending order). As mentioned above, I am attempting to filter out any rows where the vehicle price increases from the previous row.

data_year     veh_price
2003          2341
2008          50
2009          50
2009          100
2010          2600
2012          450
2013          3750
2015          23000

the desired output:

data_year     veh_price
2003          2341  
2008          50        
2009          50            

I considered using mutate and lead to create a new row of the next value, calculating the difference, and then filtering all negative values (see below)

filtered_sample <- ex_car |>
  mutate(
    next_price = lead(veh_price),
    diff_p = veh_price - next_price
  ) |>
  filter(diff_p >= 0)

However, I'm concerned that adding 2 columns is too inefficient to run on the large dataset. Is there a better, more efficient way of solving this problem? (I've heard data.table is optimized for larger operations; is there a way to use that package to achieve the same goal?)

When I first approached this issue, I thought of iterating once through the vector, storing a running lowest_price variable (initialized to the first value in the vector): if the next price is lower than or equal to lowest_price (<=), set it as the new lowest_price; if it is greater (>), remove the value.

However, I couldn't figure out how to do this well in R.
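In pseudocode-ish base R, the idea would look roughly like this (a sketch for a single car's price vector, not the grouped dataset):

```r
# Sketch of the single-pass idea for one car's prices: keep a row only
# if its price does not exceed the lowest price seen so far.
prices <- c(2341, 50, 50, 100, 2600, 450, 3750, 23000)

keep <- logical(length(prices))          # all FALSE to start
lowest_price <- prices[1]
for (i in seq_along(prices)) {
  if (prices[i] <= lowest_price) {       # lower or equal: new running minimum
    lowest_price <- prices[i]
    keep[i] <- TRUE
  }                                      # greater: row stays marked FALSE
}
prices[keep]
# [1] 2341   50   50
```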

Thanks for your time, please let me know if I failed to comprehensively describe the problem or if you need any more information.

chrishtatu
  • Side note: how does `deployr` relate to this question? – r2evans Aug 17 '23 at 15:36
  • I was just including the packages I had loaded, I'm still fairly new to R and still blurry on which functions I'm using are just base R vs added packages. This is my first question on stackoverflow so didn't want to get flamed for excluding any potentially important info. – chrishtatu Aug 17 '23 at 17:33
  • There is a balance between *"tell us the packages you're using"* (so that we know which functions you are calling) and *"tell us every package that you load, regardless of use here"*. Sometimes extras can be perhaps red herrings or distracting (as here), sometimes they are the cause of problems (e.g., collisions and function masking). Thanks! – r2evans Aug 17 '23 at 17:48

1 Answer


UPDATE: x == cummin(.) might be prone to R FAQ 7.31, the classic problem with equality tests on high-precision floating-point numbers. While not likely with this data, if the price-like variable has high precision and/or you want to be a bit safer in this regard, use @GregorThomas's suggested alternative of diff(.) <= 0, though this needs to be made cumulative somehow:

# dplyr
ex_car %>%
  group_by(VIN) %>%
  filter(cumall(c(TRUE, diff(veh_price) <= 0)))
# data.table
as.data.table(ex_car)[, .SD[c(TRUE, cumsum(diff(veh_price) > 0) == 0),], by = .(VIN)]
# base
ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) c(TRUE, cumsum(diff(z) > 0) == 0)))
ex_car[ind,]
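To see why the equality test can bite, here is a quick base-R illustration of FAQ 7.31 (the values are made up, not from the question's data):

```r
# 0.1 + 0.2 is not stored exactly as 0.3 in binary floating point,
# so an exact equality test against the running minimum can fail:
x <- c(0.3, 0.1 + 0.2)
x == cummin(x)
# [1]  TRUE FALSE    # the second "0.3" is fractionally above the minimum
abs(x - cummin(x)) < 1e-8   # tolerance-based comparison is safe
# [1] TRUE TRUE
```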

We can use cummin here (relying on non-decreasing data_year). (This is for cumulative min, not a mis-spelling of the spice cumin ;-)
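To make the mechanics concrete, here is cummin applied to the example's price vector before any filtering:

```r
veh_price <- c(2341, 50, 50, 100, 2600, 450, 3750, 23000)
cummin(veh_price)                  # running minimum so far
# [1] 2341   50   50   50   50   50   50   50
veh_price == cummin(veh_price)     # TRUE only where the price is a new (or tied) low
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```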

dplyr

library(dplyr)
ex_car %>%
  filter(veh_price == cummin(veh_price))
# # A tibble: 3 × 2
#   data_year veh_price
#       <dbl>     <dbl>
# 1      2003      2341
# 2      2008        50
# 3      2009        50

Grouping by your VIN:

ex_car %>%
  group_by(VIN) %>%
  filter(veh_price == cummin(veh_price))
### or with dplyr_1.1.0 or newer
ex_car %>%
  filter(veh_price == cummin(veh_price), .by = VIN)
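Since the posted ex_car has no VIN column, here is a hypothetical two-car frame (the VINs are made up for illustration) showing the grouped filter end-to-end:

```r
library(dplyr)

# Hypothetical two-car data; VINs "1A" and "2B" are invented for this demo
two_cars <- tibble(
  VIN       = c("1A", "1A", "1A", "2B", "2B", "2B"),
  data_year = c(2003, 2008, 2009, 2010, 2011, 2012),
  veh_price = c(2341, 50, 100, 9000, 8500, 8600)
)

two_cars %>%
  filter(veh_price == cummin(veh_price), .by = VIN)
# VIN "1A" keeps 2341 and 50 (100 is an increase);
# VIN "2B" keeps 9000 and 8500 (8600 is an increase)
```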

data.table

library(data.table)
as.data.table(ex_car)[ veh_price == cummin(veh_price),]
#    data_year veh_price
#        <num>     <num>
# 1:      2003      2341
# 2:      2008        50
# 3:      2009        50

Grouping by your VIN:

as.data.table(ex_car)[, .SD[veh_price == cummin(veh_price)], by = .(VIN)]

base R

ind <- with(ex_car, veh_price == cummin(veh_price))
ex_car[ind,]
# # A tibble: 3 × 2
#   data_year veh_price
#       <dbl>     <dbl>
# 1      2003      2341
# 2      2008        50
# 3      2009        50

Grouping by your VIN requires the addition of ave:

ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) z == cummin(z)))
ex_car[ind,]
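As a self-contained run, using a hypothetical two-car frame (the VINs are made up for illustration):

```r
# Hypothetical two-car data; VINs "1A" and "2B" are invented for this demo
two_cars <- data.frame(
  VIN       = c("1A", "1A", "1A", "2B", "2B", "2B"),
  data_year = c(2003, 2008, 2009, 2010, 2011, 2012),
  veh_price = c(2341, 50, 100, 9000, 8500, 8600)
)

# ave() returns a numeric 0/1 vector here; !! coerces it back to logical
ind <- with(two_cars, !!ave(veh_price, VIN, FUN = function(z) z == cummin(z)))
two_cars[ind, ]
#   VIN data_year veh_price
# 1  1A      2003      2341
# 2  1A      2008        50
# 4  2B      2010      9000
# 5  2B      2011      8500
```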
r2evans