
I'm working with a large dataset (400M observations) of cars. Each entry has a VIN number, identifying it as a specific car. I group the dataset by VIN, so each group is 1 car (usually with multiple entries). I'm currently using the deployr package (I recently learned R using tidyverse).

I am trying to remove all observations where a vehicle increases in price as the car gets older (appreciates in value over time).

Let's look at 1 car as an example; I've removed all columns besides the relevant data_year (year in which the data was recorded) and veh_price (vehicle price) below:

ex_car <- structure(list(data_year = c(2003, 2008, 2009, 2009, 2010, 2012, 
2013, 2015), veh_price = c(2341, 50, 50, 100, 2600, 450, 3750, 
23000)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

The rows are already sorted by data_year (ascending order). As mentioned above, I am attempting to filter out any rows where the vehicle price increases from the previous row.

data_year     veh_price
2003          2341
2008          50
2009          50
2009          100
2010          2600
2012          450
2013          3750
2015          23000

the desired output:

data_year     veh_price
2003          2341  
2008          50        
2009          50            

I considered using mutate and lead to create a new row of the next value, calculating the difference, and then filtering all negative values (see below)

filtered_sample <- ex_car |>
  mutate(
    next_price = lead(veh_price),
    diff_p = veh_price - next_price
  ) |>
  filter(diff_p >= 0)

However, I'm concerned that adding 2 columns is too inefficient to run on the large dataset. Is there a better, more efficient way of solving this problem? (I've heard data.table is optimized for larger operations; is there a way to use that package to achieve the same goal?)

When I first approached this issue, I thought of iterating once through the vector, storing a running lowest_price variable (initialized to the first value in the vector): if the next price is lower than or equal to lowest_price (<=), set it as the new lowest_price; if it is greater (>), remove the value.

However, I couldn't figure out how to do this well in R.
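In pseudocode-ish base R, the idea would look roughly like this (a sketch for a single car's price vector, not the grouped dataset):

```r
# Sketch of the single-pass idea for one car's prices: keep a row only
# if its price does not exceed the lowest price seen so far.
prices <- c(2341, 50, 50, 100, 2600, 450, 3750, 23000)

keep <- logical(length(prices))          # all FALSE to start
lowest_price <- prices[1]
for (i in seq_along(prices)) {
  if (prices[i] <= lowest_price) {       # lower or equal: new running minimum
    lowest_price <- prices[i]
    keep[i] <- TRUE
  }                                      # greater: row stays marked FALSE
}
prices[keep]
# [1] 2341   50   50
```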

Thanks for your time, please let me know if I failed to comprehensively describe the problem or if you need any more information.

chrishtatu
  • Side note: how does `deployr` relate to this question? – r2evans Aug 17 '23 at 15:36
  • I was just including the packages I had loaded, I'm still fairly new to R and still blurry on which functions I'm using are just base R vs added packages. This is my first question on stackoverflow so didn't want to get flamed for excluding any potentially important info. – chrishtatu Aug 17 '23 at 17:33
  • There is a balance between *"tell us the packages you're using"* (so that we know which functions you are calling) and *"tell us every package that you load, regardless of use here"*. Sometimes extras can be perhaps red herrings or distracting (as here), sometimes they are the cause of problems (e.g., collisions and function masking). Thanks! – r2evans Aug 17 '23 at 17:48

1 Answer


UPDATE: x == cummin(.) might be prone to R FAQ 7.31, the classic problem with equality tests on high-precision floating-point numbers. While not likely with this data, if the price-like variable has high precision and/or you want to be a bit safer in this regard, use @GregorThomas's suggested alternative of diff(.) <= 0, though this needs to be made cumulative somehow:

# dplyr
ex_car %>%
  group_by(VIN) %>%
  filter(cumall(c(TRUE, diff(veh_price) <= 0)))
# data.table
as.data.table(ex_car)[, .SD[c(TRUE, cumsum(diff(veh_price) > 0) == 0),], by = .(VIN)]
# base
ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) c(TRUE, cumsum(diff(z) > 0) == 0)))
ex_car[ind,]
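To see why the equality test can bite, here is a quick base-R illustration of FAQ 7.31 (the values are made up, not from the question's data):

```r
# 0.1 + 0.2 is not stored exactly as 0.3 in binary floating point,
# so an exact equality test against the running minimum can fail:
x <- c(0.3, 0.1 + 0.2)
x == cummin(x)
# [1]  TRUE FALSE    # the second "0.3" is fractionally above the minimum
abs(x - cummin(x)) < 1e-8   # tolerance-based comparison is safe
# [1] TRUE TRUE
```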

We can use cummin here (relying on non-decreasing data_year). (This is for cumulative min, not a mis-spelling of the spice cumin ;-)
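To make the mechanics concrete, here is cummin applied to the example's price vector before any filtering:

```r
veh_price <- c(2341, 50, 50, 100, 2600, 450, 3750, 23000)
cummin(veh_price)                  # running minimum so far
# [1] 2341   50   50   50   50   50   50   50
veh_price == cummin(veh_price)     # TRUE only where the price is a new (or tied) low
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
```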

dplyr

library(dplyr)
ex_car %>%
  filter(veh_price == cummin(veh_price))
# # A tibble: 3 × 2
#   data_year veh_price
#       <dbl>     <dbl>
# 1      2003      2341
# 2      2008        50
# 3      2009        50

Grouping by your VIN:

ex_car %>%
  group_by(VIN) %>%
  filter(veh_price == cummin(veh_price))
### or with dplyr_1.1.0 or newer
ex_car %>%
  filter(veh_price == cummin(veh_price), .by = VIN)
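Since the posted ex_car has no VIN column, here is a hypothetical two-car frame (the VINs are made up for illustration) showing the grouped filter end-to-end:

```r
library(dplyr)

# Hypothetical two-car data; VINs "1A" and "2B" are invented for this demo
two_cars <- tibble(
  VIN       = c("1A", "1A", "1A", "2B", "2B", "2B"),
  data_year = c(2003, 2008, 2009, 2010, 2011, 2012),
  veh_price = c(2341, 50, 100, 9000, 8500, 8600)
)

two_cars %>%
  filter(veh_price == cummin(veh_price), .by = VIN)
# VIN "1A" keeps 2341 and 50 (100 is an increase);
# VIN "2B" keeps 9000 and 8500 (8600 is an increase)
```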

data.table

library(data.table)
as.data.table(ex_car)[ veh_price == cummin(veh_price),]
#    data_year veh_price
#        <num>     <num>
# 1:      2003      2341
# 2:      2008        50
# 3:      2009        50

Grouping by your VIN:

as.data.table(ex_car)[, .SD[veh_price == cummin(veh_price)], by = .(VIN)]

base R

ind <- with(ex_car, veh_price == cummin(veh_price))
ex_car[ind,]
# # A tibble: 3 × 2
#   data_year veh_price
#       <dbl>     <dbl>
# 1      2003      2341
# 2      2008        50
# 3      2009        50

Grouping by your VIN requires the addition of ave:

ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) z == cummin(z)))
ex_car[ind,]
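As a self-contained run, using a hypothetical two-car frame (the VINs are made up for illustration):

```r
# Hypothetical two-car data; VINs "1A" and "2B" are invented for this demo
two_cars <- data.frame(
  VIN       = c("1A", "1A", "1A", "2B", "2B", "2B"),
  data_year = c(2003, 2008, 2009, 2010, 2011, 2012),
  veh_price = c(2341, 50, 100, 9000, 8500, 8600)
)

# ave() returns a numeric 0/1 vector here; !! coerces it back to logical
ind <- with(two_cars, !!ave(veh_price, VIN, FUN = function(z) z == cummin(z)))
two_cars[ind, ]
#   VIN data_year veh_price
# 1  1A      2003      2341
# 2  1A      2008        50
# 4  2B      2010      9000
# 5  2B      2011      8500
```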
r2evans