2

I have a grouped dataframe;

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
DF <- data.frame(Truck, OilChanged, Odometer)

# Truck OilChanged Odometer
# 1      A       True     1000
# 2      A     NewOil     1000
# 3      A      False     2000
# 4      A      False     3000
# 5      B      False      700
# 6      B      False      800
# 7      B      False      900
# 8      B      False     1000
# 9      C       True    20000
# 10     C     NewOil    20000
# 11     C       True    30000
# 12     C     NewOil    30000

I'm trying to infer the age of the oil (in kilometers) when possible. Inference is only possible once an oil change has occurred. If no oil change occurs the age of the oil will remain a mystery (example: Truck B).

Below is the desired result;

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000, 3000,700,800,900,1000,20000,20000,30000,30000)
OilAge <- c(NA,0,1000,2000,NA,NA,NA,NA,NA,0,10000,0)
Result <- data.frame(Truck, OilChanged, Odometer, OilAge)


# Truck OilChanged Odometer OilAge
# 1      A       True     1000     NA
# 2      A     NewOil     1000      0
# 3      A      False     2000   1000
# 4      A      False     3000   2000
# 5      B      False      700     NA
# 6      B      False      800     NA
# 7      B      False      900     NA
# 8      B      False     1000     NA
# 9      C       True    20000     NA
# 10     C     NewOil    20000      0
# 11     C       True    30000  10000
# 12     C     NewOil    30000      0

Note: The odometer reading between a True oilchanged row and it's following NewOil row will always be the same. Because an oil sample is taken directly before the oil is changed. But both rows must be maintained for downstream calculations to function properly, such as rate-of-change formulas.

NA in the OilAge column means the age is a mystery.

Brad
  • 580
  • 4
  • 19

2 Answers2

1

Please let me know if this solution works for you.

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
DF <- data.frame(Truck, OilChanged, Odometer)

DF %>%
  group_by(Truck) %>%
  mutate(status = length(unique(OilChanged)),
         OilAge = ifelse(OilChanged == "NewOil", 0,
                         ifelse(OilChanged == "False", Odometer - (Odometer - lag(Odometer)),
                                ifelse(OilChanged == "True", Odometer - lag(Odometer), NA)))) %>%
  mutate(OilAge = ifelse(status !=1, OilAge, NA)) %>%
  subset(select = c(Truck, OilChanged, Odometer, OilAge))
Brad
  • 580
  • 4
  • 19
DJS
  • 73
  • 6
1

An alternative approach

DF %>% group_by(Truck)  %>%
  mutate(d = cumsum(OilChanged == 'NewOil')) %>%
  group_by(Truck, d) %>%
  mutate(OilAge = cumsum(c(0*NA^(as.logical(!(first(d)))), diff(NA^(as.logical(!d))*Odometer))))

# A tibble: 12 x 5
# Groups:   Truck, d [6]
   Truck OilChanged Odometer     d OilAge
   <chr> <chr>         <dbl> <int>  <dbl>
 1 A     True           1000     0     NA
 2 A     NewOil         1000     1      0
 3 A     False          2000     1   1000
 4 A     False          3000     1   2000
 5 B     False           700     0     NA
 6 B     False           800     0     NA
 7 B     False           900     0     NA
 8 B     False          1000     0     NA
 9 C     True          20000     0     NA
10 C     NewOil        20000     1      0
11 C     True          30000     1  10000
12 C     NewOil        30000     2      0

d is a dummy variable which you may `unselect after understanding what's has been done

Brad
  • 580
  • 4
  • 19
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45