binning, grouping data in r using values withing a specific range to determine an event or change for survival analysis

Question

I have a dataframe as follows:

df <- data.frame(as.date=c("14/06/2016","15/06/2016","16/06/2016","17/06/2016","18/06/2016","19/06/2016","20/06/2016","21/06/2016","22/06/2016","23/06/2016",
                    "24/06/2016","04/07/2016","05/07/2016","06/07/2016","07/07/2016","08/07/2016","09/07/2016","10/07/2016","11/07/2016","12/07/2016",
                    "13/07/2016","14/07/2016","15/07/2016","17/07/2016","18/07/2016","19/07/2016","20/07/2016","21/07/2016","22/07/2016","01/08/2016",
                    "02/08/2016","03/08/2016","04/08/2016","05/08/2016","06/08/2016","07/08/2016","08/08/2016","09/08/2016","10/08/2016","11/08/2016",
                    "12/08/2016","13/08/2016","14/08/2016","15/08/2016","16/08/2016","17/08/2016","18/08/2016","19/08/2016","20/08/2016","21/08/2016",
                    "22/08/2016","23/08/2016","24/08/2016","25/08/2016","26/08/2016","27/08/2016","28/08/2016","29/08/2016","30/08/2016","31/08/2016",
                    "01/09/2016","02/09/2016","03/09/2016","04/09/2016","05/09/2016","06/09/2016","07/09/2016","08/09/2016","09/09/2016","10/09/2016",
                    "11/09/2016","12/09/2016","13/09/2016","14/09/2016","15/09/2016","16/09/2016","17/09/2016","18/09/2016","19/09/2016","20/09/2016"),
             wear=c("0","55","0","0","0","0","8","8","15","25","30","37","43","49","52","52","55","57","57","61","67","69","2","2","7",
                    "10","13","14","16","16","19","22","22","24","25","26","29","29","33","34","34","36","38","44","45","48","50","55",
                    "56","58","0","4","0","4","4","6","9","9","12","14","16","17","25","25","33","36","44","46","48","52","55","59",
                    "8","9","9","12","24","33","36","44"))

the data is an example of wear rate on a type of metal on a machine, it increases over time them drops to 0, indicating an event or a change,

but the problem that I have is that the wear value doesn't drop off to 0, as you can see from the data, there are 2 variables

as.date = date over time, wear = wear of metal on a part over time

RANGE in between changes are: 55-0, 60-2, 58-0, 59-8

when it drops from a large number to 0 it is easy to code,I use the following code to change,and add new variables called Status & id

{Creates 2 new columns status & id
prop.table(table(df$Status))
prop.table(table(df$Status),1) # creates new coulmn called status
df$Status <- 0# fills in column status with all zeros
df$Status[wear > -10 & wear == 0] <- 1 # fill in 1s when wear = 0
prop.table(table(df$Status))
prop.table(table(df$Status),1) # creates new coulmn called status
df$id <-1# fills in column status with '1's

for(i in 2:nrow(df)){
  if(df$Status[i-1]==0){
    df$id[i]=df$id[i-1]
  }
  else {
    df$id[i]=df$id[i-1]+1
  }
}
}

it will work OK to catch a drop in wear values to 0 but when there isn't, as in the data examples, the wear drops take place from 55-0, 69-2, 58-0, 59-8, within the real data set sometimes there are occasions when the drop in wear values will be negative, not sure on correct way to achieve this, I tried messing around with binning and grouping the data but was unsuccessful.

this is a sample of the data, in the real data set there are 100+ events, mostly a wear value drop to 0 but between 10-20 occasions either dropping to negative values or a values < 10.

score 0 · Answer 1 · answered Dec 05 '17 at 13:50

I think for-loop is inefficient. We can do something like this using the dplyr and lubridate package.

library(dplyr)
library(lubridate)

df2 <- df %>%
  # Convert the as.date column to date class
  # Convert the wear column to numeric 
  mutate(as.date = dmy(as.date), 
         wear = as.numeric(as.character(wear))) %>%
  # Create column show the wear of previous record
  mutate(wear2 = lag(wear)) %>%
  mutate(Diff = wear - wear2)

The idea is to shift the wear column by 1, and then calculate the difference between the wear of the date and the previous date. the results are saved in the new column as Diff. Here is what the new data frame looks like.

head(df2)
#      as.date wear wear2 Diff
# 1 2016-06-14    0    NA   NA
# 2 2016-06-15   55     0   55
# 3 2016-06-16    0    55  -55
# 4 2016-06-17    0     0    0
# 5 2016-06-18    0     0    0
# 6 2016-06-19    0     0    0

After this, you can define a threshold in Diff to filter out an end of a period. For example, here I defined the threshold to be -50. You can see that the filter function successfully identify four periods.

# Filter Diff <= -50
df2 %>% filter(Diff <= -50)
#      as.date wear wear2 Diff
# 1 2016-06-16    0    55  -55
# 2 2016-07-15    2    69  -67
# 3 2016-08-22    0    58  -58
# 4 2016-09-13    8    59  -51

One final note, in your original data frame, the wear column is in factor, but you calculate it as numeric. This is dangerous. I used wear = as.numeric(as.character(wear)) to convert the column to numeric, but it would be great if you can create the numeric column in the first place.

OK, get what you are doing but on occasions the wear value can be as low as 45 and as high as 78 doesn't always be around 55, would it be better to do maybe 2 or 3 filters ? just thinking how this could change the status to change — nigel griffin, Dec 05 '17 at 14:01
Of course you can decide what kinds of filtering strategy can best suit your needs. — www, Dec 05 '17 at 14:04

binning, grouping data in r using values withing a specific range to determine an event or change for survival analysis

1 Answers1