Filtering (temporally similar) duplicates while keeping max value based on another column in R

Question

I'm hoping to clean a time series dataset so that only the maximum value of each event is retained. To start, I filtered the data so that only values above a certain threshold are kept, but there are still values that, while separated by only a millisecond or two, are effectively duplicates of the same event and will throw off later analysis.

My initial dataset has >100,000 rows and a few more columns but here is the top of a smaller version.

head(shortfilter)
  Time (Sec) ECG (Channel 6)
1   5534.023        1.371761
2   5534.024        1.232424
3   5534.152        1.414432
4   5534.153        1.359914
5   5534.272        1.639033
6   5534.396        1.476161

Explained: I don't have a concrete time window that rows must fall within to be considered duplicates, but the rest of the data is similar to this in that duplicates are generally within 0.003 s of each other.

 Time (Sec) ECG (Channel 6)
1   5534.023        1.371761 #<-- Higher value (keep)
2   5534.024        1.232424
3   5534.152        1.414432 #<-- Higher value (keep)
4   5534.153        1.359914
5   5534.272        1.639033 #<-- Only value (keep)
6   5534.396        1.476161 #<-- Only value (keep)

Ideal:

 Time (Sec) ECG (Channel 6)
1   5534.023        1.371761
2   5534.152        1.414432
3   5534.272        1.639033
4   5534.396        1.476161
5   ____.___        _.______
6   ____.___        _.______

Here is my initial attempt at some conditionals to do this. Keep in mind that I'm new to coding in general, so I know it isn't remotely correct; I just wanted to get some ideas out there and hopefully give some additional information on what I'm trying to do. I'm positive the formatting and syntax are complete gibberish, but I'm sure many of you will understand what I was going for.

for (i in shortfilter$`Time (Sec)`){
  for (j in shortfilter$`ECG (Channel 6)`){
    if ((i+1)-i > 0.01 && j > j+1){
      remove(j+1)
    } else if ((i+1)-i > 0.01 && j < j+1){
      remove(j)
    }
  }
}

dcsuka · Accepted Answer · 2022-07-19T18:12:22.627

Welcome to Stack Overflow! My solution computes the gap between each row's time and the next row's time, then gives rows whose gap is below a threshold the same group number. As written, it can handle runs of up to five consecutive near-duplicate rows, but you can easily handle longer runs by repeating the last `Num` line.

library(tidyverse)

tibble::tribble(
   ~`Time`, ~`ECG`,
  5534.023, 1.371761,
  5534.024, 1.232424,
  5534.025, 1.27,
  5534.026, 1.28,
  5534.152, 1.414432,
  5534.153, 1.359914,
  5534.272, 1.639033,
  5534.396, 1.476161
  ) %>%
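  arrange(Time) %>%
  # sim_val: gap to the next row; the last row gets a large dummy gap (5)
  mutate(sim_val = if_else(!is.na(lead(Time)), lead(Time) - Time, 5),
         # Rows close to their successor take the successor's row number
         Num = if_else(sim_val <= 0.03, row_number() + 1, as.numeric(row_number())),
         # Each repeat below lets a run that is one row longer share one Num
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num)) %>%
  # Within each group, put the highest ECG first and keep only that row
  arrange(Num, desc(ECG)) %>%
  group_by(Num) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(Time, ECG)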

Also, feel free to fine-tune the threshold of 0.03 to your data. Let me know if this works!
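
For runs of any length, one possible alternative (not the approach used above) is to start a new group whenever the gap to the previous row exceeds the threshold, then keep the highest ECG value in each group. This is only a minimal sketch, assuming the same example data and the same 0.03 s threshold; the names `ecg`, `threshold`, `event`, and `peaks` are made up for illustration.

```r
library(dplyr)

# Example rows copied from the answer above
ecg <- tibble::tribble(
  ~Time,    ~ECG,
  5534.023, 1.371761,
  5534.024, 1.232424,
  5534.025, 1.27,
  5534.026, 1.28,
  5534.152, 1.414432,
  5534.153, 1.359914,
  5534.272, 1.639033,
  5534.396, 1.476161
)

threshold <- 0.03  # assumed maximum gap (seconds) within one event

peaks <- ecg %>%
  arrange(Time) %>%
  # The first row always opens an event; after that, a new event starts
  # whenever the gap to the previous row is larger than the threshold
  mutate(event = cumsum(c(TRUE, diff(Time) > threshold))) %>%
  group_by(event) %>%
  # Keep the single row with the largest ECG value in each event
  slice_max(ECG, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Time, ECG)

peaks
```

Because the grouping is driven by a running count of large gaps, it does not depend on how many near-duplicate rows occur in a row. Note that the sketch assumes the data has at least one row.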

edited Jul 19 '22 at 18:12

answered Jul 19 '22 at 17:38

Great, thanks so much! I haven't yet searched the dataset to see what the max number of consecutive duplicates is, but I'm quite sure it won't be more than 3-4. I much appreciate your help though, and I'll let you know how it works a little later! – Tyler for R Jul 19 '22 at 18:34
Sure! I am happy to help. – dcsuka Jul 19 '22 at 18:35
Just tried it and I'm able to run your exact code of course, but when I try to use your solution on the data as I have it, I'm getting this error: Error in `mutate()`: ! Problem while computing `sim_val = if_else(!is.na(lead(Time)), lead(Time) - Time, 5)`. Caused by error in `lead(Time) - Time`: ! non-numeric argument to binary operator. Any idea how I may go about fixing this? Seems similar to some errors I was getting earlier as well. – Tyler for R Jul 19 '22 at 19:03
Can you send me the dput(head(data))? Make sure that all of the values are numeric with `%>% mutate(across(everything(), as.numeric)) %>%`. I get that same error if I convert a column to character beforehand. – dcsuka Jul 19 '22 at 22:46
Yes, that was the issue, once I converted the values to numerics it all worked. Thanks again! – Tyler for R Jul 20 '22 at 14:14
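
The type fix discussed in the comments can be sketched as follows. The comments only confirm that converting the values to numeric resolved the error; that the columns had been read in as character is an assumption here, and the small `shortfilter` table below is a hypothetical stand-in for the real data. The rename step is also an addition, included because the answer's code refers to the columns as `Time` and `ECG`.

```r
library(dplyr)

# Hypothetical stand-in for data whose columns were read in as text
shortfilter <- tibble::tibble(
  `Time (Sec)`      = c("5534.023", "5534.024", "5534.152"),
  `ECG (Channel 6)` = c("1.371761", "1.232424", "1.414432")
)

# Inspect the column classes; "character" here would explain the
# "non-numeric argument to binary operator" error
sapply(shortfilter, class)

cleaned <- shortfilter %>%
  # Convert every column to numeric, as suggested in the comment above
  mutate(across(everything(), as.numeric)) %>%
  # Rename to the short column names used in the answer's code
  rename(Time = `Time (Sec)`, ECG = `ECG (Channel 6)`)

sapply(cleaned, class)
```

Values that cannot be parsed as numbers become `NA` with a warning, so it is worth checking for unexpected `NA`s after the conversion.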