data.table: Find repeating data for each group and column name. Output group and name of column that repeats

Question

I'm looking to detect "frozen" sensor data using data.table. To do this, I can check the last 3 days of data and see if, for a given column, that data is identical. I'd like the output to be the UID of the sensor that is frozen and then the name of the frozen sensor stream (i.e. "temperature", "pressure", etc.).

Example: I have the following data.table:

library(data.table)
dt <- data.table(UID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
                 report_date = rep(seq.Date(as.Date("2019/1/1"), as.Date("2019/1/6"), 'days'), 3), 
                 temperature = c(101,103,105,101,105,104,105,103,101,102,102, 102,103,101,102,105,106,111), 
                 pressure = c(10.2,10.4,10.7,10.5,10.1,10.2,10.1,10.5,10.11,10.22,10.11,10.11,10.1,10.1,10.1,10.3,10.3,10.3), 
                 voltage = c(0.023,0.025,0.011,0.013,0.0254,0.028,0.028,0.077,0.026,0.0236,0.0258,0.0214,0.089,0.034,0.0387,0.0654,0.0246,0.02133))

dt
     UID report_date temperature pressure voltage
 1:   1  2019-01-01         101    10.20 0.02300
 2:   1  2019-01-02         103    10.40 0.02500
 3:   1  2019-01-03         105    10.70 0.01100
 4:   1  2019-01-04         101    10.50 0.01300
 5:   1  2019-01-05         105    10.10 0.02540
 6:   1  2019-01-06         104    10.20 0.02800
 7:   2  2019-01-01         105    10.10 0.02800
 8:   2  2019-01-02         103    10.50 0.07700
 9:   2  2019-01-03         101    10.11 0.02600
10:   2  2019-01-04         102    10.22 0.02360
11:   2  2019-01-05         102    10.11 0.02580
12:   2  2019-01-06         102    10.11 0.02140
13:   3  2019-01-01         103    10.10 0.08900
14:   3  2019-01-02         101    10.10 0.03400
15:   3  2019-01-03         102    10.10 0.03870
16:   3  2019-01-04         105    10.30 0.06540
17:   3  2019-01-05         106    10.30 0.02460
18:   3  2019-01-06         111    10.30 0.02133

The output should yield 2 rows, 1. with ID = 2 and colID = "temperature" and 2. with ID = 3 and colID = "pressure" because temperature repeated itself 3 times in the last 3 days for ID = 2 and pressure for ID = 3.

    UID  colID
1.    2  temperature
2.    3  pressure

I can think of ways to do this with base R and looping but am at a loss for the proper data.table way.

why do you want to use only the data.table package? – Cris Jan 10 '19 at 20:21 — Cris, Jan 10 '19 at 20:21

dww · Accepted Answer · 2019-01-11T23:22:18.263

You can use

frozen = function(x) uniqueN(tail(x, 3)) == 1
dt[, .(colID = c(
  'temperature'[frozen(temperature)],
  'pressure'[frozen(pressure)])), 
  by=UID]
#    UID       colID
# 1:   2 temperature
# 2:   3    pressure

An alternative version is to melt the data before subsetting on the variables that are frozen. This could be a bit more succinct if there are many measurement variables, not just temperature and pressure as in your example.

cols = c('temperature', 'pressure')
dtm = melt(dt, id = 'UID', measure = cols)
dtm[, frozen(value), by = .(UID, variable)][(V1)]

A note on the correct test for whether the values are "frozen"

There are various ways to test for equality among elements of the same vector. Have a look at the answers here to see some options. The best approach depends on exactly what you want. Using UniqueN as I did above will check whether the values are truly frozen in the sense that the floating point numbers are identical. Depending on your specific use case and how the instrument works, You may prefer to check whether they are the "same" according to a different criterion, such as nearly the same within an absolute or a relative machine tolerance. Be sure to read the comments on this answer for some of the nuances.

If you decide to go with a different test, just substitute it where I use UniqueN above.

I appreciate the response. I was working on a solution using SDcols and lapply and having a lot of trouble. Any ideas on how one might use SDcols here so the column variables can be defined outside of the data.table function (i.e. cols = c('temperature', 'pressure', 'etc.') — coomie, Jan 11 '19 at 14:26
Sure, just use the 2nd version with `measure.vars = cols`. I edited the answer to show this. No need for any loops (such as `lapply`). I expect the melt version will be both more efficient and neater. — dww, Jan 11 '19 at 16:17

Clayton Stanley · Answer 2 · 2019-01-11T21:51:34.723

2

Should be speedy for larger datasets and UIDs:

> dt[, .SD
     ][, melt(.SD, id.vars=c('UID', 'report_date'))
     ][order(UID, variable, -report_date)
     ][, rowid := rowid(UID, variable)
     ][rowid <= 3
     ][, .N, .(UID, variable, value)
     ][N == 3
     ][, setnames(.SD, 'variable', 'colID')
     ][, N := NULL
     ][, value := NULL
     ][]

   UID       colID
1:   2 temperature
2:   3    pressure
>

edited Jan 11 '19 at 21:51

answered Jan 11 '19 at 20:48

Clayton Stanley

7,513
9
32
46

data.table: Find repeating data for each group and column name. Output group and name of column that repeats

2 Answers2

A note on the correct test for whether the values are "frozen"