2

I have a panel dataset:

data <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   score1 = c(90,78,92,69,86,73,82,85,91))

> data
   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  2    1     86
6:  2    2     73
7:  3    1     82
8:  3    2     85
9:  3    3     91

I want to place a constraint such that every ID should have at least 3 years of observations. I tried using the following using the data.table package:

data[data$year >= 3, ]

However this ignores the observations in years 1 & 2 for ID's which have at least 3 years of observations. In other words, I want to only look at ID's which have at least 3 years of observations and also include the year 1 & 2 observations.

The expected output is thus:

   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
codemachino
  • 103
  • 9

2 Answers2

3

We may use a group by approach

library(dplyr)
data %>%
    group_by(ID) %>% 
    filter(n_distinct(year) >= 3) %>%
    ungroup

-output

# A tibble: 7 x 3
     ID  year score1
  <dbl> <dbl>  <dbl>
1     1     1     90
2     1     2     78
3     1     3     92
4     1     4     69
5     3     1     82
6     3     2     85
7     3     3     91

Or using data.table

library(data.table)
data <- setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]

-output

data
    ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Is it possible to do this using data.table in R? – codemachino Aug 25 '21 at 19:06
  • Thanks! What does the $V1 do here? – codemachino Aug 25 '21 at 19:11
  • I have edited my question so that I'm working with a data.table from the start. In this case, I have tried using your example but it isn't working for me: `data[, .I[uniqueN(year) >=3], ID]$V1` – codemachino Aug 25 '21 at 19:18
  • 1
    @codemachino the `setDT` converts to data.table. In your code, it is still a data.frame, whereas I added the `setDT` at the outer one. If you change your code to `setDT(data)`, it should work – akrun Aug 25 '21 at 19:19
  • I have edited the question to account for this. When I run the code: `data[, .I[uniqueN(year) >=3], ID]$V1` this creates a list rather than a data table – codemachino Aug 25 '21 at 19:20
  • 1
    @codemachino The `$V1` is the column of indexes from `.I` which is used in subsetting `data[...]` – akrun Aug 25 '21 at 19:21
  • I will keep searching for a method that works for me. Do you know of any other approaches to this problem? – codemachino Aug 25 '21 at 19:22
  • 1
    @codemachino what is your expected output as I don't see in your post – akrun Aug 25 '21 at 19:23
  • The data table in the post is only a reproducible example of my much larger dataset. It is essentially an unbalanced panel dataset and I want there to be **at least** 3 observations per subject – codemachino Aug 25 '21 at 19:24
  • 1
    @codemachino sorry, I don't understand what your expected output is. What is the issue by copying the same code in my post. – akrun Aug 25 '21 at 19:25
  • I get a rather large error message when I copy your data.table code directly. I attempted to use the code without the `setDT()` and this didn't work for me either: `data[, .I[uniqueN(year) >=3], ID]$V1` – codemachino Aug 25 '21 at 19:27
  • 1
    @codemachino that wouldn't work if your data is not `data.table` as these methods are data.table methods and it does work only with `dat.atable` i..e the code you posted will work if you have already converted to data.table i.e. `setDT(data)` first – akrun Aug 25 '21 at 19:28
  • The table is already converted to `data.table`. When I use the code: `data[, .I[uniqueN(year) >=3], ID]$V1` this creates a list rather than a data table – codemachino Aug 25 '21 at 19:29
  • 1
    @codemachino IT is a vector of index you need `data[data[, .I[uniqueN(year) >=3], ID]$V1]` – akrun Aug 25 '21 at 19:30
  • It seems this creates an error message: When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM. – codemachino Aug 25 '21 at 19:32
  • 1
    @codemachino can you show the `packageVersion('data.table')` – akrun Aug 25 '21 at 19:37
  • It is an error message. I changed the code to: `data[data[, .I[uniqueN(year) >=3], on = .(ID)]$V1]` which then creates a warning message: When on= is provided but not i=, on= must be a named list or data.table|frame, and a natural join (i.e. join on common names) is invoked. Ignoring on= which is 'list'. – codemachino Aug 25 '21 at 19:37
  • packageVersion('data.table') is '1.14.0’ – codemachino Aug 25 '21 at 19:38
  • 1
    I have the same version – akrun Aug 25 '21 at 19:39
  • 1
    @codemachino it is not an `on` join. It is group by operation on ID – akrun Aug 25 '21 at 19:39
  • 1
    @codemachino Can you change it to specifically with `by` i.e. `setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]` – akrun Aug 25 '21 at 19:40
  • The code runs okay when I use this, however the data is unchange (i.e. the constraint has not been applied) – codemachino Aug 25 '21 at 19:42
  • 1
    @codemachino it wouldn't update the original object. You may have to assign i.e. `data <- setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]` or create a new object `data1 <- setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]` – akrun Aug 25 '21 at 19:43
  • 1
    That is great, thank you so much for your help today! – codemachino Aug 25 '21 at 19:44
2

Another data.table option

> data[, .SD[uniqueN(year) >= 3], ID]
   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81