2

I have a longitudinal dataset with ID, Wave (Wave1-4), and Score. Here's sample data with the same structure. The length of the original data is around 2000, with 500 participants total, put in long form.

   ID   Wave Score
 1 1001 1    28
 2 1001 2    27 
 3 1001 3    28
 4 1001 4    26
 5 1002 1    30
 6 1002 3    30
 7 1003 1    30
 8 1003 2    30
 9 1003 3    29
 10 1003 4   28
 11 1004 1   22
 12 1005 1   20
 13 1005 2   18
 14 1006 1   22
 15 1006 2   23
 16 1006 3   25
 17 1006 4   19

I would like to select the 'ID's with all four measurements of 'Score' available. In other words, I want to select rows of the participants with 'Score' available for all 4 waves. I've been trying to select rows with 'ID's that have data in all 'Wave's. My tryout so far has been based on this idea: if a participant has all four measurements, the ID will appear in the data four times. That's why I tried to count the number of IDs,

table(data$id) == 4

and although it showed me the number of each ID appearing in the data, I cannot select the corresponding rows.

all.data <- subset(data, subset=table(data$id) == 4)

Because the length of the original data is different, being in long form. "Length of logical index must be 1 or 2637, not 828" I would need a long-form data for further analysis, so I wish not to change it.

jay.sf
  • 60,139
  • 8
  • 53
  • 110
Eslie
  • 23
  • 3

5 Answers5

1

You can try:

df[as.logical(with(df, ave(Wave, ID, FUN = function(x) length(x) == 4))), ]

     ID Wave Score
1  1001    1    28
2  1001    2    27
3  1001    3    28
4  1001    4    26
7  1003    1    30
8  1003    2    30
9  1003    3    29
10 1003    4    28
14 1006    1    22
15 1006    2    23
16 1006    3    25
17 1006    4    19

Or if you want to keep your basic idea, a slight modification of @jay.sf code:

df[df$ID %in% names(which(table(df$ID) == 4)), ]
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
0

instead of feeding table(data$ID), try with

ID %in% names(table(data$ID)[table(data$ID)==4])

As the table gives you the number of occurrences for each ID (named vector)

boski
  • 2,437
  • 1
  • 14
  • 30
0

I like your table() approach.

> table(d$ID) == 4

 1001  1002  1003  1004  1005  1006 
 TRUE FALSE  TRUE FALSE FALSE  TRUE 

The interesting IDs are in the names() though. So to get your code to work you could extract the IDs like so

subs <- names(which(table(d$ID) == 4))

and get your desired subset using %in%.

all.data <- subset(d, subset=d$ID %in% subs)

Result

> all.data
     ID Wave Score
1  1001    1    28
2  1001    2    27
3  1001    3    28
4  1001    4    26
7  1003    1    30
8  1003    2    30
9  1003    3    29
10 1003    4    28
14 1006    1    22
15 1006    2    23
16 1006    3    25
17 1006    4    19

(BTW: Always make sure with ?<name> that you do not define any existing function names as object names, this will save you a lot of trouble. In your case type ?data in a fresh session before loading the object.)

Data

> dput(d)
structure(list(ID = c(1001L, 1001L, 1001L, 1001L, 1002L, 1002L, 
1003L, 1003L, 1003L, 1003L, 1004L, 1005L, 1005L, 1006L, 1006L, 
1006L, 1006L), Wave = c(1L, 2L, 3L, 4L, 1L, 3L, 1L, 2L, 3L, 4L, 
1L, 1L, 2L, 1L, 2L, 3L, 4L), Score = c(28L, 27L, 28L, 26L, 30L, 
30L, 30L, 30L, 29L, 28L, 22L, 20L, 18L, 22L, 23L, 25L, 19L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17"))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
0

This is a quick data.table answer.

  library(data.table)
  dt <- structure(list(ID = c(1001, 1001, 1001, 1001, 1002, 1002, 1003, 
  1003, 1003, 1003, 1004, 1005, 1005, 1006, 1006, 1006, 1006), 
Wave = c(1, 2, 3, 4, 1, 3, 1, 2, 3, 4, 1, 1, 2, 1, 2, 3, 
4), Score = c(28, 27, 28, 26, 30, 30, 30, 30, 29, 28, 22, 
20, 18, 22, 23, 25, 19)), row.names = c(NA, -17L), class = c("data.table", 
"data.frame"))

dt[ , .(Score, N = uniqueN(.SD)) , by = list(ID), .SDcols = c("Wave")][N == 4,]

 >   ID Score N
 1: 1001    28 4
 2: 1001    27 4
 3: 1001    28 4
 4: 1001    26 4
 5: 1003    30 4
 6: 1003    30 4
 7: 1003    29 4
 8: 1003    28 4
 9: 1006    22 4
10: 1006    23 4
11: 1006    25 4
12: 1006    19 4
hannes101
  • 2,410
  • 1
  • 17
  • 40
0

For the sake of completeness, here are two data.table solutions. Both identify those IDs for which Wave has values 1 to 4. One approach uses subsetting, the other one is joining.

Subsetting

library(data.table)
setDT(df)[ID %in% dt[ , which(uniqueN(Wave) == 4L), by = ID]$ID]
      ID Wave Score
 1: 1001    1    28
 2: 1001    2    27
 3: 1001    3    28
 4: 1001    4    26
 5: 1003    1    30
 6: 1003    2    30
 7: 1003    3    29
 8: 1003    4    28
 9: 1006    1    22
10: 1006    2    23
11: 1006    3    25
12: 1006    4    19

Joining

library(data.table)
setDT(df)[df[, .N, .(ID, Wave)][, .N, ID][N == 4L, .(ID)], on = "ID"]

which returns the same result.

Data

library(data.table)
fread("
rn ID   Wave Score
 1 1001 1    28
 2 1001 2    27 
 3 1001 3    28
 4 1001 4    26
 5 1002 1    30
 6 1002 3    30
 7 1003 1    30
 8 1003 2    30
 9 1003 3    29
 10 1003 4   28
 11 1004 1   22
 12 1005 1   20
 13 1005 2   18
 14 1006 1   22
 15 1006 2   23
 16 1006 3   25
 17 1006 4   19", drop = 1L)
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134