Identify group with two variables

Question

Suppose I have the following data in Stata:

clear 
input id tna ret str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
end

I want to create an identifier for new groups. Each new group should include all observations with the same name (for example X), and also all observations of an ID if the name started in that ID. For example:

X appears in the data set under two IDs: 1 and 2. The group of X should include all observations with the name X, but also the two observations with the name P (since X started in ID 2, the two observations with value P belong to group X).
Y started in ID 3, so the group should incorporate every observation with ID 3.

Robert Picard · Accepted Answer · 2016-05-01T15:20:00.643

This is a tricky problem to solve because it may take several passes to completely stabilize the identifiers. Fortunately, you can use group_id (from SSC) to solve this. To install group_id, type in Stata's Command window:

ssc install group_id

Here's a more complicated data example where "P" also appears in ID == 4 and that ID also contains "A" as a name:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id tna ret) str2 name
1  2  3 "X"
1  3  2 "X"
1  5  3 "X"
1  6 -1 "X"
2  4  2 "X"
2  6 -1 "X"
2  8 -2 "X"
2  9  3 "P"
2 11 -2 "P"
3  3  1 "Y"
3  4  0 "Y"
3  6 -1 "Y"
3  8  1 "Z"
3  6  1 "Z"
4  9  3 "P"
4 11 -2 "P"
4 12  0 "A"
end

clonevar newid = id
group_id newid, match(name)
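
If installing from SSC is not an option, the repeated passes can be sketched by hand. This is only an illustration of the idea, not a description of how group_id works internally. The names manualid, lowest, changed and passes are made up for this sketch, and only id and name are kept from the example data (one row per id/name pair).

```stata
* Hand-rolled sketch of the repeated passes (hypothetical variable names).
clear
input id str2 name
1 "X"
2 "X"
2 "P"
3 "Y"
3 "Z"
4 "P"
4 "A"
end

clonevar manualid = id
local changed = 1
local passes = 0
while `changed' {
    local changed = 0
    local ++passes
    * Within each name, then within each id, pull every observation
    * down to the smallest identifier seen so far in that cluster.
    foreach v in name id {
        bysort `v': egen double lowest = min(manualid)
        count if lowest != manualid
        if r(N) > 0 local changed = 1
        replace manualid = lowest
        drop lowest
    }
}
display "identifiers stable after `passes' passes"
sort manualid id
list, sepby(manualid)
```

With this data, ids 1, 2 and 4 end up sharing manualid 1 (id 2 joins id 1 through "X", and id 4 joins id 2 through "P"), while id 3 keeps 3. The loop stops only after a full pass changes nothing, which is why a single sort-and-replace is not enough.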

score 0 · Answer 2 · answered May 01 '16 at 14:47

I am not sure that I understand the definitions here. For example, tna and ret are not explained (if they are irrelevant, omit them from the question), and does "start" imply a process in time? But why not copy the first value of name within each id, and then classify on those first names? (With your example data, the results are the same.)

clear 
input id tna ret str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
end

sort id, stable 
by id: gen first = name[1] 
egen group = group(first), label 

list, sepby(group) 

     +---------------------------------------+
     | id   tna   ret   name   first   group |
     |---------------------------------------|
  1. |  1     2     3      X       X       X |
  2. |  1     3     2      X       X       X |
  3. |  1     5     3      X       X       X |
  4. |  1     6    -1      X       X       X |
  5. |  2     4     2      X       X       X |
  6. |  2     6    -1      X       X       X |
  7. |  2     8    -2      X       X       X |
  8. |  2     9     3      P       X       X |
  9. |  2    11    -2      P       X       X |
     |---------------------------------------|
 10. |  3     3     1      Y       Y       Y |
 11. |  3     4     0      Y       Y       Y |
 12. |  3     6    -1      Y       Y       Y |
 13. |  3     8     1      Z       Y       Y |
 14. |  3     6     1      Z       Y       Y |
     +---------------------------------------+
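
Whether classifying on first names matches a linkage through shared names depends on the data. As a hedged check, assume a fourth id whose first name is "P" (as in the extended example in the other answer; only id and name are kept here). Under the first-name rule that id forms its own group instead of being attached to X:

```stata
* Hypothetical extension: id 4 starts with "P", which also occurs in id 2.
clear
input id str2 name
1 "X"
2 "X"
2 "P"
3 "Y"
3 "Z"
4 "P"
4 "A"
end

sort id, stable
by id: gen first = name[1]
egen group = group(first), label

* id 4 is classified by its own first name, not by the X group
tabulate id group
```

If observations that share a name across ids must always land in the same group, the first-name rule is not sufficient for data like this.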
