Another way to do it is to stack
the variables into one. This is destructive, so ensure that the dataset is save
d first. That produces a variable _stack
which tags which is which. Here values coming from the first variable, index1
, are automatically tagged 1 and those from the second variable, index2
, are tagged 2. With other names some minor manipulation would be needed to show the correspondence.
We can then condense the dataset according to whether an index occurs in 1 only, 2 only, or both, which we label 3.
The question asks about those in list 1 but not list 2, but the code keeps the generality of comparisons both ways. If there are indexes in 2 but not 1, you will get to see them. You can naturally work at the new dataset as you wish.
Yet another way would work with some merge
of the dataset with a partial copy of itself.
. clear
. input index1 index2
index1 index2
1. 10002 10005
2. 10005 10006
3. 10006 10009
4. 10009 10025
5. 10019 10028
6. 10020 10048
7. 10025 10050
8. end
. stack index1 index2, into(index) clear
. duplicates drop
Duplicates in terms of all variables
(0 observations are duplicates)
. l
+----------------+
| _stack index |
|----------------|
1. | 1 10002 |
2. | 1 10005 |
3. | 1 10006 |
4. | 1 10009 |
5. | 1 10019 |
|----------------|
6. | 1 10020 |
7. | 1 10025 |
8. | 2 10005 |
9. | 2 10006 |
10. | 2 10009 |
|----------------|
11. | 2 10025 |
12. | 2 10028 |
13. | 2 10048 |
14. | 2 10050 |
+----------------+
. bysort index (_stack) : gen which = cond(_N == 2, 3, _stack)
. bysort index: keep if _n == 1
(4 observations deleted)
. l index which
+---------------+
| index which |
|---------------|
1. | 10002 1 |
2. | 10005 3 |
3. | 10006 3 |
4. | 10009 3 |
5. | 10019 1 |
|---------------|
6. | 10020 1 |
7. | 10025 3 |
8. | 10028 2 |
9. | 10048 2 |
10. | 10050 2 |
+---------------+