Stata comparing lists and finding missing numbers

Question

I have a question that is probably very simple but I can't figure it out right now.

I have two long lists of index numbers, they are identical except for the fact that the first list contains some numbers that the second list does not and thus the order is different as well.

Here is an example of the two lists:

   index1   index2
    10002   10005
    10005   10006
    10006   10009
    10009   10025
    10019   10028
    10020   10048
    10025   10050

Is there any way in which I can find the index numbers that are in the first list but are missing in the second one? I tried using loops, but the changing order makes it quite hard.

Manually is not an option, as the list is over 7000 numbers long.

Thank you!

score 2 · Accepted Answer · answered Oct 14 '14 at 18:01

How about something like:

clear
set more off

*----- example data -----

input ///
   index1   index2
    10002   10005
    10005   10006
    10006   10009
    10009   10025
    10019   10028
    10020   10048
    10025   10050
end

*----- what you want -----

levelsof index1, local(i1)
levelsof index2, local(i2)

local diff: list i1 - i2

display "`diff'"

You want to check help macro lists and help limits to verify the limit for the "# of characters in a macro".

@Roberto gets the accept. My contribution was `levelsof`! – Nick Cox Oct 14 '14 at 18:49 — Nick Cox, Oct 14 '14 at 18:49

score 2 · Answer 2 · answered Oct 14 '14 at 18:32

Another way to do it is to stack the variables into one. This is destructive, so ensure that the dataset is saved first. That produces a variable _stack which tags which is which. Here values coming from the first variable, index1, are automatically tagged 1 and those from the second variable, index2, are tagged 2. With other names some minor manipulation would be needed to show the correspondence.

We can then condense the dataset according to whether an index occurs in 1 only, 2 only, or both, which we label 3.

The question asks about those in list 1 but not list 2, but the code keeps the generality of comparisons both ways. If there are indexes in 2 but not 1, you will get to see them. You can naturally work at the new dataset as you wish.

Yet another way would work with some merge of the dataset with a partial copy of itself.

. clear 

. input index1   index2

    index1     index2
1.     10002   10005
2.     10005   10006
3.     10006   10009
4.     10009   10025
5.     10019   10028
6.     10020   10048
7.     10025   10050
8. end 

. stack index1 index2, into(index) clear 

. duplicates drop 

Duplicates in terms of all variables

(0 observations are duplicates)

. l

     +----------------+
     | _stack   index |
     |----------------|
  1. |      1   10002 |
  2. |      1   10005 |
  3. |      1   10006 |
  4. |      1   10009 |
  5. |      1   10019 |
     |----------------|
  6. |      1   10020 |
  7. |      1   10025 |
  8. |      2   10005 |
  9. |      2   10006 |
 10. |      2   10009 |
     |----------------|
 11. |      2   10025 |
 12. |      2   10028 |
 13. |      2   10048 |
 14. |      2   10050 |
     +----------------+

 . bysort index (_stack) : gen which = cond(_N == 2, 3, _stack) 

 . bysort index: keep if _n == 1
 (4 observations deleted)

 . l index which  

     +---------------+
     | index   which |
     |---------------|
  1. | 10002       1 |
  2. | 10005       3 |
  3. | 10006       3 |
  4. | 10009       3 |
  5. | 10019       1 |
     |---------------|
  6. | 10020       1 |
  7. | 10025       3 |
  8. | 10028       2 |
  9. | 10048       2 |
 10. | 10050       2 |
     +---------------+

Stata comparing lists and finding missing numbers

2 Answers2