Replace a subset of data with a sequence of integers

Question

I have a dataset that looks like the following:

id    test
1     A
2     A
3     A
.     B
.     B
.     B

I would like to fill in the missing values with a sequence of integers corresponding to their id:

id    test
1     A
2     A
3     A
1     B
2     B
3     B

I initially thought about using a forvalues loop as follows:

forvalues i=1/3 {
    replace id = `i' if (id == .)
}

But that (for obvious reasons) just replaces all the missing values with 1 during the first iteration.

I then thought about restricting the loop to a subset of the data by row indexing (like in R), but this functionality doesn't seem to exist for Stata(?):

forvalues i=1/3 {
   replace id[3+`i'] = `i' if (id == .)
}

How can I go about this seemingly simple task in Stata?

This is a very sparse description of the assumptions surrounding your data. Are there always 3 id's per test? Are the distinct values of test grouped together, or are there multiple batches of observations with the same value for test? Is there another variable that determines what order the 3 (or whatever) observations for the same value of test appear in? Are you certain that the missing values of id are all-or-nothing, that there isn't a value of test for which only 1 of 3 (or whatever) values of id are missing? — , Sep 05 '15 at 15:38

score 0 · Answer 1 · answered Sep 05 '15 at 14:55

0

Ok, thinking it over some more, I think I've solved my own problem. The easiest solution utilizes Stata's _n-function without resorting to loops:

replace id = _n - 3 if (id == .)

This seems to do the trick.

answered Sep 05 '15 at 14:55

ageil

171
1
3
16

1

This works if you have only 6 observations. If you want identifiers to repeat 1, 2, 3, then it won't work. – Nick Cox Sep 05 '15 at 14:59
And if you do have only 6 observations, using Stata's Data Editor, you could replace the missing values manually. :-) – Sep 05 '15 at 15:31

Nick Cox · Answer 2 · 2015-09-05T15:13:48.913

0

One of various possibilities (for repeating identifiers 1, 2, 3, 1, 2, 3, ..., which is how I read the question) is

replace id = 1 + mod(_n - 1, 3) if missing(id)

or check out help egen for a seq() function.

egen newid = seq(), to(3) 
replace id = newid if missing(id)

Given an initial 1, 2, 3 in the first three observations, then

replace id = id[_n-3] if missing(id)

will also work.

edited Sep 05 '15 at 15:13

answered Sep 05 '15 at 14:57

Nick Cox

35,529
6
31
47

score 0 · Answer 3 · answered Sep 05 '15 at 15:26

0

Or perhaps

replace id = cond(test==test[_n-1],id[_n-1]+1,1) if missing(id)

which does not require that there are exactly three id's per test.

answered Sep 05 '15 at 15:26

Meru Bhanot · Answer 4 · 2015-09-16T16:56:07.523

If all of your IDs start with the number 1, and you really just want to number them, something like this should work:

sort test id
by test: gen id2 = _n
replace id = id2 if missing(id)

Note that you should alter the first sort as needed. By nature of your question, there's no sense in which the order of the items with test == B is "ordered" so you should make sure that the initial sort puts them in a desirable order.

EDIT: As Nick Cox noted, this code reduces to a one liner, which is preferable if this is the desired action. I wrote it out as 3 lines for simplicity, as well as to note the following facts:

The sort must be taken into account, whether you do it independently or in the bysort command, to ensure the labels align with the programmer's goal
In case you want to spot check the new IDs against the old, it may be useful to be able to visually see id2, just to ensure that the code is working.

Once both of these are assured, I agree that Nick's code from the comment below is preferable.

This reduces to `bysort test (id) : replace id = _n if missing(id)`. — Nick Cox, Sep 16 '15 at 15:44

score 0 · Answer 5 · answered Jun 05 '18 at 13:29

For instructive purposes, @Nick's egen approach can be generalized further:

clear

input id str1 test
1     A
2     A
3     A
.     B
.     B
.     C
.     C
.     C
.     C
.     D
.     E
.     E
.     E
end

levelsof test, local(test_levels)
local number_of_test_levels : word count `test_levels'

forvalues i = 1 / `number_of_test_levels' {
      count if missing(id) & test == "`=char(64 + `i')'"
      if `r(N)' != 0 {
          egen newid`i' = seq(), from(1) to(`r(N)')
          replace id = newid`i' if missing(id) & test == "`=char(64 + `i')'"
          drop newid`i'
      }
}

Which produces the expected output:

sort test id  
list

     +-----------+
     | id   test |
     |-----------|
  1. |  1      A |
  2. |  2      A |
  3. |  3      A |
  4. |  1      B |
  5. |  2      B |
     |-----------|
  6. |  1      C |
  7. |  2      C |
  8. |  3      C |
  9. |  4      C |
 10. |  1      D |
     |-----------|
 11. |  1      E |
 12. |  2      E |
 13. |  3      E |
     +-----------+

Replace a subset of data with a sequence of integers

5 Answers5