
In Stata, I want to create a new variable whose values follow the probabilities of a known distribution.

Say that the distribution looks like:

Blue - .2
Red - .3
Green - .5

I can use code like the following to get the exact distribution as above. First, is there a quicker way to accomplish this?

gen Color = ""
replace Color = "Blue" if _n <= _N*.2
replace Color = "Red" if _n > _N*.2 & _n <= _N*.5
replace Color = "Green" if Color==""

To simulate random draws, I think I can do:

gen rand = runiform()
sort rand
gen Color = ""
replace Color = "Blue" if rand <= .2
replace Color = "Red" if rand > .2 & rand <= .5
replace Color = "Green" if Color==""

Is this technique best practice?

bill999

1 Answer


When producing the data, you could use the more efficient -in- instead of -if-. To be honest, though, I believe the dataset would have to be very large for the time difference to be noticeable. You can experiment to check.
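One way to run that experiment is with Stata's -timer- commands. This is only a rough sketch: the one million observations and the timer numbers 1 and 2 are arbitrary choices for illustration, and timings will vary by machine.

```stata
clear
set obs 1000000              // arbitrary size, chosen only for the comparison
gen Color = ""

timer clear

// timer 1: -if- evaluates the condition for every observation
timer on 1
replace Color = "Blue" if _n <= _N*.2
timer off 1

replace Color = ""           // reset before the second run

// timer 2: -in- only touches the requested range of observations
timer on 2
replace Color = "Blue" in 1/`=floor(_N*.2)'
timer off 2

timer list                   // elapsed seconds for timers 1 and 2
```

If the two timings are close at this size, the choice between -if- and -in- probably does not matter for your data.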

The second issue, random draws, is addressed in a series of blog posts by Bill Gould (StataCorp's president). The code below has inline comments. You can run the whole thing and check the results.

clear
set more off

*----- first question -----

/* create data with certain distribution */

set obs 100
set seed 23956

gen obs = _n
gen rand = runiform()
sort rand

gen Color = ""

/* 
// original
replace Color = "Blue" if _n <= _N*.2
replace Color = "Red" if _n > _N*.2 & _n <= _N*.5
replace Color = "Green" if Color==""
*/

// using -in-
replace Color = "Blue" in 1/`=floor(_N*.2)'
replace Color = "Red" in `=floor(_N*.2) + 1'/`=floor(_N*.5)'
replace Color = "Green" in `=floor(_N*.5) + 1'/L

/* 
// using -cond()- (use in place of -gen Color = ""- and the -replace- lines)
gen Color = cond(_n <= _N*.2, "Blue", cond(_n <= _N*.5, "Red", "Green"))
*/

drop rand
sort obs

tempfile allobs
save "`allobs'"

tab Color

*----- second question -----

/* draw without replacement a random sample of 20 
observations from a dataset of N observations */

set seed 89365
sort obs // for reproducibility
generate double u = runiform()
sort u
keep in 1/20

tab obs Color

/* If N>1,000, generate two random variables u1 and u2 
in place of u, and substitute sort u1 u2 for sort u */

/* draw with replacement a random sample of 20 
observations from a dataset of N observations */

clear

set seed 08236
set obs 20
generate long obsno = floor(100*runiform()+1) // 100 = N of the full dataset
sort obsno
tempfile obstodraw
save "`obstodraw'"

use "`allobs'", clear
generate long obsno = _n
merge 1:m obsno using "`obstodraw'", keep(match) nogen

tab obs Color

These and other details can be found in the four-part series on random-number generators, by Bill Gould: http://blog.stata.com/2012/10/24/using-statas-random-number-generators-part-4-details/
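If the goal is instead to assign each observation a color independently with the stated probabilities, as in the question's second snippet, a nested -cond()- on a uniform draw does it in one line. This is a sketch under assumed values: the seed and the number of observations are arbitrary. Sorting on the random variable is not needed here, and the resulting shares will be close to .2/.3/.5 but not exact.

```stata
clear
set obs 10000                // arbitrary number of observations
set seed 12345               // arbitrary seed, only for reproducibility

gen double rand = runiform()

// rand <= .2 -> Blue (prob .2); .2 < rand <= .5 -> Red (prob .3); else Green (prob .5)
gen Color = cond(rand <= .2, "Blue", cond(rand <= .5, "Red", "Green"))

tab Color
```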

See also -help sample-.
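As a rough sketch of the built-in alternatives: -sample- draws without replacement and -bsample- draws with replacement. The toy dataset of 100 observations and the seed below are assumptions for illustration only.

```stata
clear
set obs 100                  // toy dataset standing in for the full data
gen obs = _n
set seed 4321                // arbitrary seed

preserve
sample 20, count             // keep 20 observations, drawn without replacement
count
restore

bsample 20                   // replace data with 20 observations drawn with replacement
count
```

With -sample-, the -count- option makes the number an observation count rather than a percentage.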

Roberto Ferrer