Splitsample in Stata 16: How to create samples based on varying proportions saved in a variable?

Question

Data structure: I use panel data in which each observation represents an individual in a given year (2015-2021). Only observations of individuals between 15 and 25 years old are included. There are 2857 observations of 1373 individuals in total.

Goal: The goal is to investigate the effect of a policy change in 2018. To do so, I set up a quasi-experimental design with two control groups and a treatment group, defined in terms of age:

Control group A: individuals 15-17 years old
Treatment group: individuals 18-22 years old
Control group B: individuals 23-25 years old

Dividing individuals into treatment and control groups based on varying chances: for methodological reasons, individuals selected into a control group may not become part of the treatment group (due to aging over time) and vice versa. I am therefore confronted with the question of how to select the right individuals (given their age and the year) for the treatment and control groups. To ensure that every year has observations of individuals of all ages, I came up with the following design (see picture). There are 17 theoretically possible individuals in my data (vertical in the picture) who age over 7 years (2015-2021). I would like to sample the individuals into the treatment and control groups based on the chances mentioned in the table beneath, so that all ages are represented in all years.

Programming: I constructed a variable (1-17) indicating which number an individual represents (like the vertical numbers in the table above):

gen individualnumber=(age-year)+2007

I constructed three variables indicating the chances of being in controlgroup A, B or treatment in the following way:

gen Chanceofbeingcontrol_1517=0
replace Chanceofbeingcontrol_1517=1 if individualnumber==1 | individualnumber==2 | individualnumber==3
replace Chanceofbeingcontrol_1517=0.75 if individualnumber==4
replace Chanceofbeingcontrol_1517=0.60 if individualnumber==5
replace Chanceofbeingcontrol_1517=0.50 if individualnumber==6
replace Chanceofbeingcontrol_1517=0.43 if individualnumber==7
replace Chanceofbeingcontrol_1517=0.29 if individualnumber==8
replace Chanceofbeingcontrol_1517=0.14 if individualnumber==9

gen Chanceofbeingcontrol_2325=0
replace Chanceofbeingcontrol_2325=1 if individualnumber==15 | individualnumber==16 | individualnumber==17
replace Chanceofbeingcontrol_2325=0.75 if individualnumber==14
replace Chanceofbeingcontrol_2325=0.60 if individualnumber==13
replace Chanceofbeingcontrol_2325=0.50 if individualnumber==12
replace Chanceofbeingcontrol_2325=0.43 if individualnumber==11
replace Chanceofbeingcontrol_2325=0.29 if individualnumber==10
replace Chanceofbeingcontrol_2325=0.14 if individualnumber==9

gen Chanceofbeingtreated=1-(Chanceofbeingcontrol_1517+Chanceofbeingcontrol_2325)

After that I wanted to construct the samples...

splitsample, generate(treatedornot) split(Chanceofbeingcontrol_1517 Chanceofbeingtreated Chanceofbeingcontrol_2325) cluster(individualnumber) rround show

...but I received an error, since split() only accepts a numlist and not variable names.

Question: How can I construct the samples, or work around this error, in an efficient way?

Example: An individual (number 7 in the table) who is 15 years old in 2015 (control group A age) will be 18 years old in 2018 (which is the treatment age). This individual may not be part of both the treatment and the control group and should therefore be a member of only one of the two. Therefore I want to draw three random samples among all number 7 individuals. Let's say there are 100 individuals like individual 7 in the table.

Sample 1 is control group A, and individual 7 will occur 43 times in this sample.
Sample 2 is the treatment group, so individual 7 occurs 57 times in this sample.
Sample 3 is control group B, and individual 7 will not occur in it, since this person is never older than 22 during 2015-2021.

TheIceBear · Answer 1 · 2022-06-06T15:48:56.573

What all people who were 9 in 2015, 10 in 2016 and 11 in 2017 have in common is that they were born in 2006. And all who were 10 in 2015 were born in 2005. So instead of a variable individualnumber that can be hard to understand for someone who reads your code, why don't you create a variable called birthyear? That will make it easier to explain your design to your peers.

Regardless of what you call the variable or what its values represent, I would solve it something like this. You will probably need to tweak this code. Provide a replicable subset of your data (see the command dataex) if you want a replicable answer.
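To make the link between the two variables concrete, here is a small sketch on made-up rows (the four observations are hypothetical, and it assumes age is recorded such that year minus age is constant for a person). It shows that individualnumber is just 2007 minus birthyear, so the chances defined per individualnumber carry over one-to-one to birth years.

```stata
* Hypothetical rows: the same person at 15 and 18, plus the youngest
* and the oldest cohort that can appear in the 2015-2021 panel
clear
input int year byte age
2015 15
2018 18
2021 15
2015 25
end

* Variable from the question and the proposed alternative
gen individualnumber = (age - year) + 2007
gen birthyear = year - age

* The two variables carry the same information
assert individualnumber == 2007 - birthyear
list, noobs
```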

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year double age
1 2017 15
1 2017 15
2 2017 15
2 2017 15
3 2017 15
3 2017 15
4 2017 15
4 2017 15
5 2015 12
5 2015 12
end

* Create the birthyear variable that the assignment is based on
gen birthyear = year-age

    preserve

    * Collapse person-year level data to person level so
    * that each individual only gets one treatment status.
    * You must have an individual id number for this.
    * Get the standard deviation to test that the data is good and that
    * birthyear is identical for each individual across the panel data set
    collapse (mean) birthyear (sd) bysd=birthyear, by(id)

    * Test that birthyear is constant within each individual. This is not
    * strictly needed, but it is a good data quality check. bysd is missing
    * for individuals with a single observation, so allow that as well.
    * Then drop the var as it is no longer needed
    assert bysd == 0 | missing(bysd)
    drop bysd

    * Set seed to make the assignment replicable. Once you have tested this
    * script, replace this seed with a new random seed. For example from here:
    * https://www.random.org/integers/?num=1&min=100000&max=999999&col=5&base=10&format=html&rnd=new
    set seed 123456

    *Generate a random number based on the seed
    gen random_draw = runiform()

    * For each birthyear, get the rank of the random number divided by the number
    * of individuals in each birthyear
    sort birthyear random_draw
    by birthyear : gen percent_rank = _n/_N

    *Initiate treatment variable and attach the value label to it
    gen tmt_status = .
    label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
    label values tmt_status tmt_status

    *Assign birthyear 2006-2004 that are all the same
    replace tmt_status = 1 if birthyear == 2006
    replace tmt_status = 1 if birthyear == 2005
    replace tmt_status = 1 if birthyear == 2004

    *Assign birthyear 2003
    replace tmt_status = 0 if birthyear == 2003 & percent_rank <= .25
    replace tmt_status = 1 if birthyear == 2003 & percent_rank >  .25

    *Assign birthyear 2002
    replace tmt_status = 0 if birthyear == 2002 & percent_rank <= .40
    replace tmt_status = 1 if birthyear == 2002 & percent_rank >  .40

    *Fill in birthyear 2001-1999

    *Assign birthyear 1998
    replace tmt_status = 0 if birthyear == 1998 & percent_rank <= .72
    replace tmt_status = 1 if birthyear == 1998 & percent_rank >  .72 & percent_rank <= .86
    replace tmt_status = 2 if birthyear == 1998 & percent_rank >  .86

    *Fill in birthyear 1997-1990

    * Do some tabulates etc to convince yourself the randomization is as expected

    * Save tempfile of data to be merged to later
    * (Consider saving this as a master data set https://worldbank.github.io/dime-data-handbook/measurement.html#constructing-master-data-sets)
    tempfile assignment_results
    save `assignment_results'
    
restore

merge m:1 id using `assignment_results'

This code can be made more concise using a loop, but random assignment is so important that I personally always go for clarity over conciseness when doing this.

This does not specifically answer the splitsample part, but it addresses what you are trying to do. You will have to decide how you want to handle groups whose size cannot be split into the exact ratio you prefer.
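If you do prefer a more compact version, one option is to read the cut-offs from probability variables instead of hard-coding them per birth year. The following is only a sketch on made-up person-level data: the toy cohorts and the names p_ctrlA and p_ctrlB are my own assumptions (they play the role of Chanceofbeingcontrol_1517 and Chanceofbeingcontrol_2325), and everything is stored as double to avoid float precision surprises in the comparisons. It places ControlA at the bottom of the random ranking and ControlB at the top, which is a different but equally random ordering than in the code above.

```stata
* Toy person-level data (hypothetical): 100 individuals in each of 4 cohorts
clear
set obs 400
gen long id = _n
gen int birthyear = 2004
replace birthyear = 2003 if inrange(id, 101, 200)
replace birthyear = 2000 if inrange(id, 201, 300)
replace birthyear = 1998 if inrange(id, 301, 400)
gen byte individualnumber = 2007 - birthyear

* Chances from the question, only for the cohorts in the toy data
gen double p_ctrlA = 0
replace p_ctrlA = 1    if inlist(individualnumber, 1, 2, 3)
replace p_ctrlA = 0.75 if individualnumber == 4
replace p_ctrlA = 0.43 if individualnumber == 7
replace p_ctrlA = 0.14 if individualnumber == 9
gen double p_ctrlB = 0
replace p_ctrlB = 0.14 if individualnumber == 9

* Random order within each birth year
set seed 123456
gen double random_draw = runiform()
bysort birthyear (random_draw): gen double percent_rank = _n/_N

* Bottom share p_ctrlA -> ControlA, top share p_ctrlB -> ControlB,
* everyone in between -> Treated
gen byte tmt_status = 0
replace tmt_status = 1 if percent_rank <= p_ctrlA
replace tmt_status = 2 if percent_rank >  1 - p_ctrlB
label define tmt_status 0 "Treated" 1 "ControlA" 2 "ControlB"
label values tmt_status tmt_status

* Check the realized split per cohort
tabulate birthyear tmt_status
```

With this layout a cohort with p_ctrlB equal to 0 can never end up in ControlB, and a cohort with p_ctrlA equal to 1 ends up entirely in ControlA, without any special cases in the code.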
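If you want to stay with splitsample, one way around the numlist restriction might be to call it once per cohort, each time with that cohort's proportions typed out as a numlist. The sketch below is untested on real data and uses made-up person-year rows; the variable names s2003, s2000 and s1998 are hypothetical, and it assumes that splitsample accepts an if qualifier together with cluster(), so check help splitsample before relying on it. Cohorts with a chance of 1 need no call at all and can be assigned directly.

```stata
* Toy person-year data (hypothetical): 100 individuals per cohort, 2 rows each
clear
set obs 300
gen long id = _n
gen int birthyear = cond(id <= 100, 2003, cond(id <= 200, 2000, 1998))
expand 2

set seed 123456

* One call per cohort, each with its own proportions as a numlist.
* cluster(id) keeps all rows of an individual in the same sample.
splitsample if birthyear == 2003, generate(s2003) split(0.75 0.25) cluster(id)
splitsample if birthyear == 2000, generate(s2000) split(0.43 0.57) cluster(id)
splitsample if birthyear == 1998, generate(s1998) split(0.14 0.72 0.14) cluster(id)

* Map the sample IDs (1, 2, 3) to one status variable:
* 0 = Treated, 1 = ControlA, 2 = ControlB
gen byte tmt_status = .
replace tmt_status = cond(s2003 == 1, 1, 0) if birthyear == 2003
replace tmt_status = cond(s2000 == 1, 1, 0) if birthyear == 2000
replace tmt_status = cond(s1998 == 1, 1, cond(s1998 == 2, 0, 2)) if birthyear == 1998
drop s2003 s2000 s1998

* Check the realized split per cohort
tabulate birthyear tmt_status
```

Compare the tabulated shares with the targets: when a cohort's size does not divide evenly, the realized split can only approximate the proportions, which is the decision mentioned above.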
