Attrition in panel data - Stata

Question

I am constructing a panel dataset based on the survey data for the years 2010-2013 (four consecutive years). As is usually the case with household survey data, there is an issue of attrition, i.e. some households drop out from the survey from year to year. I need to figure out whether these households are missing at random.

My idea is to come up with a dummy equal to 1 in 2011 if a household present in 2010 is missing in 2011 (and 0 otherwise), and so on for the years 2012, 2013. Then I want to run the logit/probit regression on this dummy with a set of covariates that I would like to control for in my study. The variable for household id is "hhid" and I have of course the time dimension variable "year".

Does anyone have a precise idea how this should be properly coded in Stata? I know it is not complicated, but I just cannot wrap my head around it and figure this out....

Cross-posted at https://www.statalist.org/forums/forum/general-stata-discussion/general/1667035-household-attrition-in-panel-data Telling people about cross-posting is always a good idea — Nick Cox, May 31 '22 at 10:35

TheIceBear · Answer 1 · 2022-05-31T09:20:53.997

Your question is whether the households you do not observe in year X differ from those you do observe in year X. There is no perfect way to answer this question because, by definition, you did not observe those households.

You did, however, observe all households in your study in year 0 (2010 in your case). As you imply yourself, you can use the observations in year 0 as a proxy to test whether those households are different in year X. I can show you how to code this, but StackOverflow is not the appropriate forum to discuss whether this is statistically valid given your data, how it was collected and what analysis you intend to use.

One way to code this is to use iebaltab in the package called ietoolkit available from SSC (disclosure, I wrote that command).

You can create a dummy indicating attrition and use iebaltab like this:

iebaltab balancevars, grpvar(attrition)

where balancevars is a list of variables for the household characteristics that you want to verify were similar in year 0. You can use the option ftest to include a test across all balance variables, the way you are suggesting.

Note that this command generates statistics, but it is up to you to decide whether this is valid, and the validity of balance tests is hotly debated. Those debates are not about coding, which is what StackOverflow is about.
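
As an alternative to iebaltab, the regression described in the question can be sketched directly. This is only a minimal illustration under stated assumptions: the data are simulated so that the code runs on its own, hhsize and head_age are hypothetical covariates that do not come from the question, and a household counts as an attriter if it misses any of the four waves, even if it returns later.

```stata
* Toy data, simulated only so the sketch runs; hhsize and head_age are hypothetical covariates
clear
set seed 2022
set obs 300
gen int hhid = _n
gen byte hhsize = 1 + floor(6 * runiform())
gen byte head_age = 20 + floor(50 * runiform())
expand 4
bysort hhid: gen int year = 2009 + _n
* Randomly drop about 15% of the follow-up rows to mimic attrition; 2010 is always kept
drop if year > 2010 & runiform() < 0.15

* Flag households observed in fewer than four waves (assumes one row per hhid-year)
bysort hhid: gen byte attrited = (_N < 4)

* Relate the flag to baseline (2010) characteristics
logit attrited hhsize head_age if year == 2010
```

With real data, only the last two commands would be needed, with your own covariates in place of the hypothetical ones. Whether such a regression is a valid test of missing at random is a statistical question, not a coding one.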

My question is purely about coding: I just wanted to know how to create a dummy equal to 1 for a household present in 2010 and absent in 2011, and 0 otherwise. I know it might be a bit of a silly question, but I really cannot figure out how to code it properly. — Joker312, Jun 01 '22 at 09:57
OK, I was not able to deduce that this was your question. The title and most of the question were about attrition. I have posted a new answer. An example dataset showing what you have and what you want usually helps to avoid this kind of confusion. — TheIceBear, Jun 01 '22 at 10:14

score 1 · Accepted Answer · answered Jun 01 '22 at 10:13

Here is an example of how to create dummies in panel data, collapse them to the parent unit of observation (here the household) so that each dummy is 1 if it was 1 in any time period, and then merge the household-level data back onto the panel.

* Example generated by -dataex-. For more info, type help dataex
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end

* For each year, create a dummy equal to 1 if the hh-year observation is from that year
local year_dummies ""
forvalues year = 2010/2013 {
    gen dummy`year' = (year==`year')
    local year_dummies "`year_dummies' dummy`year'"
}

* Collapse to the hh level; each dummy is 1 if the household was observed in that year
preserve
    collapse (max) `year_dummies' , by(hhid)
    tempfile year_dummy_hhlevel
    save `year_dummy_hhlevel'
restore

* Rename the observation-level dummies so the merged hh-level dummies do not clash with them
rename dummy???? org_dummy????

* Merge the hh-level dummies back onto every hh-year observation
merge m:1 hhid using `year_dummy_hhlevel', nogen

Thanks a lot for your answer. I was confused by the code for a while, but then figured out that it just creates a set of dummies "dummy2010", "dummy2011", "dummy2012" and "dummy2013", where "dummy2010" is 1 if a household is present in the year 2010 and 0 otherwise, and so on for the other years. I can then create, e.g., a dummy "present2010absent2011" which is 1 if a household is present in 2010 and absent in 2011 by coding: gen present2010absent2011=1 if (dummy2010==1 & dummy2011==0), followed by: replace present2010absent2011=0 if present2010absent2011==. — Joker312, Jun 03 '22 at 11:38
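
If the dummy should sit on the row of the year in which the household goes missing, as the question first described, one possible sketch uses fillin to add the missing hhid-year rows. This is an alternative to the collapse-and-merge approach, not part of the accepted answer, and the variable name dropout is chosen here only for illustration.

```stata
* Same example panel as in the accepted answer
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end

* Add a row for every missing hhid-year pair; _fillin is 1 on the added rows
fillin hhid year

* dropout is 1 in a year where the household is absent but was present the year before,
* 0 in later years, and missing in the first year (no previous wave to compare with)
bysort hhid (year): gen byte dropout = (_fillin == 1 & _fillin[_n-1] == 0) if _n > 1

list hhid year _fillin dropout, sepby(hhid)
```

Covariates are missing on the filled-in rows, so a regression on dropout would need lagged or baseline values of the covariates.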
