Could you please give me some hints for identifying the nature of missingness in categorical variables? I did a quick search on Google Scholar but didn't find anything related to this. How can I tell whether missing values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Apart from studying the domain, I can't think of anything. Links to papers are appreciated. Thanks in advance. (I'll be working in SAS, but the question is not specific to that language.)
-
Welcome. This seems to be a better fit for [Cross Validated](http://stats.stackexchange.com/) (stats site) versus here (programming site). – LJW Nov 24 '14 at 20:19
-
Thanks for the welcome and for the hint! I'll give a try there too :) – stat Nov 24 '14 at 20:32
-
Okay but don't cross-post (post on both sites at the same time); probably best to delete this one and post it over there. – LJW Nov 24 '14 at 20:58
-
While this is definitely a better question for CV (as it's not asking about specific implementation), it's not really a good question for that site either as it's currently asked. Better would be to spend some time understanding MCAR etc., and then ask more specific questions tailored to the issues you're having understanding it. – Joe Nov 24 '14 at 21:43
1 Answer
Since you've tagged this as SAS, one approach would be to create a Boolean flag for each of your categorical variables indicating whether it is missing in each row. You could then run whatever analysis you like on the frequency of missing values, using the flags. E.g. you could use `proc corr` to see whether missing values of one variable correlate with values of other variables.
E.g. suppose you have a situation like this, where SEX is missing only for older students:
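data example;
  set sashelp.class;
  /* Simulate missingness in SEX that depends on an observed variable (AGE) */
  if AGE > 14 then call missing(SEX);
  /* Flag: 1 if SEX is missing, 0 otherwise */
  SEX_MISSING_FLAG = missing(SEX);
run;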
Then you could spot the dependence of the missingness on AGE by running the following:
proc corr data = example outp= corr;
var age weight height sex_missing_flag;
run;
Output:
_TYPE_,_NAME_,Age,Weight,Height,SEX_MISSING_FLAG
MEAN,,13.32,100.03,62.34,0.26
STD,,1.49,22.77,5.13,0.45
N,,19.00,19.00,19.00,19.00
CORR,Age,1.00,0.74,0.81,0.78
CORR,Weight,0.74,1.00,0.88,0.64
CORR,Height,0.81,0.88,1.00,0.55
CORR,SEX_MISSING_FLAG,0.78,0.64,0.55,1.00
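
Because the flag is binary and the variables of interest may themselves be categorical, a contingency-table test may be a more natural check than a Pearson correlation. The following is only a minimal sketch, assuming the `example` dataset created above and treating AGE as a discrete grouping variable (an assumption made purely for illustration, since `sashelp.class` has few categorical columns):

```sas
/* Sketch: cross-tabulate the missingness flag against a discrete variable. */
proc freq data = example;
  /* CHISQ requests chi-square tests of association; FISHER adds an exact
     test, which is safer here because the cell counts are small (n = 19). */
  tables age * sex_missing_flag / chisq fisher;
run;
```

An association between the flag and an observed variable is evidence against MCAR. It cannot by itself separate MAR from MNAR, because that distinction depends on the values that were not observed.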

user667489
-
Thanks for the try @user667489, but proc corr won't work for categorical variables (both multilevel and dummy). If you treat categorical variables as continuous, the proc corr code will run but won't produce any useful result, because a linear correlation applied to a categorical variable is meaningless. – stat Nov 25 '14 at 23:26
-
It's not the most sophisticated example, but I think the code above demonstrates the sort of thing that you could potentially spot. – user667489 Nov 25 '14 at 23:57