Differences between R and Stata in handling unidentified categorical variables

Question

I am using the mlogit package in R to model a student's college major at graduation, conditional on in-major GPA, log family income, and first chosen major. First chosen major (firstmajor) is a factor variable whose levels are all of the alternatives in majorcode except 6, which represents dropping out of school. For reference, here is sample data for three students:

studentid   majorcode   choice   majorgpa   faminc   firstmajor
1001           1         0         0        9.2      5
1001           2         0         0        9.2      5
1001           3         0         1.9      9.2      5
1001           4         0         0        9.2      5
1001           5         1         3.4      9.2      5
1001           6         0         0        9.2      5
1006           1         1         2.7      10.7     1
1006           2         0         2        10.7     1
1006           3         0         2.8      10.7     1
1006           4         0         0        10.7     1
1006           5         0         3        10.7     1
1006           6         0         0        10.7     1
1019           1         0         0        9.6      5
1019           2         0         0        9.6      5
1019           3         0         0        9.6      5
1019           4         0         0        9.6      5
1019           5         1         3.2      9.6      5
1019           6         0         0        9.6      5

My issue comes when I try to run mlogit. Adding the first major factor variable causes the following error:

> mlogit(choice ~ majorgpa |  1 + faminc + firstmajor,
+   data=mydata,
+   reflevel=6)
Error in solve.default(H, g[!fixed]) : 
system is computationally singular: reciprocal condition number = 1.04405e-16

I'm fairly sure this error occurs because my data contains no students who chose major 3 and whose first major was major 4. That empty cell prevents identification of the corresponding factor-level coefficient. However, asclogit in Stata runs the model and gives me results with the following command:

asclogit choice majorgpa2, case(studentid) alt(majorcode) casevars(faminc i.firstmajor) base(6)

The estimates include a coefficient for the factor level that should not be identified (4.firstmajor under majorcode = 3), though its standard error is very large. I can't figure out how Stata could have estimated a coefficient here; I would have expected it to drop the term because of the empty cell. Could anyone shed light on the differences between how R fits mlogit and how Stata fits asclogit, or maximum likelihood in general, that might produce this behavior?
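
For anyone who wants to confirm the empty cell described above, here is a minimal sketch in base R. It assumes the long-format layout shown in the sample data (one row per student and alternative), and it rebuilds only the three sample students, so the real mydata would be substituted. The names chosen, tab, and empty_cells are illustrative, not part of my actual script.

```r
# Rebuild the three sample students from the question in long format
# (one row per student x alternative). Replace with the real 'mydata'.
mydata <- data.frame(
  studentid  = rep(c(1001, 1006, 1019), each = 6),
  majorcode  = rep(1:6, times = 3),
  choice     = c(0, 0, 0, 0, 1, 0,
                 1, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 1, 0),
  firstmajor = rep(c(5, 1, 5), each = 6)
)

# Keep one row per student: the alternative that was actually chosen.
chosen <- subset(mydata, choice == 1)

# Cross-tabulate first major against final choice. Fixing the factor levels
# makes empty combinations appear as explicit zeros instead of being omitted.
tab <- table(
  firstmajor = factor(chosen$firstmajor, levels = 1:5),
  majorcode  = factor(chosen$majorcode,  levels = 1:6)
)
print(tab)

# Row/column positions of every empty (firstmajor, majorcode) combination.
empty_cells <- which(tab == 0, arr.ind = TRUE)
```

With the full data, a zero in the row for firstmajor 4 and the column for majorcode 3 would confirm that no student supplies information about that coefficient.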

I think it will be really difficult for anyone to respond to your question without a realistic example dataset. — , Oct 12 '18 at 16:36
Thanks. I have edited the sample data to add more students, but note that to comply with my data use agreement I've changed numbers around. — Avery, Oct 12 '18 at 16:50
Can you try mlogit in Stata and see what the results are? — Yan Song, Oct 13 '18 at 11:02
