
After I run a multinomial logistic regression, I am interested in obtaining predicted probabilities.

I found a difference in my estimates if I run:

mlogit cluster_lag i.indipvar1 i.indipvar2 i.indipvar3 indipvar4, rrr vce(cluster clustervar)

margins depvar, atmeans predict(outcome(0))

or instead:

mlogit cluster_lag i.indipvar1 i.indipvar2  i.indipvar3 indipvar4,rrr vce(cluster clustervar)

margins depvar, predict(outcome(0))

I am wondering what values Stata actually uses for the covariates when the atmeans option is not specified.

Moreover, I have a categorical variable called "year" with 4 categories: 71, 81, 91, 2001. As far as I understand, there should not be any difference between typing

margins cluster,  at(cluster==0) at (year=( 71 81 91 2001))

or

margins cluster,   at(cluster==0) over(year)

but the results are different. Can you explain the difference between the two commands?

ggg
  • The latter calculates the individual-specific predictions that are then averaged over all individuals, whereas the former computes the response at the average values of the predictors. –  May 09 '18 at 14:43
  • I don't believe this is valid Stata syntax for `margins`. The word `depvar` would be illegal. – dimitriy May 09 '18 at 20:54
  • @DimitriyV.Masterov yes, of course in my syntax I wrote the actual name of my dependent variable. I thought it would be more clear written in this way. – ggg May 10 '18 at 07:56
  • @PearlySpencer so, do you mean that the latter uses the actual values of all the covariates for each individual observation? Do you have any references on that? Thank you – ggg May 10 '18 at 07:58
  • Have a look [here](https://www3.nd.edu/~rwilliam/stats/Margins01.pdf) and pay attention to the examples provided. –  May 10 '18 at 10:04
  • @ggg If you did that, it would produce an error. It's usually best to use reproducible examples using the bundled dataset. – dimitriy May 10 '18 at 12:35
  • @PearlySpencer thank you for the reference – ggg May 11 '18 at 12:54
  • I'm voting to close this question as off-topic because it is a statistical question and not a programming one. –  Aug 02 '18 at 06:30

1 Answer

The difference here is between average marginal predictions and predictions at means. The atmeans option instructs margins to produce the latter, while the default is the former.

For example:

margins, predict(outcome(0))

is the same as:

predict newvar, outcome(0)
mean newvar
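
As a reproducible sketch of this equivalence, the following uses Stata's bundled `sysdsn1` dataset rather than the asker's data. The dataset, the model specification, and the variable name `p1` are assumptions chosen only for illustration; outcome 1 plays the role of `outcome(0)` above.

```stata
* Assumption: sysdsn1 and this model are illustrative, not the asker's data
webuse sysdsn1, clear
mlogit insure age i.male i.nonwhite i.site

* Average marginal prediction for outcome 1
margins, predict(outcome(1))

* Manual equivalent: predict for each estimation-sample observation, then average
predict double p1 if e(sample), outcome(1)
mean p1
```

The point estimates should agree. The standard errors will differ, because `mean` treats the predictions as fixed data, while `margins` accounts for the uncertainty in the estimated coefficients.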

If you do:

margins covariate, predict(outcome(0))

That's the same as:

replace covariate = 1
predict newvar1, outcome(0)
replace covariate = 2
predict newvar2, outcome(0)
replace covariate = ...
predict newvar...
mean newvar*

for each unique value of the covariate. That is, margins builds counterfactual datasets: it sets the specified variable(s) to a given value for every observation, leaves all other variables unchanged, and then generates model predictions from each counterfactual dataset.
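
The counterfactual logic can be reproduced by hand. The sketch below again assumes the bundled `sysdsn1` dataset and an illustrative model; the helper variables `insample`, `site_orig`, and `p_site1` to `p_site3` are hypothetical names introduced here.

```stata
* Assumption: sysdsn1 and this model are illustrative, not the asker's data
webuse sysdsn1, clear
mlogit insure age i.male i.nonwhite i.site

* What margins reports for each level of site
margins site, predict(outcome(1))

* Manual equivalent: set site to each level for everyone, predict, then average
generate byte insample = e(sample)
clonevar site_orig = site
forvalues s = 1/3 {
    quietly replace site = `s'
    quietly predict double p_site`s' if insample, outcome(1)
}
quietly replace site = site_orig    // restore the observed values
mean p_site1 p_site2 p_site3
```

Each mean should match the corresponding margin; only the standard errors differ, since `mean` ignores the sampling variability of the coefficients.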

The atmeans option is the same as specifying the at() option with every covariate fixed at its mean before generating predictions. This often doesn't make sense if you have, for example, categorical covariates, or if the means are not observable or typical values.
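
To see the two kinds of prediction side by side, here is a minimal sketch on the bundled `sysdsn1` dataset (an assumption for illustration, as is the model). With factor variables in the model, `atmeans` sets each indicator to its sample proportion, which is the "fractional person" problem described above.

```stata
* Assumption: sysdsn1 and this model are illustrative, not the asker's data
webuse sysdsn1, clear
mlogit insure age i.male i.nonwhite i.site

* Default: average of the individual predictions (covariates as observed)
margins, predict(outcome(1))

* Prediction for one artificial observation with every covariate at its mean
margins, atmeans predict(outcome(1))

* Middle ground: fix only the continuous covariate at its mean
margins, at((mean) age) predict(outcome(1))
```

Because the model is nonlinear, the first two results will generally not coincide.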

On your second question, these two are not equivalent:

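margins, at(cluster=0 year=(1971 1981 1991 2001))
margins, at(cluster=0) over(year)

The over() option is a subsetting operation, while at() is a counterfactual operation. Note that both settings are placed inside a single at() here: separate at() options define separate scenarios, so at(cluster=0) at(year=(...)) would produce one set of margins with cluster fixed at 0 and another four with only year changed.

The first line generates a counterfactual (as above) where every observation's value of year is replaced with 1971, then 1981, then 1991, then 2001, generating predictions on each version of the dataset (while holding cluster at 0 for all cases in all counterfactuals).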

The second line fixes cluster at 0 for all observations, then splits the data by observed values of year, then generates average predictions on each subset.
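
The contrast can be checked on the bundled `sysdsn1` dataset. This is a sketch under assumed names: `male` stands in for cluster and `site` for year, and the model is illustrative. Both settings go in one at(), because separate at() options would produce separate sets of margins.

```stata
* Assumption: sysdsn1 and this model are illustrative, not the asker's data
webuse sysdsn1, clear
mlogit insure age i.male i.nonwhite i.site

* Counterfactual: everyone is treated as male = 0 and, in turn, site = 1, 2, 3
margins, at(male=0 site=(1 2 3)) predict(outcome(1))

* Subsetting: everyone is treated as male = 0, then averaged within observed site
margins, at(male=0) over(site) predict(outcome(1))
```

The two sets of results differ whenever the other covariates (here `age` and `nonwhite`) are distributed differently across the observed groups, since at() averages over the full sample and over() averages within each group.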

Stata's margins reference manual is always the best reference on these things.

Thomas