Multiple imputation and mlogit for a multinomial regression

Question

I am trying to run a multinomial regression with imputed data. I can do this with the nnet package, however I want to use mlogit. Using the mlogit package I keep getting the following error "Error in 1:nrow(data) : argument of length 0".

So making the data

library(mlogit)
library(nnet)
library(tidyverse)
library(mice)

df <- data.frame(vax = sample(1:6, 500, replace = T),
                 age = runif(500, 12, 18),
                 var1 = sample(1:2, 500, replace = T),
                 var2 = sample(1:5, 500, replace = T))

# Create missing data using the mice package:
df2 <- ampute(df, prop = 0.15)
df3 <- df2$amp

df3$vax <- as.factor(df3$vax)
df3$var1 <- as.factor(df3$var1)
df3$var2 <- as.factor(df3$var2)

# Inpute missing data:
df4 <- mice(df3, m = 5, print = T, seed = 123)

It works using nnet's multinom:

multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, data = df, model = T))
summary(pool(multinomtest))

But throws up an error when I try to reshape the data into mlogit format

test <- with(df4, dfidx(data = df4, choice = "vax", shape = "wide"))

Does anyone have any idea how I can get the imputed data into mlogit format, or even whether mlogit has compatibility with mice or any other imputation package?

score 3 · Accepted Answer · answered Aug 01 '21 at 07:07

Answer

You are using with.mids incorrectly, and thus both lines of code are wrong; the multinom line just doesn't give an error. If you want to apply multiple functions to the imputed datasets, you're better off using something like lapply:

analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2, 
         data = data.idx, 
         reflevel = "1", 
         nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})
test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))

How with.mids works

When you apply with to a mids object (AKA the output of mice::mice), then you are actually calling with.mids.

If you use getAnywhere(with.mids) (or just type mice:::with.mids), you'll find that it does a couple of things:

It loops over all imputed datasets.
It uses complete to get one dataset.
It runs the expression with the dataset as the environment.

The third step is the problem. For functions that use formulas (like lm, glm and multinom), you can use that formula within a given environment. If the variables are not in the current environment (but rather in e.g. a data frame), you can specify a new environment by setting the data variable.

The problems

This is where both your problems derive from:

In your multinom call, you set the data variable to be df. Hence, you are actually running your multinom on the original df, NOT the imputed dataset!
In your dfidx call, you are again filling in data directly. This is also wrong. However, leaving it empty also gives an error. This is because with.mids doesn't fill in the data argument, but only the environment. That isn't sufficient for you.

Fixing multinom

The solution for your multinom line is simple: just don't specify data:

multinomtest <- with(df4, multinom(vax ~ age + var1 + var2, model = T))
summary(pool(multinomtest))

As you will see, this will yield very different results! But it is important to realise that this is what you are trying to obtain.

Fixing dfidx (and mlogit)

We cannot do this with with.mids, since it uses the imputed dataset as the environment, but you want to use the modified dataset (after dfidx) as your environment. So, we have to write our own code. You could just do this with any looping function, e.g. lapply:

analyses <- lapply(seq_len(df4$m), function(i) {
  data.i <- complete(df4, i)
  data.idx <- dfidx(data = data.i, choice = "vax", shape = "wide")
  mlogit(vax ~ 1 | age + var1 + var2, data = data.idx, reflevel = "1", nests = list(type1 = c("1", "2"), type2 = c("3","4"), type3 = c("5","6")))
})

From there, all we have to do is make something that looks like a mira object, so that we can still use pool:

test <- list(call = "", call1 = df4$call, nmis = df4$nmis, analyses = analyses)
oldClass(test) <- c("mira", "matrix")
summary(pool(test))

CzechInk · Answer 2 · 2021-08-02T18:45:28.333

Offering this as a way forward to circumvent the error with dfidx():

df5 <- df4$imp %>% 
  # work with a list, where each top-element is a different imputation run (imp_n)
  map(~as.list(.x)) %>%
  transpose %>%
  # for each run, impute and return the full (imputed) data set
  map(function(imp_n.x) {
    df_out <- df4$data
    df_out$vax[is.na(df_out$vax)] <- imp_n.x$vax
    df_out$age[is.na(df_out$age)] <- imp_n.x$age
    df_out$var1[is.na(df_out$var1)] <- imp_n.x$var1
    df_out$var2[is.na(df_out$var2)] <- imp_n.x$var2
    return(df_out)
  }) %>%
  # No errors with dfidx() now
  map(function(imp_n.x) {
    dfidx(data = imp_n.x, choice = "vax", shape = "wide")
  })

However, I'm not too familiar with mlogit(), so can't help beyond this.

Update 8/2/21

As @slamballais mentioned in their answer, the issue is with dataset you refer to when fitting the model. I assume that mldata (from your code in the comments section) is a data.frame? This is probably why you are seeing the same coefficients - you are not referring to the imputed data sets (which I've identified as imp_n.x in the functions). The function purrr::map() is very similar to lapply(), where you apply a function to elements of a list. So to get the code working properly, you would want to change mldata to imp_n.x:

# To fit mlogit() for each imputed data set
df5 %>%
  map(function(imp_n.x) {
    # form as specified in the comments
    mlogit(vax ~ 1 | age + var1 + var2, 
           data = imp_n.x, 
           reflevel = "1", 
           nests = list(type1 = c('1', '2'), 
                        type2 = c('3','4'), 
                        type3 = c('5','6')))
  })

@Czechlnk this works. I can add my mlogit formula to it by continuing on the pipe: '%>% map(function(imp_n.x) { mlogit(vax ~ 1 | age + var1 + var2, data = mldata, reflevel = "1", nests = list(type1 = c('1', '2'), type2 = c('3','4'), type3 = c('5','6'))) })'. My only conern is the output for each imputation is exactly the same, and the same as if I were to run the mlogit on an unimputed dataset. One would expect sligtly different coefficients. I'm not sure exactly what the map(function(imp_n.x) is doing? — itsapuntis, Aug 01 '21 at 06:32

Multiple imputation and mlogit for a multinomial regression

2 Answers2