I have a dataset that looks like this:

Observation  Outcome  VariableA  VariableB  VariableC
     1          1        1.27       0.20       0.81
     2          0        0.20       0.45       0.70
     3         -1        0.27       1.20       0.56

The Outcome variable can take the values 1, 0, or -1 and is the dependent variable in a multinomial logit model that I want to estimate in R with the mlogit package. I have transformed my data using the following code:

mlogitdataset <- mlogit.data(dataset, choice = "Outcome", shape="wide")

which gives me the following new dataset:

Observation  Outcome VariableA  VariableB  VariableC   alt
     1        FALSE       1.27       0.2        0.81   -1     
     1        FALSE       1.27       0.2        0.81    0      
     1         TRUE       1.27       0.2        0.81    1
     2        FALSE       0.20       0.45       0.70   -1
     2         TRUE       0.20       0.45       0.70    0   
     2        FALSE       0.20       0.45       0.70    1

This is essentially how I want the data to be structured. However, I do not want to use VariableA, VariableB and VariableC as separate independent variables in the multinomial logit regression. Instead, I want a single independent variable that takes its value from VariableA, VariableB or VariableC depending on the value of alt. This is shown as VariableD in the table below:

 Observation  Outcome VariableA  VariableB  VariableC   alt  VariableD
     1        FALSE       1.27       0.20       0.81   -1       0.81
     1        FALSE       1.27       0.20       0.81    0       0.20
     1         TRUE       1.27       0.20       0.81    1       1.27
     2        FALSE       0.20       0.45       0.70   -1       0.70
     2         TRUE       0.20       0.45       0.70    0       0.45
     2        FALSE       0.20       0.45       0.70    1       0.20
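For reference, the mapping in this table can be reproduced on a plain data frame (not an mlogit object) with base R only. This is a minimal sketch: the data frame `long` and the lookup vector `source_col` are names made up for illustration, and the values are copied from the table above.

```r
# Plain long-format data frame, values copied from the table above
long <- data.frame(
  Observation = rep(1:2, each = 3),
  Outcome     = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  VariableA   = rep(c(1.27, 0.20), each = 3),
  VariableB   = rep(c(0.20, 0.45), each = 3),
  VariableC   = rep(c(0.81, 0.70), each = 3),
  alt         = rep(c(-1, 0, 1), times = 2)
)

# Hypothetical lookup: which column supplies VariableD for each value of alt
source_col <- c("1" = "VariableA", "0" = "VariableB", "-1" = "VariableC")

# Numeric matrix of the three candidate columns
vals <- as.matrix(long[, c("VariableA", "VariableB", "VariableC")])

# For every row, the name of the column to read from
col_for_row <- source_col[as.character(long$alt)]

# Matrix indexing with one (row, column) pair per row
long$VariableD <- vals[cbind(seq_len(nrow(long)),
                             match(col_for_row, colnames(vals)))]

print(long)
```

Working on a numeric matrix keeps the indexing in base R and away from any subsetting method that a package-specific data class might define.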

This would allow me to run the multinomial logit regression:

mlog <- mlogit(Outcome ~ 1 | VariableD, data=mlogitdataset, reflevel = "0") 

I have tried to create VariableD directly within the mlogit object (mlogitdataset) using the following code:

outcome_map <- data.frame(alt = c(1, 0, -1), var = grep('Variable[A-C]', names(mlogitdataset)))

mlogitdataset$VariableD <- mlogitdataset[cbind(seq_len(nrow(mlogitdataset)), with(outcome_map, var[match(mlogitdataset$alt, alt)]))]

However, that gives me the error message "row names supplied are of the wrong length" when trying to run the multinomial logit regression.

How should I transform/format/structure the data so that I can run the intended regression using the mlogit function?

Thanks!

carsentdum

1 Answer

You can use case_when() from dplyr together with mutate():

library(dplyr)

mlogitdataset <- read.csv(text = "Observation,Outcome,VariableA,VariableB,VariableC,alt
1,FALSE,1.27,0.20,0.81,-1
1,FALSE,1.27,0.20,0.81,0
1,TRUE,1.27,0.20,0.81,1
2,FALSE,0.20,0.45,0.70,-1
2,TRUE,0.20,0.45,0.70,0
2,FALSE,0.20,0.45,0.70,1")

mlogitdataset <- mutate(mlogitdataset, 
       VariableD = case_when(
         alt == -1 ~ VariableC,
         alt ==  0 ~ VariableB,
         alt ==  1 ~ VariableA
       ))
fujiu
  • This lets me construct VariableD the way I want. However, when I run the mlogit regression with it as the independent variable, I still get the error message: "Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) : row names supplied are of the wrong length". Any idea how this can be solved? – carsentdum Feb 27 '19 at 12:00
  • Unfortunately, I am not familiar with the mlogit package, but the error seems to be caused by incorrect (or missing) row names. My example does not create any row names, and `mlogit` seems to look for them, possibly to connect the rows that belong to the same observation. So this seems to be less about how we create the new variable and more about the input format that `mlogit` expects. – fujiu Feb 27 '19 at 12:49
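
One possible way around the format problem raised in the comments is to build the long-format table, including VariableD, as a plain data frame first and only then hand it to `mlogit.data()` with `shape = "long"`, so that mlogit builds its own index from the finished data. The sketch below is untested against the real data and rests on assumptions: `alt_source` and `long` are illustrative names, the wide values follow the long-format tables in the question, and the `mlogit.data()` step runs only if the package is installed.

```r
# Wide data (values follow the long-format tables in the question)
dataset <- data.frame(
  Observation = 1:3,
  Outcome     = c(1, 0, -1),
  VariableA   = c(1.27, 0.20, 0.27),
  VariableB   = c(0.20, 0.45, 1.20),
  VariableC   = c(0.81, 0.70, 0.56)
)

# Hypothetical mapping from each alternative to its source column
alt_source <- data.frame(
  alt    = c(-1, 0, 1),
  column = c("VariableC", "VariableB", "VariableA"),
  stringsAsFactors = FALSE
)

# One block of rows per alternative, stacked into long format
long <- do.call(rbind, lapply(seq_len(nrow(alt_source)), function(i) {
  data.frame(
    Observation = dataset$Observation,
    alt         = alt_source$alt[i],
    Outcome     = dataset$Outcome == alt_source$alt[i],  # TRUE for the chosen alt
    VariableD   = dataset[[alt_source$column[i]]]
  )
}))
long <- long[order(long$Observation, long$alt), ]
rownames(long) <- NULL
print(long)

# Assumption: mlogit is installed; alt.var and chid.var identify the
# alternative and the choice situation in long-format input
if (requireNamespace("mlogit", quietly = TRUE)) {
  mlogitdataset <- mlogit::mlogit.data(long, choice = "Outcome", shape = "long",
                                       alt.var = "alt", chid.var = "Observation")
  # Not run here: three observations are far too few to estimate a model.
  # mlog <- mlogit::mlogit(Outcome ~ VariableD, data = mlogitdataset, reflevel = "0")
}
```

As far as I understand mlogit's formula syntax, variables that differ across alternatives belong in the first part (`Outcome ~ VariableD`), while the part after `|` is meant for variables that are constant within an observation, so `Outcome ~ 1 | VariableD` may not estimate what is intended. Worth checking against the package documentation.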