
I want to model car purchase choice using a nested logit approach. The data I am currently using are hypothetical, since I want to be sure I know how to handle them before running the actual questionnaire.

The data contain 1,000 hypothetical choice observations on 568 car models, which are nested within 35 manufacturers.

I have read the mlogit package vignette on data management, model description and testing, and learned about the wide and long data formats. In the examples of both formats, every alternative in each choice set is stored in the dataset, as in this long-format example:

  case   alt choice dist  cost ivt ovt freq income urban noalt
1    1 train      0   83 28.25  50  66    4     45     0     2
2    1   car      1   83 15.77  61   0    0     45     0     2
3    2 train      0   83 28.25  50  66    4     25     0     2
4    2   car      1   83 15.77  61   0    0     25     0     2
5    3 train      0   83 28.25  50  66    4     70     0     2
6    3   car      1   83 15.77  61   0    0     70     0     2

The choice is indicated by the choice column.

On the other hand, mine looks like this:

  case  brand     model      length  <other_var>
1    1  Mazda     CX-30 GT     4395         ...
2    2  Mercedes~ GLS-Class~   5130         ...
3    3  Maserati  Ghibli S ~   4971         ...
...

I think my dataset above is neither long nor wide, since I do not store each person's choice set as rows, and I have no choice variable acting as a choice indicator.

My questions are:

  1. Do I really need to reformat my data into long or wide format in order to estimate the model?
  2. If yes, how do I do that? I imagine that if I choose the long format, for example, I would have 568 rows for each person, so 1,000 * 568 = 568,000 rows in total.

Thank you so much.

  • Your data looks pretty much long to me: there is one car model per row, then the column `length` is clearly a separate variable. You can check the functions `pivot_longer()` and `pivot_wider()` in package `tidyr` to see how to easily reshape your data. But from what I see on the page you linked, `mlogit` might be able to take your data directly. You can check [this chapter](https://r4ds.had.co.nz/tidy-data.html#pivoting) for more illustrations of long/wide data and pivoting. – Alexlok Aug 14 '20 at 23:10
  • @Alexlok Thank you so much for your comment, it does give me some understanding. I'm going to check your link. But your answer still leaves me in doubt: does it mean that long data does not always require every alternative in the choice set to be present in the table as a row, with a `choice` column indicating whether it was chosen or not (1 or 0)? Especially for choice data. – Abdul Mubdi Bindar Aug 14 '20 at 23:59
  • I am not familiar with `mlogit` so can't comment on what that interface expects. But in general, "long format" means you make your dataframe longer, so that if a factor can have several levels, it is stored in a single column. So "long format" doesn't require you to have every combination of the variables, it just describes how you store your data. If not every combination of levels is present, it could mean you're missing a condition, or it might not matter; that depends on what `mlogit` expects. – Alexlok Aug 15 '20 at 00:18
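
To make the long layout discussed above concrete, here is a minimal sketch in base R of expanding "one row per chosen car" data into "one row per case and alternative" with a 0/1 `choice` column, mirroring the train/car example in the question. Everything in it is an assumption for illustration: the toy tables `alternatives` and `chosen`, the model and brand names, and the idea that every respondent faces the full set of models. It is not taken from the mlogit vignette.

```r
# Hypothetical toy data: 3 respondents choosing among 4 car models
# (stand-ins for the 1,000 cases and 568 models in the question).
alternatives <- data.frame(
  model  = c("A1", "A2", "B1", "B2"),
  brand  = c("BrandA", "BrandA", "BrandB", "BrandB"),
  length = c(4395, 4500, 5130, 4971),
  stringsAsFactors = FALSE
)
chosen <- data.frame(
  case  = 1:3,
  model = c("A1", "B1", "B2"),
  stringsAsFactors = FALSE
)

# Cross join (by = NULL): one row per (case, alternative) pair.
long <- merge(data.frame(case = chosen$case), alternatives, by = NULL)

# Flag the alternative that each respondent actually picked.
picked <- chosen$model[match(long$case, chosen$case)]
long$choice <- as.integer(long$model == picked)

# Sort so that each case's choice set is contiguous.
long <- long[order(long$case, long$model), ]
rownames(long) <- NULL

nrow(long)  # 3 cases * 4 alternatives = 12 rows
```

With the real data the same steps would give 1,000 * 568 = 568,000 rows. A table shaped like this is the kind of input that mlogit's data-preparation step (`mlogit.data()`, or `dfidx` in recent versions) is described as taking in the vignette; the exact arguments depend on the installed version, so they are left out here.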
