I'm using the statistical library mlogit, which has an inefficient routine for creating a balanced data.frame from an unbalanced one. With my particular dataset, an intermediate data.frame mf is produced with several hundred thousand rows. The problematic line of code is:
mf <- mf[all.rn, ]
Here all.rn is a character vector used to index the data.frame, of the form:
"701364.1" "701364.3" "701364.4" "701364.5" "701364.6" "701364.7"
"701364.8" "701364.9" "701364.2" "701364.12"
"701364.10" "701364.11" "701364.13" "701364.14" "701364.15" "701364.16"
"701364.17" "701364.18" "701364.19" "701364.20"
"701364.21" "701364.23" "701364.24" "701364.22" "701364.27" "701364.28"
"701364.30" "701364.37" "701364.38" "701364.39"
"701364.25" "701364.26" "701364.29" "701365.1" "701365.3" "701365.4"
"701365.5" "701365.6" "701365.7" "701365.8"
"701365.9" "701365.2" "701365.12" "701365.10" "701365.11" "701365.13"
"701365.14" "701365.15" "701365.16" "701365.17"
Each element consists of two numbers separated by a dot. The first identifies a particular event and the second a choice available for that event. The set of choices varies between events, and not all choices are available for every event.
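To make the id format concrete, here is a small sketch that splits such ids into their event and choice parts. The variable names (rn, event, choice) are only illustrative and are not taken from mlogit. The ids have to stay as character strings, because as numbers "701364.1" and "701364.10" would be indistinguishable.

```r
# A few ids in the same event.choice format as all.rn
rn <- c("701364.1", "701364.12", "701365.3")

# Everything before the dot is the event id
event <- sub("\\..*$", "", rn)

# Everything after the dot is the choice number within that event
choice <- as.integer(sub("^.*\\.", "", rn))

print(data.frame(id = rn, event = event, choice = choice))
```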
The original mf data.frame looks like this:
701364.1 FALSE 1.191801e-02 11.88888889
701364.3 FALSE 2.715409e-01 7.88888889
701364.4 FALSE -3.202290e-02 4.88888889
701364.5 FALSE -1.940157e-01 -3.11111111
701364.6 FALSE 5.653818e-02 -4.11111111
701364.7 FALSE 2.081075e-02 -7.11111111
701364.8 FALSE -1.819507e-01 -8.11111111
701364.9 TRUE -1.491018e-01 -11.11111111
701365.1 FALSE 2.354772e-01 3.44444444
701365.2 TRUE 1.141553e-01 3.44444444
701365.3 FALSE -3.000000e-01 3.44444444
701365.4 FALSE 3.585301e-02 3.44444444
701365.8 FALSE -2.321651e-02 -3.55555556
701367.1 FALSE 2.154056e-01 5.20000000
701367.2 FALSE -7.043655e-03 2.20000000
The resulting data.frame after the routine is balanced, with the missing choices filled in with NAs:
701364.1 FALSE 1.191801e-02 11.88888889
701364.3 FALSE 2.715409e-01 7.88888889
701364.4 FALSE -3.202290e-02 4.88888889
701364.5 FALSE -1.940157e-01 -3.11111111
701364.6 FALSE 5.653818e-02 -4.11111111
701364.7 FALSE 2.081075e-02 -7.11111111
701364.8 FALSE -1.819507e-01 -8.11111111
701364.9 TRUE -1.491018e-01 -11.11111111
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
NA.9 NA NA NA
NA.10 NA NA NA
NA.11 NA NA NA
NA.12 NA NA NA
NA.13 NA NA NA
NA.14 NA NA NA
NA.15 NA NA NA
NA.16 NA NA NA
NA.17 NA NA NA
NA.18 NA NA NA
NA.19 NA NA NA
NA.20 NA NA NA
NA.21 NA NA NA
NA.22 NA NA NA
NA.23 NA NA NA
NA.24 NA NA NA
701365.1 FALSE 2.354772e-01 3.44444444
701365.3 FALSE -3.000000e-01 3.44444444
701365.4 FALSE 3.585301e-02 3.44444444
NA.25 NA NA NA
NA.26 NA NA NA
NA.27 NA NA NA
701365.8 FALSE -2.321651e-02 -3.55555556
This is the state before the row.names are changed.
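The padding behaviour can be reproduced on toy data. This is only a sketch: the column names chosen and x and all the values are invented, and the real mf has far more rows.

```r
# Toy stand-in for mf; column names and values are invented for illustration
mf <- data.frame(
  chosen = c(FALSE, TRUE, FALSE),
  x      = c(0.1, 0.2, 0.3),
  row.names = c("1.1", "1.3", "2.1")
)

# Full set of event.choice ids, including ones that are absent from mf
all.rn <- c("1.1", "1.2", "1.3", "2.1", "2.2", "2.3")

# Character indexing: ids with no matching row name come back as all-NA rows
balanced <- mf[all.rn, ]
print(balanced)

# In this toy data x has no NAs of its own, so an NA in x marks a padded row
missing.ids <- all.rn[is.na(balanced$x)]
print(missing.ids)
```

The padded rows get placeholder row names here as well, which is why the real row names have to be assigned afterwards.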
The problem is that this routine is a bottleneck when working with larger datasets. It can take a day to run, with the actual model fitting taking only a fraction of that time. Is there any way to speed this up?
After renaming the rows with rownames(mf) <- all.rn, the final data.frame would look like this:
701364.1 FALSE 1.191801e-02 11.88888889
701364.3 FALSE 2.715409e-01 7.88888889
701364.4 FALSE -3.202290e-02 4.88888889
701364.5 FALSE -1.940157e-01 -3.11111111
701364.6 FALSE 5.653818e-02 -4.11111111
701364.7 FALSE 2.081075e-02 -7.11111111
701364.8 FALSE -1.819507e-01 -8.11111111
701364.9 TRUE -1.491018e-01 -11.11111111
701364.2 NA NA NA
701364.12 NA NA NA
701364.10 NA NA NA
701364.11 NA NA NA
701364.13 NA NA NA
701364.14 NA NA NA
701364.15 NA NA NA
701364.16 NA NA NA
701364.17 NA NA NA
701364.18 NA NA NA
701364.19 NA NA NA
701364.20 NA NA NA
701364.21 NA NA NA
701364.23 NA NA NA
701364.24 NA NA NA
701364.22 NA NA NA
701364.27 NA NA NA
701364.28 NA NA NA
701364.30 NA NA NA
701364.37 NA NA NA
701364.38 NA NA NA
701364.39 NA NA NA
701364.25 NA NA NA
701364.26 NA NA NA
701364.29 NA NA NA
701365.1 FALSE 2.354772e-01 3.44444444
701365.3 FALSE -3.000000e-01 3.44444444
701365.4 FALSE 3.585301e-02 3.44444444
701365.5 NA NA NA
701365.6 NA NA NA
701365.7 NA NA NA
701365.8 FALSE -2.321651e-02 -3.55555556
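One possible direction, sketched on toy data, is to look up the row positions once with match() and then index by integer. This is not how mlogit does it. It is not benchmarked on the real data, so any speed-up is an assumption. As far as I understand, character indexing of a data.frame partially matches row names while match() is exact, so the two can differ in edge cases. The column names and values below are invented.

```r
# Toy stand-in for mf; column names and values are invented for illustration
mf <- data.frame(
  chosen = c(FALSE, TRUE, FALSE),
  x      = c(0.1, 0.2, 0.3),
  row.names = c("1.1", "1.3", "2.1")
)
all.rn <- c("1.1", "1.2", "1.3", "2.1", "2.2", "2.3")

# Exact lookup of each id's position among the existing row names (NA if absent)
idx <- match(all.rn, rownames(mf))

# Integer indexing; NA positions yield all-NA rows, as in the original routine
mf.bal <- mf[idx, , drop = FALSE]

# Assign the final row names directly (all.rn must contain no duplicates)
rownames(mf.bal) <- all.rn
print(mf.bal)
```

Whether this removes the bottleneck would need to be checked with system.time() on the full dataset.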