How to efficiently balance an unbalanced R dataframe

Question

I'm using a statistical library mlogit which has an inefficient routine to create a balanced data.frame from an unbalanced one. With my particular dataset an intermediate data.frame mf is produced with several 100,000 rows. The problematic line of code is:

mf <- mf[all.rn, ]

all.rn is a character vector used to index the data.frame of the form:

"701364.1"  "701364.3"  "701364.4"  "701364.5"  "701364.6"  "701364.7"  
"701364.8"  "701364.9"  "701364.2"  "701364.12"
"701364.10" "701364.11" "701364.13" "701364.14" "701364.15" "701364.16" 
"701364.17" "701364.18" "701364.19" "701364.20"
"701364.21" "701364.23" "701364.24" "701364.22" "701364.27" "701364.28" 
"701364.30" "701364.37" "701364.38" "701364.39"
"701364.25" "701364.26" "701364.29" "701365.1"  "701365.3"  "701365.4"  
"701365.5"  "701365.6"  "701365.7"  "701365.8" 
"701365.9"  "701365.2"  "701365.12" "701365.10" "701365.11" "701365.13" 
"701365.14" "701365.15" "701365.16" "701365.17"

which consists of two numbers. The first corresponds to a particular event and the second to the choices available for that event which varies between events. Not all choices are available.

The original mf data.frame looks like this:

701364.1           FALSE           1.191801e-02    11.88888889
701364.3           FALSE           2.715409e-01     7.88888889
701364.4           FALSE          -3.202290e-02     4.88888889
701364.5           FALSE          -1.940157e-01    -3.11111111
701364.6           FALSE           5.653818e-02    -4.11111111
701364.7           FALSE           2.081075e-02    -7.11111111
701364.8           FALSE          -1.819507e-01    -8.11111111
701364.9            TRUE          -1.491018e-01   -11.11111111
701365.1           FALSE           2.354772e-01     3.44444444
701365.2            TRUE           1.141553e-01     3.44444444
701365.3           FALSE          -3.000000e-01     3.44444444
701365.4           FALSE           3.585301e-02     3.44444444
701365.8           FALSE          -2.321651e-02    -3.55555556
701367.1           FALSE           2.154056e-01     5.20000000
701367.2           FALSE          -7.043655e-03     2.20000000

while the resulting data.frame after the routine is balanced with missing choices filed in with NAs:

701364.1             FALSE           1.191801e-02    11.88888889
701364.3             FALSE           2.715409e-01     7.88888889
701364.4             FALSE          -3.202290e-02     4.88888889
701364.5             FALSE          -1.940157e-01    -3.11111111
701364.6             FALSE           5.653818e-02    -4.11111111
701364.7             FALSE           2.081075e-02    -7.11111111
701364.8             FALSE          -1.819507e-01    -8.11111111
701364.9              TRUE          -1.491018e-01   -11.11111111
NA                      NA                     NA             NA
NA.1                    NA                     NA             NA
NA.2                    NA                     NA             NA
NA.3                    NA                     NA             NA
NA.4                    NA                     NA             NA
NA.5                    NA                     NA             NA
NA.6                    NA                     NA             NA
NA.7                    NA                     NA             NA
NA.8                    NA                     NA             NA
NA.9                    NA                     NA             NA
NA.10                   NA                     NA             NA
NA.11                   NA                     NA             NA
NA.12                   NA                     NA             NA
NA.13                   NA                     NA             NA
NA.14                   NA                     NA             NA
NA.15                   NA                     NA             NA
NA.16                   NA                     NA             NA
NA.17                   NA                     NA             NA
NA.18                   NA                     NA             NA
NA.19                   NA                     NA             NA
NA.20                   NA                     NA             NA
NA.21                   NA                     NA             NA
NA.22                   NA                     NA             NA
NA.23                   NA                     NA             NA
NA.24                   NA                     NA             NA
701365.1             FALSE           2.354772e-01     3.44444444
701365.3             FALSE          -3.000000e-01     3.44444444
701365.4             FALSE           3.585301e-02     3.44444444
NA.25                   NA                     NA             NA
NA.26                   NA                     NA             NA
NA.27                   NA                     NA             NA
701365.8             FALSE          -2.321651e-02    -3.55555556

before the row.names are changed. The problem that I have is that this routine poses a bottleneck when working with larger datasets. It can take a day to run with the actual model fitting taking a fraction of the time. Is there anyway to speed this up?

After renaming the rows with rownames(mf) <- all.rn the final data.frame would look like this:

701364.1           FALSE           1.191801e-02    11.88888889
701364.3           FALSE           2.715409e-01     7.88888889
701364.4           FALSE          -3.202290e-02     4.88888889
701364.5           FALSE          -1.940157e-01    -3.11111111
701364.6           FALSE           5.653818e-02    -4.11111111
701364.7           FALSE           2.081075e-02    -7.11111111
701364.8           FALSE          -1.819507e-01    -8.11111111
701364.9            TRUE          -1.491018e-01   -11.11111111
701364.2              NA                     NA             NA
701364.12             NA                     NA             NA
701364.10             NA                     NA             NA
701364.11             NA                     NA             NA
701364.13             NA                     NA             NA
701364.14             NA                     NA             NA
701364.15             NA                     NA             NA
701364.16             NA                     NA             NA
701364.17             NA                     NA             NA
701364.18             NA                     NA             NA
701364.19             NA                     NA             NA
701364.20             NA                     NA             NA
701364.21             NA                     NA             NA
701364.23             NA                     NA             NA
701364.24             NA                     NA             NA
701364.22             NA                     NA             NA
701364.27             NA                     NA             NA
701364.28             NA                     NA             NA
701364.30             NA                     NA             NA
701364.37             NA                     NA             NA
701364.38             NA                     NA             NA
701364.39             NA                     NA             NA
701364.25             NA                     NA             NA
701364.26             NA                     NA             NA
701364.29             NA                     NA             NA
701365.1           FALSE           2.354772e-01     3.44444444
701365.3           FALSE          -3.000000e-01     3.44444444
701365.4           FALSE           3.585301e-02     3.44444444
701365.5              NA                     NA             NA
701365.6              NA                     NA             NA
701365.7              NA                     NA             NA
701365.8           FALSE          -2.321651e-02    -3.55555556

I'd like to recreate the same output as the `mf <- mf[all.rn, ]` routine above but in a fraction of the time. So I would balance the original `mf` data.frame by adding in additional rows filled with NAs. However, this takes far too long at the moment. — user2699676, Feb 08 '18 at 15:59
The logic of what `mf <- mf[all.rn, ]` is not clear to me. Could you indicate how adding seemingly arbitrary numbers of `NA` rows balances your data? — CPak, Feb 08 '18 at 16:00
It's used internally within the `mlogit` package in it's calculations. I'm trying to speed up the package for my individual use case because there's a massive bottleneck with the `mf <- mf[all.rn, ]` call. — user2699676, Feb 08 '18 at 16:04
Well the columns are irrelevant apart from the index I guess as they are predictive variables. They will change depending on the model being fit by `mlogit`. In some respects, this is really all about efficiently indexing data.frames in such a manner that missing rows get replaced with NAs. — user2699676, Feb 08 '18 at 16:30
In looking at your problem, I see an inconsistency in `all.rn` vs. `mf`. The value "701365.2" is present in both, but the process eliminates this value along with the data in that line. Is this the behavior that you want to see? The alternative would be to sort `all.rn`, thereby making certain that the data in that line were retained. In addition, "701364.2" is present in `all.rn` but is not processed into a line with `NA`s. This, too, would be taken care of by sorting `all.rn`. — Edward Carney, Feb 08 '18 at 20:51
I believe that "701365.2" is still kept. What I want is for all possible combinations of the first and second numbers making up the index to have rows. Rows which don't exist in the original data.frame should be inserted as a row of NAs as above. The call `mf <- mf[all.rn, ]` does this but is very inefficient. — user2699676, Feb 09 '18 at 18:46

How to efficiently balance an unbalanced R dataframe

0 Answers0