Imputing NA's with millions of rows of data in R

Question

I have an orders dataset that contains sales order and sales order line information. Below is a screen shot of the first few columns of data:

Where sales order is the unique but can have multiple sales order line's per sales order. 20% of the data is what we call remakes, which is identified because the sales order number won't match the column for Original Order number. We are trying to build a prediction model to predict whether a model will be returned or not. Unfortunately there's 3 columns (width, height and fabric number not shown) that have NA's for the sales orders that were remakes. I'm trying to impute those NA's with the values of the original order number.

This is the code I have:

for (i in 1:length(hd$SALES_ORDER)){
  if (is.na(hd$WIDTH[i]) == TRUE){
    hd$WIDTH[i] = hd$WIDTH[hd$ORIGINAL_ORDER[i] == hd$SALES_ORDER][1]
  }
}

1 being the first value returned since there could be multiple sales lines. I attempted to match sales order line and original order line but kept getting a 'value length' error.

My data has 3 million+ rows and 400k na's. The for loop is running but it's been running for an hour. I'm curious if there's a more efficient way to accomplish my task?

Thanks

Can you provide a reprex? Also, what version of R are you using? — C Jeruzal, Feb 17 '21 at 22:40
Tried using reprex but got the error: #> Warning in file(file, "rt"): cannot open file 'Quality Data/Hunter Douglas #> Quality Data.csv': No such file or directory #> Error in file(file, "rt"): cannot open the connection. Rstudio is 1.3.9 — Josh Ortega, Feb 17 '21 at 23:10
Hi Josh. To get that reproducible example try to get a sample of your dataset that includes some of those NA's. `dput(dplyr::sample_n(tbl = your_df, n = 50))` might be a good start. — Nicolás Velasquez, Feb 17 '21 at 23:40
Is your current solution correct but inefficient? Or does the loop just run for a long time and not return anything? — Bill O'Brien, Feb 17 '21 at 23:54
I know it will work because I tried it on one row to impute the NA, but when I run the loop it just runs. R studio doesn't freeze, i think it's just taking a long time to run. I might just have to run the for loops for the 3 columns I need and let it run overnight. I just didn't know if there was a more efficient solution? — Josh Ortega, Feb 17 '21 at 23:59
@NicolásVelásquez I tried running your command. I've never ran dplyr before but it gave me the error Error in sample.int(n(), check_size(~, n(), replace = replace), replace = replace, : invalid 'size' argument — Josh Ortega, Feb 18 '21 at 18:06
@BillO'Brien My solution works for one line but when I try to run it on the whole data set R studio just runs forever. I tried leaving it run over night and it still didn't finish. Not sure if my original code needs an 'else' statement. I'd like the data but I have an NDA in place. — Josh Ortega, Feb 18 '21 at 18:07
Hi Josh. maybe try dput(your_df[1:20, ] ), especially if you have relevant cases within the first 20 observations/rows. — Nicolás Velasquez, Feb 18 '21 at 22:42

Jon Spring · Accepted Answer · 2021-02-22T03:25:15.020

1

This seems unusually slow. Even without any optimization (e.g. using data.table), the approach below only takes a couple seconds to take a 2M row data frame and fill in NAs for 1 million rows from the preceding order with the same ORIGINAL_ORDER.

library(dplyr); library(tidyr)
my_data_million <- data.frame(stringsAsFactors = FALSE, # not necessary for R >4.0.0
                      ORIGINAL_ORDER = rep(1:1000000, 2),
                      SALES_ORDER = 1000000:2999999,
                      WIDTH = c(sample(1:50, 1000000, replace = TRUE), rep(NA, 1000000))
) %>%
slice_sample(n = 2E6, replace = FALSE)   # Shuffling just to show it's still fast


my_data_million %>%
  arrange(ORIGINAL_ORDER, SALES_ORDER) %>%
  group_by(ORIGINAL_ORDER) %>%
  tidyr::fill(WIDTH, .direction = "updown") %>%    #EDIT
  ungroup()

edited Feb 22 '21 at 03:25

answered Feb 18 '21 at 00:53

Jon Spring

55,165
4
35
53

Thanks for the code @JonSpring. I tried running it on my dataset after converting the "NULL" to na's and it ran faster but it didn't change anything. I still have all the NA's. The issue I am having is for your example above, sales order number != original order number and in my data set it does, since it is a mix of remakes and non remakes. When Original_Order = Sales_Order (non-remake) but when Original_Order != Sales_Order (this is a remake). The original order data is apart of the dataset as is the remake, however the remake row is missing values that the original order had. – Josh Ortega Feb 18 '21 at 18:14
I guess I had assumed that your SALES_ORDERS would be chronological so that the non-make would always come first, but perhaps that's not the case. I've added `.direction = "updown"` to the `fill` line above, which should help in cases where that assumption isn't true. Does that help? – Jon Spring Feb 18 '21 at 18:24
Thanks @jon. That did work...apparently there's Null values in the Original Order column where we don't know the OG sales number for the remake – Josh Ortega Feb 18 '21 at 18:55

Imputing NA's with millions of rows of data in R

1 Answers1