I have an orders dataset that contains sales order and sales order line information. Below is a screen shot of the first few columns of data:
Where sales order is the unique but can have multiple sales order line's per sales order. 20% of the data is what we call remakes, which is identified because the sales order number won't match the column for Original Order number. We are trying to build a prediction model to predict whether a model will be returned or not. Unfortunately there's 3 columns (width, height and fabric number not shown) that have NA's for the sales orders that were remakes. I'm trying to impute those NA's with the values of the original order number.
This is the code I have:
for (i in 1:length(hd$SALES_ORDER)){
if (is.na(hd$WIDTH[i]) == TRUE){
hd$WIDTH[i] = hd$WIDTH[hd$ORIGINAL_ORDER[i] == hd$SALES_ORDER][1]
}
}
1 being the first value returned since there could be multiple sales lines. I attempted to match sales order line and original order line but kept getting a 'value length' error.
My data has 3 million+ rows and 400k na's. The for loop is running but it's been running for an hour. I'm curious if there's a more efficient way to accomplish my task?
Thanks