Imputing Missing Values Using values from a different Row in the same Column in R

Question

I have the above table. I would like to fill in the missing values under Transaction ID. The algorithm for filling this would be as follows:

User ID "kenn1" has two missing Transaction IDs, and this can be filled using the other two Transaction IDs t1 and t4.
To choose which one to use between t1 and t4, I look at the Event Time. The first missing value happens at 9:30, and it is 30 minutes away from t1 and 20 minutes away from t4. Since t4 is closer to this missing value it would be filled as t4. Similarly for the missing value in row 4, it is 45 minutes away from t1 and 5 minutes away from t4. It would therefore be replaced with t4.
Similar approach for missing values for User ID "kenn2"

How do I do this in R?

score 0 · Answer 1 · answered May 20 '17 at 14:43

Probably there is a better solution, but I wrote this solution with data.table:

library(data.table)
#Create Data Table, You can read.csv or read.xlsx etc
raw <- data.table(Event = paste0("e", 1:10),
                TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                UserId = c(rep("kenn1",4), rep("kenn2",6)),
                EventTime = as.POSIXct(
                  c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                    "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
                )

transactionTimes <- raw[!is.na(TransactionID), .(TransactionID, EventTime)]
raw[, Above := na.locf(TransactionID, na.rm = F), UserId]
raw[, Below := na.locf(TransactionID, na.rm = F, fromLast = T), UserId]
raw <- merge(raw, transactionTimes[, .(Above = TransactionID, AboveTime = EventTime)], by="Above", all.x = T)
raw <- merge(raw, transactionTimes[, .(Below = TransactionID, BelowTime = EventTime)], by="Below", all.x = T)
raw[, AboveDiff := EventTime - AboveTime]
raw[, BelowDiff := BelowTime - EventTime]
raw[is.na(TransactionID) & is.na(AboveDiff), TransactionID := Below]
raw[is.na(TransactionID) & is.na(BelowDiff), TransactionID := Above]
raw[is.na(TransactionID), TransactionID := ifelse(AboveDiff <= BelowDiff, Above, Below)]
raw <- raw[, .(Event, TransactionID, UserId, EventTime)]
rm(transactionTimes)

score 0 · Answer 2 · edited May 23 '17 at 12:34

Another solution with data.table.

library(data.table)
#Create Data Table, You can read.csv or read.xlsx etc
raw <- data.table(Event = paste0("e", 1:10),
                  TransactionID = c("t1",NA,NA,"t4",NA,"t5","t6",NA,NA,"t8"),
                  UserId = c(rep("kenn1",4), rep("kenn2",6)),
                  EventTime = as.POSIXct(
                    c("2017-05-20 9:00", "2017-05-20 9:30", "2017-05-20 9:45", "2017-05-20 9:50", "2017-05-20 10:01",
                      "2017-05-20 10:02", "2017-05-20 10:03","2017-05-20 10:04","2017-05-20 10:05","2017-05-20 10:06")
                    , format="%Y-%m-%d %H:%M")
)

#subset a rows without duplicates
raw_notNA <- raw[!is.na(TransactionID)] 
# merge the subset data with original (this will duplicate rows of originals with candiate rows)
merged <- merge(raw, raw_notNA, all.x = T, by = "UserId", allow.cartesian=TRUE) 
# calcuate time difference between original and candiate rows
merged[, DiffTime := abs(EventTime.x - EventTime.y)]
# create new Transaction IDs from the closest event 
merged[, NewTransactionID := TransactionID.y[DiffTime == min(DiffTime)], by = Event.x]
# remove the duplicaetd rows, and delete unnecesary columns
output <- merged[, .SD[1], by = Event.x][, list(Event.x, NewTransactionID, UserId, EventTime.x)]

names(output) <- names(raw)
print(output)

Inspired by answers to this question (your question is not a duplicate, just similar)

R - merge dataframes on matching A, B and *closest* C?

Hi, Thank you for the answer. I tried replicating this logic in my parent dataset, and I am getting an error: "Cannot allocate Vector of size 198MB". Could there be a workaround for this? Regards — Kenneth Singh, May 22 '17 at 04:55
In which line, you are getting that problem? Probably, it is because of cartesian merge. If it is, you can try the other solution that I proposed, since it doesn't use cartesian merge. — Sabri Karagönen, May 22 '17 at 06:51
I fixed this error by removing all my open programs. However there is an issue here - there are some Transaction IDs which are still blank. These look like the cases in which there is only one Transaction ID value, i.e. there is no case of comparing the time difference since there is only one Transaction ID for the User ID. How do I fix this? — Kenneth Singh, May 22 '17 at 12:11

Imputing Missing Values Using values from a different Row in the same Column in R

2 Answers2