Enhance the execution time of data frame column transformation code

Question

I have a data frame having more than a million rows. It has a key column having key values as character. This key column has around 900 distinct values. A number of these values are minor variations of a standard value. Out of these 900 values, approx 175 of them are to be mapped to standard values. The following sample code explains how did I get the mapping done to correct the values. Here "Event 1" value needs to be replaced by "evt 1":

id = c(1:4)
k1 = c("Event 1", "evt 1", "evt 2", "evt 3")
v1 = c(101:104)
df = data.frame(id, k1, v1)
df$k1 = as.character(df$k1)

### map the non-standard values to standard values using named vector approach
mapEvents = c("Event 1" = "evt 1")
vNames = names(mapEvents)

stTime = proc.time()
df$k1 = ifelse(df$k1 %in% vNames, mapEvents[df$k1], df$k1)
proc.time() - stTime

This code works ok BUT with a serious performance issue. The ifelse code takes around 9 minutes to complete on my i7 system.

How I make this mapping to execute in fastest possible way? Appreciate the help very much.

You should investigate the `data.tables` package. There is somewhat of a learning curve, but it speeds up things like this quite a bit. — John Paul, May 16 '14 at 18:48
Based on @John's recommendation, I solved the above problem with data.table and shared my answer below. But I don't know how to indicate that as an answer. All the other approaches were data specific where as data.table approach is more generic and can be used in various scenarios. — kishore, Jul 03 '14 at 02:38

Henrik · Answer 1 · 2014-05-17T19:46:41.170

2

If the 'only' replacement you wish to do is from "Event" to "evt", but for a large number of combinations with different serial numbers, it may more convenient to replace the string only. Without knowing more about your data it's hard to tell though.

library(stringi)
stri_replace_first_fixed(str = k1, pattern = "Event", replacement = "evt")
# [1] "evt 1" "evt 1" "evt 2" "evt 2" "evt 3" "evt 3"

edited May 17 '14 at 19:46

answered May 16 '14 at 22:00

Henrik

65,555
14
143
159

score 0 · Answer 2 · answered May 16 '14 at 21:09

df$k1[df$k1 %in% "Event 1"] <- "evt 1"

If you have multiple values you need to remap:

remap <- list("evt 1" = "Event 1",
              "evt 2" = c('a','b','c'),
              "evt 3" = c('x','y','z'),
              ...etc...)

for(i in seq_along(remap)) {
    w <- which(df$k1 %in% remap[[i]])
    df$k1[w] <- names(remap)[i]
}

score 0 · Accepted Answer · edited May 23 '17 at 11:50

Based on the SO question Update subset of data.table based on join, below is the one answer.

## convert the mapping vector to a data table
  dtMap = data.table(idMap=names(mapEvents), mappedValue=mapEvents)
  setkey(dtMap, idMap)

  ## convert the data frame to a data table
  dt_df = data.table(df)
  setkey(dt_df, k1)

  ## do the update and convert the data table back to data frame
  dt_df[dtMap[dt_df[k1 %in% dtMap$idMap],nomatch=0],c("k1"):=list(i.mappedValue)]
  df = as.data.frame(dt_df)

data.table is just marvelous. Need to explore it and use it more. Thanks.

Enhance the execution time of data frame column transformation code

3 Answers3