This seems like a simple problem, but I couldn't find a solution: how do I replace every element of a data frame column that is not contained in a given vector with a specific string?

My dataframe looks like this:

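ID <- sample(1:8)
Country <- c("USA", "RUS", "Unknown", "Not specified", "???", "XXX", "FRA", "ITA")
# stringsAsFactors = FALSE keeps Country as character (the default only from R 4.0.0 on)
myDF <- data.frame(ID, Country, stringsAsFactors = FALSE)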

I also have a vector that contains all possible country codes:

countryCodes <- c("ESP", "FRA", "ITA", "GBR", "DEU", "USA", "RUS", "BRA", "KOR", "BLZ", "BLR", "BEL", "TWN", "CHN")

I would like to replace all elements in myDF$Country not contained in countryCodes with "N/D".

The dataset I'm working with has around 30 million rows and I have to perform several transformations, so I'd like to keep the code simple and as quick as possible.

Thanks in advance!

guillem

1 Answer

I'd use the data.table package for that data size and operation:

library(data.table)
setDT(myDF)             # convert to data.table
myDF[!J(countryCodes), on = "Country", Country := "N/D"]
setDF(myDF)             # ..optional, to convert back to data.frame

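This performs a not-join: the ! selects the rows of myDF whose Country has no match in countryCodes, and := then overwrites Country for those rows by reference, so the table is not copied. Both the join and the in-place update are efficient at this data size.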
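
For comparison, the same replacement can be sketched in base R with %in% and logical indexing. This is only a minimal sketch, not a benchmarked alternative: it assumes Country is a character column (not a factor), uses a fixed ID column instead of sample() so the result is reproducible, and the helper name unknown is made up for illustration. Unlike the data.table update by reference, base R may copy the column when assigning.

```r
# Small example mirroring the question's data (ID fixed for reproducibility)
myDF <- data.frame(
  ID = 1:8,
  Country = c("USA", "RUS", "Unknown", "Not specified", "???", "XXX", "FRA", "ITA"),
  stringsAsFactors = FALSE  # assumption: Country is character, not factor
)
countryCodes <- c("ESP", "FRA", "ITA", "GBR", "DEU", "USA", "RUS", "BRA",
                  "KOR", "BLZ", "BLR", "BEL", "TWN", "CHN")

# Logical mask: TRUE where Country is not one of the known codes
# ("unknown" is a hypothetical helper name)
unknown <- !(myDF$Country %in% countryCodes)

# Overwrite only the flagged entries
myDF$Country[unknown] <- "N/D"
```

If Country were a factor, "N/D" would first have to be added as a level (or the column converted with as.character) before this assignment would work.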

talat