I am trying to find the occurrences of ~10,000 different locations in a list of emails. What I need are three vectors: one with the most frequently mentioned location per email, one with the second most frequent, and one with the third.
Since my dataset is huge, I have problems with the performance. I tried it with stringi and the parallel package, but it still runs very slowly (about 15 min for 20,000 emails and 10,000 locations). The input data (emails and cities) looks like this:
SearchVector = c('Berlin', 'Amsterdam', 'San Francisco', 'Los Angeles') ...
g$Message = c('This is the first mail from paris. Berlin is a nice place', 'This is the 2nd mail from San francisco. Beirut is a nice place to stay', 'This is the 3rd mail. Los Angeles is a great place') ...
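To make this reproducible, here is a tiny self-contained version of those inputs (countryList2 is the object my real code reads the ~10,000 locations from):
SearchVector = c('Berlin', 'Amsterdam', 'San Francisco', 'Los Angeles')
countryList2 = SearchVector   # stand-in for my real list of ~10,000 locations
g = data.frame(
  Message = c('This is the first mail from paris. Berlin is a nice place',
              'This is the 2nd mail from San francisco. Beirut is a nice place to stay',
              'This is the 3rd mail. Los Angeles is a great place'),
  stringsAsFactors = FALSE
)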
Here is my code using stringi:
# libraries
library(doParallel)
library(stringi)
# register a parallel backend on 7 of the available cores
detectCores()
registerDoParallel(cores = 7)
getDoParWorkers()
# count how often `keyword` occurs in `data`; a keyword only counts when it
# sits at the start of the string, at the end, or between two spaces
getCount <- function(data, keyword)
{
  keyword2 = paste0("^(", keyword, ")|(", keyword, ")$|[ ](", keyword, ")[ ]")
  wcount = stri_count(data, regex = keyword2)
  return(data.frame(wcount))
}
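To show what this returns: with a single message and the whole search vector, stri_count recycles the one string over all patterns, so you get one count per location (here, for the first sample mail, only 'Berlin' matches, since it is surrounded by spaces):
getCount(g$Message[1], SearchVector)
#   wcount
# 1      1
# 2      0
# 3      0
# 4      0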
SearchVector = as.vector(countryList2)   # the ~10,000 location names
Text = g$Message
# for each email: count every location, then keep the two most frequent ones;
# .inorder = TRUE keeps the result rows aligned with g$Message (note that
# .errorhandling = 'remove' can still silently drop rows of failed iterations)
result = foreach(i = Text, .combine = rbind, .inorder = TRUE,
                 .packages = c('stringi'), .errorhandling = c('remove')) %dopar%
{
  cities = as.data.frame(t(getCount(i, SearchVector)))
  colnames(cities) = SearchVector
  counts = sort(unlist(cities), decreasing = TRUE)  # named vector, largest count first
  nHits = sum(counts > 0)                           # locations mentioned at least once
  if (nHits == 0) {
    cityName1 = NA
    cityName2 = NA
  } else if (nHits == 1) {
    cityName1 = names(counts)[1]
    cityName2 = NA
  } else {
    cityName1 = names(counts)[1]
    cityName2 = names(counts)[2]
  }
  data.frame(cityName1, cityName2)
}
g$cityName1 = result[, 1]
g$cityName2 = result[, 2]
Any ideas how I can speed this up, for instance by using some kind of index or similar? I really look forward to getting help on this issue.
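One direction I have been wondering about myself, but have not benchmarked, is flipping the loops: one vectorised stri_count call per location across all messages at once, then picking the top two names per row (countMatrix and topTwo are just placeholder names):
# one stri_count call per location, vectorised over ALL messages at once;
# result: one row per email, one column per location
countMatrix = sapply(SearchVector, function(kw) {
  pattern = paste0("^(", kw, ")|(", kw, ")$|[ ](", kw, ")[ ]")
  stri_count(Text, regex = pattern)
})
# per row, the names of the two largest counts (NA where the count is 0)
topTwo = t(apply(countMatrix, 1, function(x) {
  ord = order(x, decreasing = TRUE)
  c(if (x[ord[1]] > 0) SearchVector[ord[1]] else NA,
    if (x[ord[2]] > 0) SearchVector[ord[2]] else NA)
}))
g$cityName1 = topTwo[, 1]
g$cityName2 = topTwo[, 2]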
Many thanks, Clemens