
I am trying to find the occurrences of ~10,000 different locations in a list of emails. What I need is one vector with the most frequently mentioned location per email, one with the second most frequent, and one with the third!

Since my dataset is huge, I have problems with the performance. I tried it with stringi and the parallel package, but it still runs very slowly (about 15 min for 20,000 emails and 10,000 locations). The input data (emails and cities) looks like this:

SearchVector = c('Berlin', 'Amsterdam', 'San Francisco', 'Los Angeles') ...
g$Message = c('This is the first mail from paris. Berlin is a nice place', 'This is the 2nd mail from San francisco. Beirut is a nice place to stay', 'This is the 3rd mail. Los Angeles is a great place') ...

Here is my code using stringi:

# libraries
library(doParallel)
library(stringi)

# set up a parallel backend
detectCores()
registerDoParallel(cores = 7)
getDoParWorkers()

# function: count occurrences of each keyword in one email;
# a keyword matches at the start of the text, at the end, or surrounded by spaces
getCount <- function(data, keyword)
{
  keyword2 = paste0("^(", keyword, ")|(", keyword, ")$|[ ](", keyword, ")[ ]")
  wcount <- stri_count(data, regex = keyword2)
  return(data.frame(wcount))
}

SearchVector = as.vector(countryList2)
Text = g$Message

cityName1 = character()
cityName2 = character()

result = foreach(i = Text, .combine = rbind, .inorder = FALSE,
                 .packages = c('stringi'), .errorhandling = c('remove')) %dopar%
{
  # one row of counts, one column per search term, for this email
  cities = as.data.frame(t(getCount(i, SearchVector)))
  colnames(cities) = SearchVector

  if (length(cities[which(cities > 0)]) == 1) {
    cityName1 = names(sort(cities, decreasing = TRUE))[1]
    cityName2 = NA
  } else if (length(cities[which(cities > 0)]) > 1) {
    cityName1 = names(sort(cities, decreasing = TRUE))[1]
    cityName2 = names(sort(cities, decreasing = TRUE))[2]
  } else {
    cityName1 = NA
    cityName2 = NA
  }

  return(data.frame(cityName1, cityName2))
}


g$cityName1 = result[, 1]
g$cityName2 = result[, 2]

Any ideas how I can speed this up by, for instance, using an index or similar? I really look forward to getting help on this issue.

Many thanks Clemens


1 Answer


It's a bit too messy to put in a comment, but give this a shot:

library(data.table)
library(stringr)

# keep the raw text plus a lower-cased copy for case-insensitive matching
dt = data.table(Text = g$Message, cleantext = tolower(g$Message))

# build one big alternation regex from the search terms and extract
# every match per email into a list column
dt[, place := str_extract_all(cleantext,
                              paste0("(", paste(tolower(SearchVector), collapse = ")|("), ")"))]
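
On the three sample messages from the question (with the four-city SearchVector), the place column should come out roughly like this:

dt$place
## [[1]]
## [1] "berlin"
##
## [[2]]
## [1] "san francisco"
##
## [[3]]
## [1] "los angeles"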


data.table is usually lightning quick for things like this, but try it on a subset and see if it's acceptably fast.

The place column will look like a bunch of place names separated by commas, but internally it's a list, so it's easy to do all sorts of aggregation with it, like counting the places in each text or how many times each place is mentioned.

# number of places mentioned in each text
dt[, n := lapply(place, length)]; dt

# how many times each place is mentioned across all texts
nplace = data.table(place = dt[, unlist(place)])[, .N, place]
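
Since the question ultimately wants the three most frequent locations per email, here is a minimal sketch of that aggregation built on the place list column (the top3 helper and the cityName columns are illustrative names, not part of the original answer):

top3 = function(p) {
  # tabulate the places found in one email, most frequent first,
  # then pad with NA so we always return three names
  counts = sort(table(p), decreasing = TRUE)
  c(names(counts), rep(NA_character_, 3))[1:3]
}

tops = t(vapply(dt$place, top3, character(3)))
dt[, `:=`(cityName1 = tops[, 1], cityName2 = tops[, 2], cityName3 = tops[, 3])]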

I also changed all the text to lower case when doing the searching, for good luck (this probably isn't the fastest way to be case insensitive, but it looks the most explicit to me).
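
For comparison, here is a minimal sketch of letting the regex engine handle case instead, via stringr's ignore_case modifier; this is an alternative I'm assuming you might try, not part of the original answer (note the extracted matches then keep their original case):

pattern = regex(paste0("(", paste(SearchVector, collapse = ")|("), ")"),
                ignore_case = TRUE)
dt[, place := str_extract_all(Text, pattern)]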

Akhil Nair
  • I tested the code with 10 emails and it is still running (more than 2 min now). Must be something strange going on... (2500 locations in "searchstring") – Clemens Sep 30 '15 at 09:08
  • Can you add a sample of those 10 emails? I just tried it with the 3 from the question and it ran in a couple of seconds. I can imagine it might not be the appropriate solution if this is actually the case... – Akhil Nair Sep 30 '15 at 09:54
  • Oh, didn't see that. Okay, yeah, that probably is a bit too long for string searching. I would actually take a different approach: do `txt = strsplit(..., " ")` on the text, then use something like `SearchVector[SearchVector %in% txt]` as a function, and call that through data.table. Basically I don't think this has to be parallelised, and `%in%` will be a lot faster than string operations (a sketch combining these suggestions appears after this comment thread). – Akhil Nair Sep 30 '15 at 10:31
  • I tried that before, but it is hard because of locations with 2 or more words (San Francisco etc.). Any ideas using indexes in data.table? – Clemens Sep 30 '15 at 10:40
  • With 2-word places, tokenise them first, i.e. search the place list for `" "` to get all the more-than-1-word places, then loop over the text for each of these places with `gsub` or `str_replace` and replace `"a b"` with `"a@b"`. That'll be quick because it's only doing the string searching for a really small set, and it gets rid of that problem. Not sure what you'd plan to do with indexes. – Akhil Nair Sep 30 '15 at 10:45
  • You could loop over `mwp` with `Text = gsub(mwp[i], gsub(" ", "@", mwp[i]), Text)`, where `mwp` is a vector of place names with a space in them. This has probably gone a bit off-topic though. – Akhil Nair Sep 30 '15 at 11:28
  • `suchVectorTokenized = gsub(" ", "@", suchVector); for (i in 1:length(suchVector)) { Text = gsub(suchVector[i], suchVectorTokenized[i], Text) }` ... is still terribly slow. What can I do to speed it up? Many thanks, Clemens – Clemens Oct 01 '15 at 11:54
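
Putting the suggestions from this comment thread together, here is a minimal sketch of the tokenise-then-`%in%` approach; the variable names (mwp, svTok, places) and the crude punctuation handling are assumptions for illustration, not code from the thread:

sv  = tolower(SearchVector)
txt = gsub("[.,;:!?]", " ", tolower(g$Message))  # crude punctuation strip (an assumption)

# tokenise multi-word places: replace their internal spaces with "@" in both
# the search terms and the texts, so that splitting on spaces keeps them intact
mwp = sv[grepl(" ", sv, fixed = TRUE)]
for (p in mwp) txt = gsub(p, gsub(" ", "@", p, fixed = TRUE), txt, fixed = TRUE)
svTok = gsub(" ", "@", sv, fixed = TRUE)

# split each text into tokens and keep the search terms that occur;
# note this records presence per email, not frequency
tokens = strsplit(txt, "[ ]+")
places = lapply(tokens, function(tk) svTok[svTok %in% tk])

If the gsub loop over mwp is still the bottleneck, stringi's stri_replace_all_fixed(txt, mwp, gsub(" ", "@", mwp, fixed = TRUE), vectorize_all = FALSE) performs all of the replacements in one vectorised call.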