Pulling information from the first approximate match of a text string in R (and summing the total number of matches)

Question

I'm having trouble summing approximate matches of text strings, as well as pulling information from the string that was matched first in time.

I have data that look like this:

text<-c("THEN it goes West","AT it falls East","it goes West", "it falls East", "AT it goes West")
date<-c(2008,2009,2003,2006,2011)
ID<-c(1,2,3,4,5)
data<-cbind(text,date,ID)
data<-as.data.frame(data)

Notice that the latest text strings have all-caps "THEN" and "AT" added to the earlier text strings.

I would like a table that looks like this:

     ID  Sum Originaltext     Originaldate
[1,] "4" "3" "it goes West"      "2003"      
[2,] "2" "2" "it falls East"     "2006"

This includes:

The ID number corresponding with the text with the earliest date (the "original" text that the others were derived from). Sums of all approximate matches for each. The text corresponding with the earliest date. And the date of the text corresponding with the earliest date.

I have tens of millions of cases, so I'm having trouble automating the process.

I run Windows 7, and have access to fast-computing servers.

IDEAS

#order them backwards in time
data<-data[order(data$date, decreasing = TRUE),]

#find the strings with the latest date

pattern<-"AT|THEN"

k <- vector("list", length(data$text))

 for (j in 1:length(data$text)){
     k[[j]]<- grep(pattern,data$text[[j]], ignore.case=FALSE)
}

k<-subset(data$text, k==1)

k<-unique(k)

#this is a problem, because case nos. 1 and 5 are still in the dataset, but they derive from the same tweet.

From here, I can use "agrep", but I'm not sure in what context. Any help would be greatly appreciated!

NOTE: While the three answers below do answer my question the way I originally asked it, I have not mentioned that my text cases do vary even without the words "AT" and "THEN". In fact, most of them do not match exactly. I should have put this in the original question. However, I would still love an answer.

Thanks!

mnel · Accepted Answer · 2012-10-16T00:19:01.957

4

A data.table solution avoiding stringr. I am sure this could be improved

Dealing with text data

# make the factor columns character
.data <- lapply(data, function(x) if(is.factor(x)) {as.character(x)} else { x})
library(data.table)
DT <- as.data.table(.data)


DT[, original_text := text]
# using `%like% which is an easy data.table wrapper for grepl
DT[text %like% "^THEN", text := substr(text, 6, nchar(text))]
DT[text %like% "^AT", text :=  substr(text, 4, nchar(text))]

# or avoiding the two vector scans and replacing in one fell swoop
DT[,text := gsub('(^THEN )|(^AT )', '', text)]

DT[, c(sum=.N, .SD[which.min(date)]) ,by=text]

using factor levels (could be faster)

# assuming that text is a factor
DTF <- as.data.table(data) 
DTF[, original_text := text]
levels_text <- DTF[, levels(text)]
new_levels <- gsub('(^THEN )|(^AT )', x= levels_text ,'')
# reset the levels
setattr(DTF[['text']], 'levels', new_levels)
# coerce to character and do the same count / min date
DTF[, c(sum=.N, .SD[which.min(date)]) ,by=list(text = as.character(text))]

edited Oct 16 '12 at 00:19

answered Oct 15 '12 at 23:28

mnel

113,303
27
265
254

nice, I try to avoid stringr but sometimes can't help it. – Maiasaura Oct 15 '12 at 23:51
`stringr` is nice that it will coerce factor to character, here it seemed easier to convert to character and use base functionality. (Although perhaps working on the levels of the factor here would be faster) – mnel Oct 15 '12 at 23:57
@mnel, you mean like `%like%`? – GSee Oct 16 '12 at 00:03
This is great, but I failed to mention that even after taking out "AT" and "THEN", most of my cases do not have exact matches of texts. I think this will work up until the last line, where I get the minimum date and sum the matches, because I'm grouping by "text". Will this grouping function be able to pick out fuzzy matches? – R-Enthusiast Oct 19 '12 at 19:51

Tyler Rinker · Answer 2 · 2012-10-15T21:52:11.380

I'm going to give you a base solution but I really think this is a big problem for base and the data.table package is what is needed (but I don't know how to use data.table very well:

dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
    c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]), 
        Original.Date = as.character(x[1, 2]))
}

data.frame(do.call(rbind, lapply(dat2, FUN)), row.names = NULL)

I don't really know how close each text string is so maybe my exact matching is not appropriate but if that's the case use agrep to develop a new variable. Sorry for the lack of annotations but I am pressed for time and I think data.table is more appropriate anyway.

EDIT: I still think that data.table is better and should be out the door but maybe running in parallel is smart. You're on a windows machine so this would work to use multiple cores of a computer:

dat <- data[order(data$date), ]
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
dat$text2 <- Trim(gsub("AT|THEN", "", dat$text))
dat2 <- split(dat, dat$text2)
FUN <- function(x) {
    c(ID = x[1, 3], Sum = nrow(x), Original.Text = as.character(x[1, 1]), 
        Original.Date = as.character(x[1, 2]))
}

library(parallel)
detectCores()  #make sure you have > 1 core

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("FUN", "dat2"), envir=environment())
x <- parLapply(cl, dat2, FUN)
stopCluster(cl)  #stop the cluster
data.frame(do.call(rbind, x), row.names = NULL)

Thanks so much for posting this. I've tried the 'split' function, but unfortunately I did not mention that many of my text cases are actually not exact matches even after taking out "AT" and "THEN", in fact, most of them are only fuzzy matches. Is there a way to use the split function to do fuzzy matching? — R-Enthusiast, Oct 19 '12 at 19:54
btw upvote for answering my question precisely how I asked it. — R-Enthusiast, Oct 19 '12 at 19:55

Maiasaura · Answer 3 · 2012-10-15T23:16:28.103

plyr might be too slow given the number of records you mention, but here is a solution for you:

library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))

result <- ddply(data, .(text), function(x) {
     sum <- nrow(x)
     x <- x[which(x$date==min(x$date)),]
    return(data.frame(id=unique(x$ID), Sum = sum, Originaltext = unique(x$original_text), Originaldate = unique(x$date)))
    })

> result[, -1]
  id Sum  Originaltext Originaldate
1  4   2 it falls East         2006
2  3   3  it goes West         2003

If you have access to a multicore machine (4 or more cores), then here is a HPC solution

library(multicore)
library(stringr)
data$original_text <- data$text
data$text[grep("^THEN", data$text)] <- str_trim(str_sub(data$text[grep("^THEN", data$text)],6))
data$text[grep("^AT", data$text)] <- str_trim(str_sub(data$text[grep("^AT", data$text)],4))

fux <- function(foo) {
     sum <- nrow(x)
     x <- x[which(x$date==min(x$date)),]
    return(data.frame(id=unique(x$ID), Sum = sum, Originaltext = unique(x$original_text), Originaldate = unique(x$date)))
}

x <- split(data, data$text)
result <- mclapply(x, fux, mc.cores = 4, mc.preschedule = TRUE)

Thanks a lot for answering my question. Unfortunately, I did not explain that many of my text cases are not exact matches, so the 'unique' function may not pick up on all the "fuzzy" matches. Is there a way to use the 'unique' function to do fuzzy matching of texts? Btw I gave this an upvote because this answered my question perfectly the way I asked it. — R-Enthusiast, Oct 20 '12 at 22:33

Pulling information from the first approximate match of a text string in R (and summing the total number of matches)

3 Answers3

Dealing with text data

using factor levels (could be faster)