Here is what I am trying to do: when the term I am analyzing is "apples", I would like to know how many transpositions must be applied to "apples" so that the result can be found in a string.

"buy apples now" => 0 transpositions needed (apples is present).

"cheap aples online" => 1 transposition is needed (apples to aples).

"find your ap ple here" => 2 transpositions are needed (apples to ap ple).

"aple" => 2 transpositions are needed (apples to aple).

"bananas" => 5 transpositions are needed (apples to bananas).

The stringdist and adist functions don't work for this because they tell me how many transpositions are needed to transform one whole string into the other, so the extra words in the string inflate the count. Anyway, here is what I have written so far:

# build the data frame
a <- c(rep("apples", 5), rep("bananas", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b)
colnames(d) <- c("term", "string")

# count transpositions needed
d$transpositions <- mapply(adist, d$term, d$string)
print(d)
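
To make the problem concrete, here is a minimal sketch (the variable names `term`, `strings` and `whole` are just for illustration) showing that `adist` on the whole string also counts every surrounding character, so "buy apples now" ends up scoring worse than "aple":

```r
# Whole-string edit distance: every extra character in the string counts.
term <- "apples"
strings <- c("buy apples now", "aple")

# adist() returns a 1 x 2 matrix here; drop() turns it into a plain vector.
whole <- drop(adist(term, strings))
print(whole)
# "buy apples now" needs 8 insertions ("buy " and " now"),
# while "aple" only needs 2 edits, even though the first string contains the term.
```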
Steven Beaupré
Julien Massardier

2 Answers


You need to check for apples first and then count the transpositions:

a <- c(rep("apples", 5), rep("bananas", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# check whether each row's term appears verbatim in its string
# (a hardcoded "apples" pattern would miss the rows whose term is "bananas")
d$found <- unname(mapply(grepl, d$term, d$string, MoreArgs = list(fixed = TRUE)))

# count transpositions needed (0 when the term is already present)
d$transpositions <- ifelse(d$found, 0, mapply(adist, d$term, d$string))
print(d)
infominer
  • Hmm, I just reread your question and will have to rethink my answer. I will post an update once I have worked it out. How do you want to deal with sentences as opposed to one-word transformations? – infominer Apr 03 '15 at 18:33
  • Thanks @infominer! Much appreciated :) grepl is useful. The first step is, indeed, detecting whether the correctly spelled term is present in the string. If it is not found, then I need to isolate the piece of the string that is most similar to my term, and finally calculate the similarity between this piece and the term. Regarding sentences as opposed to "one word", I want to avoid "buy aple now" getting a worse score than "aple" because of the extra words "buy" and "now". What matters is how similar the section "aple" of "buy aple now" is to the term "apple". – Julien Massardier Apr 03 '15 at 22:27

So, here is the dirty solution I came up with so far:

library(stringr)  # provides str_trim()

# create a data.frame
a <- c(rep("apples", 5), rep("banana split", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# Split each string into sequences of consecutive characters whose length
# equals the length of the term on the same row. Calculate the distance of
# each sequence to the term and keep the closest one as the most relevant
# piece of string for that row.

mostrelevantpiece <- character(nrow(d))

for (j in seq_len(nrow(d))) {
  termlen <- nchar(d$term[j])
  # at least one window, even when the string is shorter than the term
  nwindows <- max(nchar(d$string[j]) - termlen + 1, 1)
  pieces <- character(nwindows)
  piecesdist <- numeric(nwindows)
  for (i in seq_len(nwindows)) {
    addpiece <- substr(d$string[j], i, i + termlen - 1)
    pieces[i] <- str_trim(addpiece)
    piecesdist[i] <- adist(addpiece, d$term[j])
  }
  # pick the closest piece once all windows have been scored
  mostrelevantpiece[j] <- pieces[which.min(piecesdist)]
}

# calculate the number of transpositions needed to transform the "most relevant piece of string" into the term

d$transpositionsneeded <- mapply(adist, mostrelevantpiece, d$term)
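
A possibly shorter alternative, offered only as a sketch: base R's `adist` has a `partial` argument that, when `TRUE`, measures how far the term is from its best-matching substring of the string instead of from the whole string. The names `term`, `string` and `partialdist` below are just for illustration, and because the matched substring is not forced to have the same length as the term, some rows may score differently from the loop above.

```r
# Sketch: approximate substring distance with adist(partial = TRUE).
term <- c(rep("apples", 5), rep("banana split", 3))
string <- c("buy apples now", "cheap aples online", "find your ap ple here",
            "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")

# adist() returns a 1 x 1 matrix for each pair; mapply() simplifies the
# results to a vector, and unname() drops the names taken from `term`.
partialdist <- unname(mapply(function(x, y) adist(x, y, partial = TRUE),
                             term, string))

print(data.frame(term, string, partialdist))
```

With this approach the surrounding words no longer count, so "buy apples now" scores 0 and "cheap aples online" scores 1.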
Julien Massardier