
I am busy with a text analytics project on masses of complaints data. One of the issues with the data is that you get multiple synonyms of the same word, e.g. bill, billing, billed, bills, etc. Normally I would create a word frequency list, manually match the obvious ones, and then apply the main word back to the original corpus for every synonym instance, e.g. billing, billed, bills -> bill (as they are all bill related). I have a nifty piece of code that someone on here helped me with; the mapping step is sketched below.
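A minimal sketch of that mapping step (not the original helper code; the synonyms vector and word list are purely illustrative):

synonyms <- c(billing = "bill", billed = "bill", bills = "bill")
words <- c("billing", "refund", "billed", "bill")
# Keep a word as-is unless a mapping to a main word exists
ifelse(words %in% names(synonyms), synonyms[words], words)
# [1] "bill"   "refund" "bill"   "bill"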

Recently I have been playing around with the idea of using a string distance algorithm to make my life easier by identifying possible synonyms. I am using the stringdist package, but I am at a loss as to how to efficiently implement the test. Basically I need a matrix of all words, with the result of the stringdist function at each intersection.

I use the stringdist function as follows:

library(stringdist)
# Jaro-Winkler distance (method = 'jw'), converted to a similarity score
1 - stringdist('MARTHA', 'MATHRA', method = 'jw', p = 0.1)

This gives a similarity score of 0.955.

So, from a word list of a, b, c, I want to get to (values purely indicative):

   a    b    c
a  1    0.4  0.4
b  0.4  1    0.4
c  0.4  0.4  1

Where the intersection is the result of the stringdist function.

Alternatively I can also work with:

a  a  1
a  b  0.4
a  c  0.4
b  a  0.4
b  b  1
b  c  0.4
c  a  0.4
c  b  0.4
c  c  1

The only problem with the latter is the duplicates, e.g. (a, b) and (b, a), which could be eliminated as they yield the same result.

So, clever R coders, please help me. I guess the answer is somewhere in matrix functions, but I am not a good enough R coder.

Cheers

  • Is it possible to order the words, and then make your table of pairs with the rule that the item in column 2 cannot be smaller/lower than column 1? – user2627717 Dec 11 '14 at 03:24
  • I doubt `stringdist` is vectorized so you are doomed to a slow loop. Assuming you have `n` words and that making `n*(n-1)/2` calls to the function is too slow, then you'll have to get creative in trying to reduce your problem size. For example, only work on sub-groups of words that start with the same letter. – flodel Dec 11 '14 at 03:35
  • If you want to use package `stringdist` then why not use `stringdistmatrix(...)`? (Sketched after these comments.) Also, if you are comfortable with Levenshtein distances, you could just use `adist(...)` in base R. – jlhoward Dec 11 '14 at 04:17
  • Might you identify some smallish number of key words and then use stringdistmatrix() to find all words within a minimum distance from one of those key words, e.g. key <- c("bill", "invoice", "statement", "charge")? That would shorten processing time. – lawyeR Dec 11 '14 at 11:14
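A minimal sketch of the stringdistmatrix(...) route suggested in the comments above (the word list is purely illustrative):

library(stringdist)

words <- c("bill", "billing", "billed")
# Pairwise Jaro-Winkler distances between all words, converted to similarities
sim <- 1 - stringdistmatrix(words, words, method = "jw", p = 0.1)
dimnames(sim) <- list(words, words)
round(sim, 2)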

2 Answers


To remove the duplicates as described above:

# Sorting each row makes the (a, b) and (b, a) rows identical
# (the similarity value is the same either way), so duplicated()
# can then flag and drop the mirrored pairs
dist.mat.tab.sort <- t(apply(dist.mat.tab, 1, sort))
dist.mat.tab <- dist.mat.tab[!duplicated(dist.mat.tab.sort), ]

where dist.mat.tab is the melted distance matrix.
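For completeness, a sketch of one way dist.mat.tab could be built in the first place, assuming reshape2 for the melting (the word list is illustrative):

library(stringdist)
library(reshape2)

words <- c("bill", "billing", "billed")
# Full similarity matrix with the words as dimnames
sim.mat <- 1 - stringdistmatrix(words, words, method = "jw", p = 0.1)
dimnames(sim.mat) <- list(words, words)
# Melt into the long word1 / word2 / similarity format
dist.mat.tab <- melt(sim.mat, varnames = c("word1", "word2"), value.name = "sim")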


I suggest you use a stemmer; you will find one in the tm package. If a distance measure is required, you could use cosine similarity rather than Jaro-Winkler. A sketch follows.
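A minimal sketch of the stemming route (tm's stemDocument() builds on SnowballC's wordStem; the word list is illustrative):

library(SnowballC)  # the stemmer underneath tm's stemDocument()

words <- c("bill", "billing", "billed", "bills")
wordStem(words, language = "english")
# [1] "bill" "bill" "bill" "bill"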