Efficiently deduplicate text matching

Question

struct Text {
    words: Vec<String>,
    ...
}

struct Input {
    words: Vec<String>,
    ...
}

I have a text processing application with multiple steps.

During one of the steps, I run Jaro-Winkler between each word of a Text and each word of the Input, pick the best-matching Text word for each Input word, and take the average of those best scores. I use this average in calculating the final result. This is the naive approach.

The list of Text objects now has over 120k entries, and there are a lot of duplicate words (300k in total vs. 60k unique).

As a result, I am spending a lot of time on this step. One way to speed it up might be to compute Jaro-Winkler between the unique words and all input words up front, and then look those scores up during the step. But that is bad in terms of memory. What if the input had 1000 words? I would have to keep 1000 x 60k = 60 million values in memory. Right now I am not holding anything in memory, but I am paying for it in CPU time.

Is there a more efficient way to do this?
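
One possible middle ground, sketched here under assumptions rather than offered as a tested solution: intern the text words into dense ids once, then process the input one word at a time. Each input word needs a single row of scores against the unique words, and that row can be folded into a per-text running sum and then overwritten. The names `intern` and `average_best_scores` are hypothetical, the "average of best scores per input word" reading of the step is an assumption, and `sim` stands in for whatever Jaro-Winkler function is in use (for example `strsim::jaro_winkler`, which has this `(&str, &str) -> f64` shape).

```rust
use std::collections::HashMap;

/// Give every distinct word a dense id; each text becomes a list of ids.
/// Returns (unique words, per-text id lists).
fn intern(texts: &[Vec<String>]) -> (Vec<String>, Vec<Vec<usize>>) {
    let mut ids: HashMap<&str, usize> = HashMap::new();
    let mut unique: Vec<String> = Vec::new();
    let mut text_ids = Vec::with_capacity(texts.len());
    for words in texts {
        let mut row = Vec::with_capacity(words.len());
        for w in words {
            let next = unique.len();
            let id = *ids.entry(w.as_str()).or_insert(next);
            if id == next {
                // First time this word is seen.
                unique.push(w.clone());
            }
            row.push(id);
        }
        text_ids.push(row);
    }
    (unique, text_ids)
}

/// For each text: the average, over input words, of the best score
/// between that input word and any word of the text.
/// `sim` is an assumed similarity function returning values in [0, 1].
fn average_best_scores<F>(texts: &[Vec<String>], input: &[String], sim: F) -> Vec<f64>
where
    F: Fn(&str, &str) -> f64,
{
    let (unique, text_ids) = intern(texts);
    let mut sums = vec![0.0_f64; texts.len()];
    if input.is_empty() {
        return sums; // nothing to average
    }
    // One reusable row: the scores of the current input word
    // against every unique text word.
    let mut row = vec![0.0_f64; unique.len()];
    for iw in input {
        for (slot, uw) in row.iter_mut().zip(&unique) {
            *slot = sim(iw.as_str(), uw.as_str());
        }
        // Best match per text is now a max over cheap array lookups.
        for (sum, ids) in sums.iter_mut().zip(&text_ids) {
            let best = ids.iter().map(|&id| row[id]).fold(0.0, f64::max);
            *sum += best;
        }
    }
    let n = input.len() as f64;
    sums.iter().map(|s| s / n).collect()
}
```

With the figures from the question, this would make roughly 1000 x 60k similarity calls instead of 1000 x 300k, while the extra memory is one row of 60k scores (about 480 KB as `f64`) plus one running sum per Text. Duplicate input words could be handled the same way, by scoring each distinct input word once and weighting it by its count.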

If you want to find the best-matching word from `text` for each word in `input`, you don't need to keep the JW scores for all combinations. You just need the best score for each input word. If there are duplicates in the input, you just look up that score. (This is called memoization.) Also, what does your input look like? Do you have many perfect matches with a JW similarity of 1? Then perform a dictionary lookup first and skip the JW calculation if the word is present. — M Oehm, Oct 10 '19 at 19:43
But overall best matching `text` word for a given `input` word is not the same as best matching word out of one `text` object's words right? — Gurwinder Singh, Oct 11 '19 at 03:17
I don't know, but I may have misunderstood what you want. What do you mean with "overall"? Do you have several text lists? As I understood it, you want to find the best match to the words in text for each word in the input, then sum the JW scores. Perhaps you could expand on your use case with an example? — M Oehm, Oct 11 '19 at 05:53
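
The exact-match shortcut suggested in the first comment could look roughly like the sketch below, applied per Text object. It assumes the similarity function returns 1.0 for identical strings and never more than 1.0, which holds for Jaro-Winkler. `best_score` is a hypothetical helper, and `sim` again stands in for the real Jaro-Winkler function.

```rust
use std::collections::HashSet;

/// Best score of `input_word` against the words of one text.
/// `words` is that text's word set; `sim` is an assumed similarity in [0, 1].
fn best_score<F>(words: &HashSet<String>, input_word: &str, sim: F) -> f64
where
    F: Fn(&str, &str) -> f64,
{
    // An identical word already has the maximum score,
    // so the comparisons can be skipped entirely.
    if words.contains(input_word) {
        return 1.0;
    }
    words
        .iter()
        .map(|w| sim(input_word, w.as_str()))
        .fold(0.0, f64::max)
}
```

How much this saves depends on how often input words appear verbatim in the texts, which the question does not say.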
