struct Text {
    words: Vec<String>,
    // ...
}

struct Input {
    words: Vec<String>,
    // ...
}
I have a text-processing application with multiple steps. In one of the steps, I compute the Jaro-Winkler similarity between each word of the Text and each word of the Input, pick the best-matching Text word for each Input word, and take the average of those scores. That average feeds into the final result. This is a naive approach.

The list of Text objects has grown to over 120k, and it contains many duplicate words (about 300k total vs. 60k unique), so I am now spending a lot of time on this step. One way to improve the running time would be to compute Jaro-Winkler between the unique words and all input words first, and then reuse those scores in the step. But that is bad in terms of memory: if the input had 1000 words, I would have to keep 1000 × 60k values in memory. Right now I am not holding anything in memory, but I am paying for it in CPU.
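For what it's worth, deduplicating does not require keeping the full 1000 × 60k matrix: if each input word is processed independently against the unique word list, only the running best score per input word needs to survive an iteration, so peak memory stays O(unique words). A minimal sketch of that idea, with `average_best_score` as a hypothetical function name and a stub `jaro_winkler` standing in for a real implementation (e.g. the `strsim` crate's):

```rust
use std::collections::HashSet;

// Stub for illustration only; in practice use e.g. strsim::jaro_winkler.
fn jaro_winkler(a: &str, b: &str) -> f64 {
    if a == b { 1.0 } else { 0.0 }
}

// For each input word, find the best score against the deduplicated
// text words, then average those best scores.
fn average_best_score(text_words: &[String], input_words: &[String]) -> f64 {
    // Dedupe the ~300k text words down to ~60k unique ones,
    // so each input word is compared ~5x fewer times.
    let unique: Vec<&str> = text_words
        .iter()
        .map(|s| s.as_str())
        .collect::<HashSet<&str>>()
        .into_iter()
        .collect();

    let mut total = 0.0;
    for input in input_words {
        // Only the best score is kept per input word; nothing is cached
        // across iterations, so memory is O(unique), not O(inputs x unique).
        let best = unique
            .iter()
            .map(|w| jaro_winkler(input, w))
            .fold(0.0_f64, f64::max);
        total += best;
    }
    total / input_words.len() as f64
}
```

This keeps the CPU saving from deduplication (roughly 5x here) without the matrix; whether a further cross-call cache pays off depends on how often the same input words recur.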
Is there a more efficient way to do this?