
I have a large dataset containing a mix of words and short phrases, such as:

dataset = [
    "car",
    "red-car",
    "lorry",
    "broken lorry",
    "truck owner",
    "train",
    ...
]

I am trying to find a way to determine the most similar dataset entry for a given short sentence, such as:

input = "I love my car that is red"   # should map to "red-car"
input = "I purchased a new lorry"     # should map to "lorry"
input = "I hate my redcar"            # should map to "red-car"
input = "I will use my truck"         # should map to "truck owner"
input = "Look at that yellow lorri"   # should map to "lorry"

I have tried a number of approaches to this, to no avail, including:

Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input against each individual vectorized item from the dataset.

The problem is, this only really works if the input contains the exact word(s) that are in the dataset. For example, when input = "trai", the cosine similarity is 0, whereas I am trying to get it to map to "train" in the dataset.
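For reference, a minimal sketch of the approach described above (variable names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]

# Fit on the dataset, then transform both sides into TF-IDF vectors
vectorizer = TfidfVectorizer()
dataset_vectors = vectorizer.fit_transform(dataset)
query_vector = vectorizer.transform(["I love my car that is red"])

# Cosine similarity of the query against every dataset item
similarities = cosine_similarity(query_vector, dataset_vectors)[0]
print(dataset[similarities.argmax()])  # "red-car", but only because the query
                                       # shares the exact tokens "red" and "car"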

The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result even when the words are slightly different, e.g.:

input = "broke"    # should map to "broken lorry" given the above dataset

If someone could suggest other potential approaches I could try, that would be much appreciated.

  • You might want to consider the Levenshtein distances between pairs of words, since it seems you want to be able to predict a match even given an incorrectly spelled input – axolotl Jun 20 '18 at 15:40
  • On a similar note, use of the `nltk` package should allow you to find the stem words (e.g. broken, broke, break could all be mapped to a single stem word). – Tom Dalton Jun 20 '18 at 15:54
  • @axolotl I have tried the Levenshtein distance, which I should have mentioned, but it doesn't seem to be a valid option, as in some cases the `dataset` may contain a longer phrase such as `the red car`, whereas the `input` may just be a single word such as `red`; given the nature of the Levenshtein distance, it's unlikely they'll ever be accurately mapped (especially because it's such a huge dataset). –  Jun 20 '18 at 16:10

2 Answers


As @axolotl has suggested in the comments, one idea is to use a different distance/similarity function. Possible candidates (both sketched below) include

  • Levenshtein distance (measures the number of single-character edits needed to transform one string into the other)
  • N-gram similarity (measures the number of n-grams shared between the two strings)
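A minimal sketch of both, using only the standard library. Note that difflib's ratio is an edit-based similarity closely related to Levenshtein distance rather than Levenshtein itself, and the n-gram helper is my own illustration:

import difflib

def ngrams(s, n=3):
    """Return the set of character n-grams in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity over shared character n-grams."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def edit_similarity(a, b):
    """Edit-based similarity in [0, 1] via difflib (stdlib)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
query = "trai"
print(max(dataset, key=lambda item: edit_similarity(query, item)))   # train
print(max(dataset, key=lambda item: ngram_similarity(query, item)))  # train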

Another possibility is feature generation, i.e. enhancing the items in your dataset with additional strings. These could be n-grams, stems, or whatever suits your needs. For example, you could (automatically) expand red-car into

red-car red car
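A minimal sketch of such automatic augmentation, assuming nltk is installed (the helper name and the choice of PorterStemmer are illustrative):

import re
from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()

def augment(item):
    """Expand an item with its individual tokens and their stems."""
    tokens = re.split(r"[\s-]+", item.lower())
    stems = [stemmer.stem(t) for t in tokens]
    return " ".join([item] + tokens + stems)

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
augmented = [augment(item) for item in dataset]
print(augmented[1])  # "red-car red car red car"
print(augmented[3])  # "broken lorry broken lorry broken lorri"

Combined with the TF-IDF approach from the question, an input token like lorri then matches the lorri stem directly.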
  • 1. I did briefly test out the Levenshtein distance, but I am a bit confused as to how that could accurately map, because say you have a dataset with only three items `dataset = ['fly', 'red car', 'train']` and `input = 'car'`, I would want it to map to `'red car'` as that's the most similar, but it would instead map to `fly` as that only needs three changes. 2. I haven't looked at N-gram similarity, I'll take a look now. 3. Editing the dataset isn't an option unfortunately. –  Jun 20 '18 at 16:47
  • For feature generation you don't need to edit the dataset. You load the dataset and then augment it (automatically) using code (for example by adding stems). You then use the augmented dataset as input for your learning algorithm. – Florian Brucker Jun 20 '18 at 16:51

Paragraph vectors (doc2vec) should solve your problem, provided you have a large enough and suitable dataset. Of course, you'll have to do a lot of tuning to get your results right. You could try gensim or deeplearning4j. But you may still need other methods to handle spelling mistakes.
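A minimal sketch with gensim, assuming gensim 4.x (the hyperparameters are placeholders that need real tuning, and the toy dataset here is far too small to produce meaningful vectors):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]

# Tag each item with its index so results can be mapped back to the dataset
documents = [TaggedDocument(words=item.lower().split(), tags=[i])
             for i, item in enumerate(dataset)]

# Placeholder hyperparameters; real data needs proper tuning
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=100)

inferred = model.infer_vector("i purchased a new lorry".split())
best_tag, score = model.dv.most_similar([inferred], topn=1)[0]  # gensim 4.x API
print(dataset[best_tag], score)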