I have a large dataset containing a mix of words and short phrases, such as:
dataset = [
"car",
"red-car",
"lorry",
"broken lorry",
"truck owner",
"train",
...
]
I am trying to find a way to determine the most similar word from a short sentence, such as:
input = "I love my car that is red" # should map to "red-car"
input = "I purchased a new lorry" # should map to "lorry"
input = "I hate my redcar" # should map to "red-car"
input = "I will use my truck" # should map to "truck owner"
input = "Look at that yellow lorri" # should map to "lorry"
I have tried a number of approaches to this with no avail, including:
Vectoring the dataset
and the input
using TfidfVectorizer, then calculating the Cosine similarity of the vectorized input
value against each individual, vectorized item value from the dataset
.
The problem is, this only really works if the input
contains the exact word(s) that are in the dataset - so for example, in the case where the input = "trai"
then it would have a cosine value of 0, whereas I am trying to get it to map to the value "train"
in the dataset.
The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result, even when the words are slightly different, i.e.:
input = "broke" # should map to "broken lorry" given the above dataset
If someone could suggest other potential approach I could try, that would be much appreciated.