Turning a sentence or a word into a vector is no different from doing so with documents: a sentence is just a short document, and a word is a very short one. From the first link we have the code for mapping a document to a vector:
def makeVector(self, wordString):
    """ @pre: unique(vectorIndex) """
    # Initialise vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    wordList = self.parser.tokenise(wordString)
    wordList = self.parser.removeStopWords(wordList)
    for word in wordList:
        vector[self.vectorKeywordIndex[word]] += 1  # use simple term-count model
    return vector
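Since that method depends on the surrounding class (self.parser and self.vectorKeywordIndex), here is a minimal standalone sketch of the same idea; the whitespace tokenisation and out-of-vocabulary handling are simplified stand-ins for what self.parser does:

```python
def make_vector(word_string, vector_keyword_index):
    """Map a document/sentence/word to a term-count vector."""
    vector = [0] * len(vector_keyword_index)
    for word in word_string.lower().split():   # naive whitespace tokenisation
        if word in vector_keyword_index:       # skip out-of-vocabulary words
            vector[vector_keyword_index[word]] += 1  # simple term-count model
    return vector

index = {"hello": 0, "world": 1, "this": 2, "is": 3, "me": 4, "answer": 5}
make_vector("this is me", index)   # → [0, 0, 1, 1, 1, 0]
make_vector("hello", index)        # → [1, 0, 0, 0, 0, 0]
```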
The same function can be used to map a sentence or a single word to a vector; just pass it to this function. For a single word, wordList ends up holding one value, something like ["word"], and the resulting vector is a unit vector with a 1 in the associated dimension and 0s elsewhere.
Example:

vectorKeywordIndex (representing all words in the vocabulary):

{"hello" : 0, "world" : 1, "this" : 2, "is" : 3, "me" : 4, "answer" : 5}

document "this is me"      : [0, 0, 1, 1, 1, 0]
document "hello answer me" : [1, 0, 0, 0, 1, 1]
word "hello"               : [1, 0, 0, 0, 0, 0]
word "me"                  : [0, 0, 0, 0, 1, 0]
After that, similarity can be assessed through several criteria, such as cosine similarity, using this code (dot and norm come from NumPy):

from numpy import dot
from numpy.linalg import norm

def cosine(vector1, vector2):
    """ related documents j and q are in the concept space,
    compared via: cosine = ( V1 * V2 ) / ( ||V1|| * ||V2|| ) """
    return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))
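As a self-contained check with the example vectors above: the two documents share only "me", so their cosine similarity comes out fairly low.

```python
from numpy import dot
from numpy.linalg import norm

def cosine(vector1, vector2):
    # cosine = (V1 . V2) / (||V1|| * ||V2||)
    return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))

doc1 = [0, 0, 1, 1, 1, 0]  # "this is me"
doc2 = [1, 0, 0, 0, 1, 1]  # "hello answer me"
cosine(doc1, doc2)  # dot = 1, norms = sqrt(3) each, so 1/3 ≈ 0.333
```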
or by using scikit-learn's sklearn.metrics.pairwise.cosine_similarity. Note that it expects 2-D arrays of shape (n_samples, n_features), so wrap each single vector in a list:

from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity([x], [y])  # 1 x 1 matrix; the score is sim[0][0]