I have a column of line texts. From this column I would like to find the entries that are most similar to a list of product names. I was using Doc2Vec to solve the problem, but my results have been pretty bad. What is the right approach for this problem?
My data is as follows:

LINETEXT: pallets 10kg of chicken weldcote metals logistics 100th main, bolulvedour ave 19th main ST john 5670987
and the list of product names I am matching against is mat_subset = ["shoes of UK size 10", "superdry trim", "box of weight 10kgs", "pallets", ...].
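For reproducibility, a minimal stand-in for mat would look something like this (the column name matches my real data; the sample rows themselves are invented):

import pandas as pd

mat = pd.DataFrame({
    'LINETEXT': [
        'pallets 10kg of chicken weldcote metals logistics',
        '100th main, bolulvedour ave 19th main ST john 5670987',
    ]
})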
My line texts are OCR output, and the OCR quality is pretty decent. The Doc2Vec code I used is as follows:
import multiprocessing

import nltk
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# nltk.download('punkt')  # needed once for word_tokenize
s_data = mat['LINETEXT']
line_txt = pd.DataFrame()
line_txt['sentences'] = s_data.astype(str)
# tokenize each OCR line into a list of words
line_txt['tokenized_sents'] = line_txt.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

sentences = []
for item_no, line in enumerate(line_txt['tokenized_sents'].values.tolist()):
    # LabeledSentence is deprecated; TaggedDocument is its current replacement
    sentences.append(TaggedDocument(line, [item_no]))
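# With the invented sample rows above, sentences[0] would look something like:
# TaggedDocument(words=['pallets', '10kg', 'of', 'chicken', ...], tags=[0])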
# MODEL PARAMETERS
dm = 1                      # 1 for distributed memory (default); 0 for DBOW
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = Doc2Vec(documents=sentences,
                dm=dm,
                alpha=alpha,               # initial learning rate
                seed=seed,
                min_count=min_count,       # ignore words that appear fewer than min_count times
                max_vocab_size=None,       # no cap on vocabulary size
                window=context_window,     # words before and after the focus word used as context
                vector_size=size,          # dimensionality of the document vectors ("size" in gensim < 4.0)
                sample=1e-4,               # threshold for downsampling very frequent words
                negative=5,                # number of noise words drawn for negative sampling
                workers=cores,             # number of worker threads
                epochs=max_iter)           # training passes over the corpus ("iter" in gensim < 4.0)
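# Quick sanity check (the gensim Doc2Vec tutorial suggests this): re-inferring
# a training line and querying its neighbours should rank that line's own tag
# near the top if training went well.
check_vec = model.infer_vector(line_txt['tokenized_sents'][0])
print(model.dv.most_similar([check_vec], topn=3))  # expect tag 0 high up; model.docvecs in gensim < 4.0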
overalldf = []
for line in mat_subset:
    tokens = nltk.word_tokenize(line)  # infer_vector expects a list of tokens, not a raw string
    infer_vector = model.infer_vector(tokens)
    similar_documents = model.dv.most_similar([infer_vector], topn=10)  # model.docvecs in gensim < 4.0
    # most_similar returns (tag, similarity) pairs; map each tag back to its line text
    df = pd.DataFrame([(line_txt['sentences'][tag], sim) for tag, sim in similar_documents],
                      columns=['sentence', 'Similarity'])
    overalldf.append(df)
final = pd.concat(overalldf, ignore_index=True)
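# Eyeballing the strongest matches across all products, e.g.:
print(final.sort_values('Similarity', ascending=False).head(10))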
This is the code I have used, where mat_subset is my list of product names. I am pretty new to Python, so please correct me if I am doing something wrong.