
I have a column of line texts. From this column I would like to extract names that are similar to a list of product names. I was using Doc2Vec to solve the problem, but my results have been pretty bad. What is the right approach for this problem?

My data is as follows: LINE TEXT: pallets 10kg of chicken weldcote metals logistics 100th main, bolulvedour ave 19th main ST john 5670987

and the list of products I use to find the most similar names is mat_subset = ["shoes of UK size 10", "superdry trim", "box of weight 10kgs", "pallets", ...]

My line texts are OCR output, which is pretty decent. The Doc2Vec code I used is as follows.

import multiprocessing

import gensim
import nltk
import pandas as pd
from gensim.models.doc2vec import TaggedDocument

s_data = mat['LINETEXT']
line_txt = pd.DataFrame()
line_txt['sentences'] = s_data
line_txt['sentences'] = line_txt['sentences'].astype(str)
line_txt['tokenized_sents'] = line_txt.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

# TaggedDocument replaces the deprecated LabeledSentence
sentences = []
for item_no, line in enumerate(line_txt['tokenized_sents'].values.tolist()):
    sentences.append(TaggedDocument(line, [item_no]))

# MODEL PARAMETERS
dm = 1  # 1 for distributed memory (default); 0 for DBOW
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(documents=sentences,
                                      dm=dm,
                                      alpha=alpha,            # initial learning rate
                                      seed=seed,
                                      min_count=min_count,    # ignore words with freq less than min_count
                                      max_vocab_size=None,
                                      window=context_window,  # words before and after used as context
                                      size=size,              # dimensionality of the feature vector
                                      sample=1e-4,            # downsampling threshold for frequent words
                                      negative=5,             # number of negative samples
                                      workers=cores,          # number of worker threads
                                      iter=max_iter)          # number of training epochs

overalldf = []
for line in mat_subset:
    # infer_vector expects a list of tokens, not a raw string
    infer_vector = model.infer_vector(nltk.word_tokenize(line))
    similar_documents = model.docvecs.most_similar([infer_vector], topn=10)
    df = pd.DataFrame(similar_documents, columns=["sentence", "Similarity"])
    overalldf.append(df)

final = pd.concat(overalldf)

This is the code I have used, where mat_subset is my list of product names. I am pretty new to Python, so please correct me if I'm doing something wrong.

anirudh

1 Answer


Doc2Vec might work, if you have sufficient data, as might any number of other keyword-based or text-to-vector approaches (like representing products by sparse bag-of-words vectors).

But without knowing the limitations of your data, and whether whatever you've previously tried has been done right, and what your objective evaluation of "good enough" results would be, it's not possible to give specific answers.
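The sparse bag-of-words baseline mentioned in the answer can be sketched without any external libraries: represent each line text and each product name as a word-count vector and rank lines by cosine similarity. This is an illustrative sketch using a few strings modeled on the question's examples, not the asker's actual data or pipeline.

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy stand-ins for the OCR lines and product list from the question.
line_texts = [
    "pallets 10kg of chicken",
    "weldcote metals logistics",
    "box of weight 10kgs",
]
mat_subset = ["shoes of UK size 10", "superdry trim",
              "box of weight 10kgs", "pallets"]

line_bows = [Counter(t.lower().split()) for t in line_texts]
for product in mat_subset:
    p_bow = Counter(product.lower().split())
    scores = [bow_cosine(p_bow, lb) for lb in line_bows]
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f"{product!r} -> {line_texts[best]!r} ({scores[best]:.2f})")
```

Unlike Doc2Vec, this needs no training data, which makes it a useful sanity-check baseline before reaching for learned vectors.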

gojomo
  • My column of line texts is basically OCR output from an image file. The problem is that in some cases the OCR outputs random special characters. So far I have tried word2vec and doc2vec, in both cases giving the list of product names as the positive words. With doc2vec, my output does not surface the product names from the line texts as similar words. With word2vec I do get a list of product words, but along with them I also get some random junk words. – anirudh Jul 21 '17 at 10:00
  • If the OCR is highly buggy it will be hard to get good results from any technique. There's still not enough information (about your data or code) to know if your training/querying of Doc2Vec so far has been correct enough to expect results that'd be good for your needs. – gojomo Jul 21 '17 at 19:56
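Since the comments note that buggy OCR (e.g. corrupted tokens like "bolulvedour") hurts any token-overlap technique, character-level fuzzy matching is a common complement. A stdlib sketch using difflib follows; the function name and the 0.6 cutoff are hypothetical choices for illustration, not something from this thread.

```python
import difflib

def fuzzy_product_hits(line_text, products, cutoff=0.6):
    """Return (product, score) pairs whose best token-vs-token
    character similarity against the line reaches the cutoff."""
    tokens = line_text.lower().split()
    hits = []
    for product in products:
        # Best character-level match between any product word and any line token.
        score = max(
            (difflib.SequenceMatcher(None, tok, word).ratio()
             for word in product.lower().split() for tok in tokens),
            default=0.0,
        )
        if score >= cutoff:
            hits.append((product, round(score, 2)))
    return sorted(hits, key=lambda x: -x[1])

print(fuzzy_product_hits("pallets 10kg of chicken",
                         ["pallets", "superdry trim"]))
```

Because it compares characters rather than whole tokens, this tolerates OCR typos that would make exact bag-of-words or vocabulary-based models miss the product entirely.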