I've been using doc2vec in the most basic way so far, with limited success. I'm able to find similar documents; however, I often get a lot of false positives. My primary goal is to build a classification algorithm for user requirements, to help with user requirement analysis and search.
I know this is not really a large enough dataset, so there are a few questions I'd like help with:
- How can I train on one set of documents and build vectors for another? (See the infer_vector sketch right after this list.)
- How do I go about tuning the model, specifically selecting the right number of dimensions for the vector space? (A tuning sketch follows the code below.)
- How can I create a hierarchical clustering of the word vectors? Should I do this with one model, or create separate word and document models? (See the clustering sketch at the end.)
- I don't have ground truth; since this is unsupervised learning, how do I measure the quality of the result when tuning? (This is also covered by the tuning sketch below.)
- And finally, are there any recommended online resources that might cover some of the above?
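
For the first one, my current understanding is that train() fits the model on one corpus, and infer_vector() can then produce vectors for unseen text without retraining. Here's a sketch of what I've been trying; clean_str is my own preprocessing (shown further down) and I'm assuming it returns a token list, and most_similar lives on model.dv in gensim 4.x (model.docvecs in older releases):

# infer a vector for a document the model was never trained on;
# infer_vector expects a list of tokens, preprocessed the same way
# as the training corpus
tokens = clean_str('the user shall be able to export a report')
new_vec = d2vm.infer_vector(tokens)

# rank the trained document tags by cosine similarity to the new text
print(d2vm.dv.most_similar([new_vec], topn=10))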
I've been calling train once, with 100-dimensional vectors, on 2,000 documents of about 100 words each; each document has 22 columns, which are tagged by both cell and row.
import os
import gensim

def tag_dataframe(df, selected_cols):
    tagged_cells = []
    headers = list(df.columns.values)
    for index, row in df.iterrows():
        row_tag = 'row_' + str(index)
        for col_name in headers:
            if col_name in selected_cols:
                col_tag = 'col_' + col_name
                cell_tag = 'cell_' + str(index) + '_' + col_name
                cell_val = str(row[col_name])
                if cell_val == 'nan':
                    continue
                # clean_str is my own preprocessing; it returns a token
                # list, which is what TaggedDocument expects for words
                cleaned_text = clean_str(cell_val)
                if len(cleaned_text) == 0:
                    continue
                # each cell is tagged with both its row and its cell id
                tagged_cells.append(
                    gensim.models.doc2vec.TaggedDocument(
                        cleaned_text,
                        [row_tag, cell_tag]))
    print('tagged rows')
    return tagged_cells
def load_or_build_vocab(model_path, tagged_cells):
    if os.path.exists(model_path):
        print('Loading vocab')
        d2vm = gensim.models.Doc2Vec.load(model_path)
    else:
        print('building vocab')
        d2vm = gensim.models.Doc2Vec(
            vector_size=100,  # dimensionality of the learned vectors
            min_count=0,      # keep every token, even singletons
            alpha=0.025,      # initial learning rate
            min_alpha=0.001)  # final learning rate
        d2vm.build_vocab(tagged_cells)
        print(' built')
        d2vm.save(model_path)
    return d2vm
def load_or_train_model(model_path, d2vm, tagged_cells):
    if os.path.exists(model_path):
        print('Loading Model')
        d2vm = gensim.models.Doc2Vec.load(model_path)
    else:
        print('Training Model')
        d2vm.train(
            tagged_cells,
            total_examples=len(tagged_cells),
            epochs=100)
        print(' trained')
        d2vm.save(model_path)
    return d2vm
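
On tuning and measuring quality without ground truth (the second and fourth questions), the only concrete idea I've found is the self-similarity check from the gensim doc2vec tutorial: re-infer a vector for every training document and see where the document ranks among its own nearest neighbours. A sketch of what I mean, reusing tagged_cells from above and again assuming gensim 4.x's model.dv:

import collections

def self_similarity_ranks(d2vm, tagged_cells):
    # re-infer each training document and record where its own cell tag
    # lands in the similarity ranking; rank 0 means the model retrieves
    # the document itself first, which is the hoped-for outcome
    ranks = collections.Counter()
    for doc in tagged_cells:
        inferred = d2vm.infer_vector(doc.words)
        sims = d2vm.dv.most_similar([inferred], topn=len(d2vm.dv))
        tags = [tag for tag, _ in sims]
        ranks[tags.index(doc.tags[1])] += 1  # tags[1] is the cell tag
    return ranks

# crude sweep over vector dimensions: train a fresh model per size and
# compare how often documents retrieve themselves (rank 0)
for size in (50, 100, 200, 300):
    model = gensim.models.Doc2Vec(vector_size=size, min_count=0)
    model.build_vocab(tagged_cells)
    model.train(tagged_cells, total_examples=len(tagged_cells), epochs=100)
    print(size, self_similarity_ranks(model, tagged_cells)[0])

Is this a reasonable approach, or is there a better intrinsic measure?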
What I hope to achieve is a set of document vectors that will help with finding similar user requirements from free text, and a hierarchical clustering to build navigation of the existing requirements.
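
For the hierarchical clustering itself (the third question), I've been looking at scipy's hierarchy module over the learned vectors. This sketch clusters the document vectors of the single model above (d2vm.wv.vectors would give the word vectors instead); I don't know whether separate word and document models would work better:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# stack the learned document vectors into a matrix; index_to_key and
# vectors are the gensim 4.x names (use docvecs in older versions)
labels = list(d2vm.dv.index_to_key)
X = d2vm.dv.vectors

# agglomerative (Ward) clustering; the dendrogram could become the
# navigation tree over the existing requirements
Z = linkage(X, method='ward')
dendrogram(Z, labels=labels)
plt.show()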