I've been using doc2vec in the most basic way so far, with limited success. I'm able to find similar documents; however, I often get a lot of false positives. My primary goal is to build a classification algorithm for user requirements, to help with user requirement analysis and search.
I know this is not really a large enough dataset, so there are a few questions I'd like help with:
- How can I train on one set of documents and build vectors for another? (See the infer_vector sketch right after this list.)
- How do I go about tuning the model, specifically selecting the right number of dimensions for the vector space? (A tuning sketch follows the code below.)
- How can I create a hierarchical clustering of the word vectors? Should I do this with one model, or create separate word and document models? (See the clustering sketch at the end.)
- I don't have ground truth; since this is unsupervised learning, how do I measure the quality of the result when tuning? (This is also covered by the tuning sketch below.)
- And finally, are there any recommended online resources that might cover some of the above?
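
For the first one, my current understanding is that train() fits the model on one corpus, and infer_vector() can then produce vectors for unseen text without retraining. Here's a sketch of what I've been trying; clean_str is my own preprocessing (shown further down) and I'm assuming it returns a token list, and most_similar lives on model.dv in gensim 4.x (model.docvecs in older releases):

# infer a vector for a document the model was never trained on;
# infer_vector expects a list of tokens, preprocessed the same way
# as the training corpus
tokens = clean_str('the user shall be able to export a report')
new_vec = d2vm.infer_vector(tokens)

# rank the trained document tags by cosine similarity to the new text
print(d2vm.dv.most_similar([new_vec], topn=10))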
I've been calling train once, with 100-dimensional vectors, on 2,000 documents of about 100 words each; each document has 22 columns, which are tagged by both cell and row.
import os
import gensim

def tag_dataframe(df, selected_cols):
    tagged_cells = []
    headers = list(df.columns.values)
    for index, row in df.iterrows():
        row_tag = 'row_' + str(index)
        for col_name in headers:
            if col_name in selected_cols:
                col_tag = 'col_' + col_name
                cell_tag = 'cell_' + str(index) + '_' + col_name
                cell_val = str(row[col_name])
                if cell_val == 'nan':
                    continue
                # clean_str is my own preprocessing; it returns a token
                # list, which is what TaggedDocument expects for words
                cleaned_text = clean_str(cell_val)
                if len(cleaned_text) == 0:
                    continue
                # each cell is tagged with both its row and its cell id
                tagged_cells.append(
                    gensim.models.doc2vec.TaggedDocument(
                        cleaned_text,
                        [row_tag, cell_tag]))
    print('tagged rows')
    return tagged_cells
def load_or_build_vocab(model_path, tagged_cells):
    if os.path.exists(model_path):
        print('Loading vocab')
        d2vm = gensim.models.Doc2Vec.load(model_path)
    else:
        print('building vocab')
        d2vm = gensim.models.Doc2Vec(
            vector_size=100,  # dimensionality of the learned vectors
            min_count=0,      # keep every token, even singletons
            alpha=0.025,      # initial learning rate
            min_alpha=0.001)  # final learning rate
        d2vm.build_vocab(tagged_cells)
        print(' built')
        d2vm.save(model_path)
    return d2vm
def load_or_train_model(model_path, d2vm, tagged_cells):
    if os.path.exists(model_path):
        print('Loading Model')
        d2vm = gensim.models.Doc2Vec.load(model_path)
    else:
        print('Training Model')
        d2vm.train(
            tagged_cells,
            total_examples=len(tagged_cells),
            epochs=100)
        print(' trained')
        d2vm.save(model_path)
    return d2vm
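
On tuning and measuring quality without ground truth (the second and fourth questions), the only concrete idea I've found is the self-similarity check from the gensim doc2vec tutorial: re-infer a vector for every training document and see where the document ranks among its own nearest neighbours. A sketch of what I mean, reusing tagged_cells from above and again assuming gensim 4.x's model.dv:

import collections

def self_similarity_ranks(d2vm, tagged_cells):
    # re-infer each training document and record where its own cell tag
    # lands in the similarity ranking; rank 0 means the model retrieves
    # the document itself first, which is the hoped-for outcome
    ranks = collections.Counter()
    for doc in tagged_cells:
        inferred = d2vm.infer_vector(doc.words)
        sims = d2vm.dv.most_similar([inferred], topn=len(d2vm.dv))
        tags = [tag for tag, _ in sims]
        ranks[tags.index(doc.tags[1])] += 1  # tags[1] is the cell tag
    return ranks

# crude sweep over vector dimensions: train a fresh model per size and
# compare how often documents retrieve themselves (rank 0)
for size in (50, 100, 200, 300):
    model = gensim.models.Doc2Vec(vector_size=size, min_count=0)
    model.build_vocab(tagged_cells)
    model.train(tagged_cells, total_examples=len(tagged_cells), epochs=100)
    print(size, self_similarity_ranks(model, tagged_cells)[0])

Is this a reasonable approach, or is there a better intrinsic measure?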
What I hope to achieve is a set of document vectors that will help with finding similar user requirements from free text, and a hierarchical clustering to build navigation of the existing requirements.
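
For the hierarchical clustering itself (the third question), I've been looking at scipy's hierarchy module over the learned vectors. This sketch clusters the document vectors of the single model above (d2vm.wv.vectors would give the word vectors instead); I don't know whether separate word and document models would work better:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# stack the learned document vectors into a matrix; index_to_key and
# vectors are the gensim 4.x names (use docvecs in older versions)
labels = list(d2vm.dv.index_to_key)
X = d2vm.dv.vectors

# agglomerative (Ward) clustering; the dendrogram could become the
# navigation tree over the existing requirements
Z = linkage(X, method='ward')
dendrogram(Z, labels=labels)
plt.show()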