Doc2vec on a corpus of novels: how do I assign to each sentence of a novel one tag for the ID of the sentence and one tag for the ID of the book?

Question

I am trying to train a doc2vec model on a corpus of six novels, and I need to build the corpus of tagged documents. Each novel is a txt file, already preprocessed and read into Python using the `read()` method, so that it appears as one "long string". If I tag each novel using `TaggedDocument` from gensim, each novel gets only one tag, and the corpus of tagged documents has only six elements (which is not enough to train the doc2vec model).

I have been advised to split each novel into sentences, then give each sentence two tags: one for the ID of the sentence and one for the ID of the book it belongs to. I am stuck, however, because I do not know how to structure the code.

This was my first attempt, i.e. the one that uses each novel as a single "long string":

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [emma_text, persuasion_text, prideandprejudice_text,
                 janeeyre_text, shirley_text, professor_text]
    corpus = []

    # One TaggedDocument per novel, tagged with a zero-padded document ID
    for docid, document in enumerate(documents):
        corpus.append(TaggedDocument(document.split(),
                                     tags=["{0:0>4}".format(docid)]))

    d2v_model = Doc2Vec(vector_size=100,
                        window=15,
                        hs=0,
                        sample=0.000001,
                        min_count=100,
                        workers=-1,
                        epochs=500,
                        dm=0,
                        dbow_words=1)

    d2v_model.build_vocab(corpus)

    d2v_model.train(corpus, total_examples=d2v_model.corpus_count,
                    epochs=d2v_model.epochs)

This, however, means that my corpus of tagged documents has only six elements, so my model does not have enough documents to train on. For instance, if I apply the `.most_similar` method to a target book, I get completely wrong results.

To sum up, I need help to assign each sentence of each book (I have already split the books into sentences) one tag for the ID of the sentence and one tag for the ID of the book it belongs to, using TaggedDocument to build the corpus on which I will train my model.
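For illustration, the tagging described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the helper `tag_sentences`, the `books` dictionary, the toy sentences, and the tag format `bookid_00000` are all hypothetical and not part of my real code. The `namedtuple` fallback is there only so the sketch runs without gensim installed; gensim's own `TaggedDocument` has the same `words` and `tags` fields.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, only so this sketch runs without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def tag_sentences(books):
    """Build one TaggedDocument per sentence, with a sentence tag and a book tag.

    `books` (hypothetical structure) maps a book ID string to a list of
    sentence strings that have already been split and preprocessed.
    """
    corpus = []
    for book_id, sentences in books.items():
        for sent_no, sentence in enumerate(sentences):
            words = sentence.split()
            if not words:
                continue  # skip empty sentences
            # Sentence tag is unique across the corpus; book tag is shared
            # by every sentence of the same novel.
            sent_tag = "{}_{:0>5}".format(book_id, sent_no)
            corpus.append(TaggedDocument(words, tags=[sent_tag, book_id]))
    return corpus


# Toy data standing in for the pre-split novels (assumption, not real text)
books = {
    "emma": ["Emma Woodhouse was handsome", "She was the youngest"],
    "persuasion": ["Sir Walter Elliot was a man"],
}
corpus = tag_sentences(books)
```

With tags built this way, each book ID would be shared by all sentences of that novel, so after training the book-level vector should be retrievable by that tag (for example via `d2v_model.dv["emma"]` in gensim 4, or `d2v_model.docvecs["emma"]` in earlier versions).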

Thanks for the attention!

To better make suggestions: What is your end goal in using `Doc2Vec` to create a vector-model of the full books (or portions thereof, like chapters, paragraphs or sentences)? Is it specifically the challenge of breaking the full docs into smaller chunks that you need help with? — gojomo, Mar 27 '19 at 16:58

