
I don't know how to train a Doc2Vec model in multiple batches, since I currently load all my data into RAM and it can't all fit.

#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
import ReadExeFileCapstone
mapData = ReadExeFileCapstone.readData()

# print ('mapData', mapData)

max_epochs = 10000
vec_size = 200
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm =1)
data = []
for key in mapData:
    listData = mapData[key]
    # print ("listData: ", len(listData), listData)

    for i in range(len(listData)):
        listToStr = ' '.join([str(elem) for elem in listData[i]]) # join each inner list's elements into one space-separated string
        data.append(listToStr)

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]


# build the vocabulary from the tagged documents
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
# train model   
model.save("d2v_ASM.model")
print("Model Saved")

1 Answer


Doc2Vec (& similar model classes in gensim) don't require the full training data to be a list-in-memory. They will accept a Python 'iterable' object, which simply provides items one-at-a-time, repeatedly.

Such an iterable object can stream items from some other source, like a large file on disk – even a file that's far larger than available RAM.

It's not clear what your ReadExeFileCapstone utility class is doing. (There are no web hits for code with this name.) But, it can likely be changed to itself return an iterable object that, every time it is iterated over, returns each text, one-at-a-time, from the original data source. You can then wrap that in your code for creating the necessary TaggedDocument objects, again as an iterable rather than a full-list-in-memory.

A reasonable intro to this technique is available at:

https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
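
For instance, here's a minimal sketch of such a restartable iterable, assuming the raw texts can be re-read line-by-line from a plain-text file (the filename corpus.txt and the one-document-per-line layout are just illustrative assumptions, and the parameter names assume a recent gensim):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize  # assumes nltk's 'punkt' data is already downloaded

class StreamedTaggedDocs:
    """Restartable iterable: every pass re-opens the file and yields
    TaggedDocument objects one at a time, so the full corpus never
    has to sit in RAM."""
    def __init__(self, path):
        self.path = path  # hypothetical plain-text file, one document per line

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=word_tokenize(line.lower()), tags=[str(i)])

corpus = StreamedTaggedDocs('corpus.txt')  # illustrative filename
model = Doc2Vec(vector_size=200, dm=1, epochs=20)
model.build_vocab(corpus)  # first pass over the stream: builds the vocabulary
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)  # training passes
model.save("d2v_ASM.model")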

Separately:

  • 10000 epochs is an absurdly large count compared to published work, which usually uses 10-20 epochs, sometimes more for very small datasets. (But, such small datasets are less likely to get good results from Doc2Vec-like algorithms, which need lots of varied data. However, if you're having memory problems, your dataset is probably not tiny.)

  • Don't call train() multiple times in a loop where you hand-tamper with the alpha value. This is unnecessary and error-prone. In fact your current code is erroneous, because subtracting 0.0002 from your 0.025 starting alpha thousands of times will drive alpha negative, a nonsensical & destructive value. Call train() once with your desired number of epochs (see the sketch after this list) and it will do the right thing. And it's rare to need to tune the default alpha values at all.
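
That is, the whole epoch loop in your code can collapse to something like this (a sketch reusing your tagged_data, with current-gensim parameter names, a modest epoch count, and the default alpha):

model = Doc2Vec(vector_size=200, dm=1, min_count=1, epochs=20)
model.build_vocab(tagged_data)
# one call; gensim handles the learning-rate decay internally
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v_ASM.model")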

If you want more progress output – or just to better understand what's happening at each step – enable logging at the INFO level.
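
For example, before building the model:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)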

gojomo