I am training a model with gensim
, my corpus is many short sentences, and each sentence has a frequency which indicates times it occurs in total corpus. I implement it as follow, as you can see, I just choose to do repeat freq
times. Any way, if the data is small, it should work, but when data grows, the frequency can be very large, it costs too much memory and my machine cannot afford it.
So
1. can I just count the frequency in every record instead of repeat freq
times? 2. Or any other ways to save memory?
class AddressSentences(object):
def __init__(self, raw_path, path):
self._path = path
def __iter__(self):
with open(self.path) as fi:
headers = next(fi).split(",")
i_address, i_freq = headers.index("address"), headers.index("freq")
index = 0
for line in fi:
cols = line.strip().split(",")
freq = cols[i_freq]
address = cols[i_address].split()
# Here I do repeat
for i in range(int(freq)):
yield TaggedDocument(address, [index])
index += 1
print("START %s" % datetime.datetime.now())
train_corpus = list(AddressSentences("/data/corpus.csv"))
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
print("END %s" % datetime.datetime.now())
corpus is something like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300