I'm trying to train a word2vec model using a file with about 170K lines, with one sentence per line.
I think I may represent a special use case because the "sentences" have arbitrary strings rather than dictionary words. Each sentence (line) has about 100 words and each "word" has about 20 characters, with characters like "/"
and also numbers.
The training code is very simple:
# as shown in http://rare-technologies.com/word2vec-tutorial/
import gensim, logging, os
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
class MySentences(object):
def __init__(self, dirname):
self.dirname = dirname
def __iter__(self):
for fname in os.listdir(self.dirname):
for line in open(os.path.join(self.dirname, fname)):
yield line.split()
current_dir = os.path.dirname(os.path.realpath(__file__))
# each line represents a full chess match
input_dir = current_dir+"/../fen_output"
output_file = current_dir+"/../learned_vectors/output.model.bin"
sentences = MySentences(input_dir)
model = gensim.models.Word2Vec(sentences,workers=8)
Thing is, things work real quick up to 100K sentences (my RAM steadily going up) but then I run out of RAM and I can see my PC has started swapping, and training grinds to a halt. I don't have a lot of RAM available, only about 4GB and word2vec
uses up all of it before starting to swap.
I think I have OpenBLAS correctly linked to numpy: this is what numpy.show_config()
tells me:
blas_info:
libraries = ['blas']
library_dirs = ['/usr/lib']
language = f77
lapack_info:
libraries = ['lapack']
library_dirs = ['/usr/lib']
language = f77
atlas_threads_info:
NOT AVAILABLE
blas_opt_info:
libraries = ['openblas']
library_dirs = ['/usr/lib']
language = f77
openblas_info:
libraries = ['openblas']
library_dirs = ['/usr/lib']
language = f77
lapack_opt_info:
libraries = ['lapack', 'blas']
library_dirs = ['/usr/lib']
language = f77
define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
blas_mkl_info:
NOT AVAILABLE
mkl_info:
NOT AVAILABLE
My question is: is this expected on a machine that hasn't got a lot of available RAM (like mine) and I should get more RAM or train the model in smaller pieces? or does it look like my setup isn't configured properly (or my code is inefficient)?
Thank you in advance.