train Gensim word2vec using large txt file

Question

I have a large txt file (150 MB) like this:

'intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one', 'better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', ...

I want to train a word2vec model using that file, but it gives me a RAM problem. I don't know how to feed the txt file to the word2vec model. This is my code. I know that my code has a problem, but I don't know where it is.

import gensim

f = open('your_file1.txt')
for line in f:
    b = line
    model = gensim.models.Word2Vec([b], min_count=1, size=32)

w1 = "bad"
model.wv.most_similar(positive=w1)

You are creating one model per line of the input file (f). This is not how you train a model. Read all the sentences and then train a model. – mujjiga, Mar 10 '19 at 10:40
Yes, that's because you are creating too many model objects (one per line). As I mentioned in the comment above, the way you are training the model is wrong. – mujjiga, Mar 10 '19 at 10:46
The code above won't give an error; it doesn't work at all. When I tried to give it the whole file, it gave me an error. – Mar 10 '19 at 11:07

score 4 · Answer 1 · answered Mar 10 '19 at 19:00

You can make an iterator that reads your file one line at a time instead of reading everything into memory at once. The following should work:

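from gensim.models import Word2Vec


class SentenceIterator:
    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # Reopen the file on every pass so it can be iterated more than once
        with open(self.filepath) as f:
            for line in f:
                yield line.split()  # one sentence per line, split on whitespace


sentences = SentenceIterator('datadir/textfile.txt')
model = Word2Vec(sentences)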
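
A note on why this is a class with `__iter__` rather than a plain generator function: as far as I know, Word2Vec goes over the corpus more than once (one pass to build the vocabulary, then one pass per training epoch), so the input has to be restartable. The sketch below only illustrates that difference with the standard library; the temporary file, its two sentences, and the `sentence_generator` function are made up for the demonstration.

```python
import os
import tempfile


class SentenceIterator:
    """Restartable iterable: every call to __iter__ reopens the file."""

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, encoding="utf-8") as f:
            for line in f:
                yield line.split()


def sentence_generator(filepath):
    """One-shot generator (hypothetical): exhausted after a single pass."""
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            yield line.split()


# Hypothetical two-sentence corpus written to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the food was good\nthe service was bad\n")
    path = tmp.name

restartable = SentenceIterator(path)
first_pass = list(restartable)
second_pass = list(restartable)  # same sentences again

one_shot = sentence_generator(path)
gen_first = list(one_shot)
gen_second = list(one_shot)  # empty: the generator is already exhausted

os.remove(path)

print(len(first_pass), len(second_pass))  # 2 2
print(len(gen_first), len(gen_second))    # 2 0
```

Only one line is held in memory at a time in both cases; the class version simply allows the file to be read again from the start.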

Anna Krogager


I added this (`w1 = "good"; model.wv.most_similar(positive=w1)`) at the end of your code, but it gives me "word 'good' not in vocabulary". – Mar 11 '19 at 04:01
That must be a problem with your input data. Make sure that your data file is a text file with one sentence per line (written normally with spaces between words, not as a list of words). – Anna Krogager Mar 11 '19 at 07:37
Is it OK if each sentence starts with an opening bracket and ends with a closing bracket? – Mar 11 '19 at 09:33
Not with the code I wrote but you can of course change the code to fit your data. In the end you want `sentences` (i.e. the model input) to be an iterator of lists of words. In my code this is done by `line.split()` which splits each line of your file on spaces. – Anna Krogager Mar 11 '19 at 09:44
Can you instead just paste the first 5 lines or so in your question so I can see what the file looks like? – Anna Krogager Mar 12 '19 at 08:42
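
If the file really looks like the sample in the question (quoted words separated by commas, possibly wrapped in brackets), then `line.split()` leaves the quotes and commas attached to every token, which is a likely reason for the "not in vocabulary" error. A minimal sketch of an iterator adapted to that layout follows. It assumes one sentence per line and that no word contains a comma or a quote character; `parse_line` and `QuotedListSentences` are hypothetical names, not part of gensim.

```python
def parse_line(line):
    """Split a line such as "['good', 'one', 'better']" into bare words.

    Hypothetical helper: assumes words contain no commas or quote characters.
    """
    pieces = (piece.strip(" \t\r\n[]'\"") for piece in line.split(","))
    return [p for p in pieces if p]


class QuotedListSentences:
    """Restartable iterable yielding one list of words per non-empty line."""

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        with open(self.filepath, encoding="utf-8") as f:
            for line in f:
                tokens = parse_line(line)
                if tokens:
                    yield tokens


# What plain whitespace splitting produces on this layout:
print("'good', 'one'".split())  # ["'good',", "'one'"]

# What the helper produces:
print(parse_line("['intrepid', 'bumbling', 'duo']"))  # ['intrepid', 'bumbling', 'duo']
```

An instance such as `QuotedListSentences('datadir/textfile.txt')` could then be passed to `Word2Vec` in place of `SentenceIterator`.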
