
My code was running out of memory, as described in the question I asked on this page. I therefore wrote a second version that builds alldocs as an iterable instead of an all-in-memory list, following the explanation on this page. I am not familiar with the stream concept and I could not solve the error I got.

This code reads all files in all folders under the given path. The content of each file consists of a document name followed by its content, on two alternating lines. For instance:

clueweb09-en0010-07-00000

dove gif clipart pigeon clip art picture image hiox free birds india web icons clipart add stumble upon

clueweb09-en0010-07-00001

google bookmarks yahoo bookmarks php script java script jsp script licensed scripts html tutorials css tutorials
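
As an aside, here is a minimal sketch of how a file in this format could be parsed lazily into (document name, tokens) pairs; the helper name iter_name_token_pairs is mine and does not appear in the code that follows:

# Hypothetical helper (not part of the question's code): walk a file of the
# format shown above and lazily pair each document name with its tokens.
def iter_name_token_pairs(fname):
    doc_name, tokens = None, []
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if line.startswith('clueweb09-en00'):
                if doc_name is not None:
                    yield doc_name, tokens
                doc_name, tokens = line, []
            elif line:
                tokens.extend(line.split())
    if doc_name is not None:
        yield doc_name, tokens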

First code:

# coding: utf-8
import string
import nltk
import nltk.tokenize
from nltk.corpus import stopwords
import re
import os, sys

import MySQLRepository

from gensim import utils
from gensim.models.doc2vec import Doc2Vec
import gensim.models.doc2vec
from gensim.models.doc2vec import LabeledSentence
from boto.emr.emrobject import KeyValue


def readAllFiles(path):
    dirs = os.listdir(path)
    for file in dirs:
        if os.path.isfile(path + "/" + file):
            prepareDoc2VecSetting(path + '/' + file)
        else:
            pf = path + "/" + file
            readAllFiles(pf)

def prepareDoc2VecSetting(fname):
    mapDocName_Id = []
    keyValues = set()
    with open(fname) as alldata:
        a = alldata.readlines()
        end = len(a)
        label = 0
        tokens = []
        for i in range(0, end):
            if a[i].startswith('clueweb09-en00'):
                mapDocName_Id.insert(label, a[i])
                label = label + 1
                alldocs.append(LabeledSentence(tokens[:], [label]))
                keyValues |= set(tokens)
                tokens = []
            else:
                tokens = tokens + a[i].split()

    mydb.insertkeyValueData(keyValues)

    mydb.insertDocId(mapDocName_Id)


mydb = MySQLRepository.MySQLRepository()

alldocs = []
pth = '/home/flr/Desktop/newInput/tokens'
readAllFiles(pth)

model = Doc2Vec(alldocs, size=300, window=5, min_count=2, workers=4)
model.save(pth + '/my_model.doc2vec')

Second code (I left out the parts related to the DB):

import gensim
import os

from gensim.models.doc2vec import Doc2Vec
import gensim.models.doc2vec
from gensim.models.doc2vec import LabeledSentence


class prepareAllDocs(object):

    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        mapDocName_Id = []
        label = 1
        for root, dirs, files in os.walk(self.top_dir):
            for fname in files:
                print fname
                inputs = []
                tokens = []
                with open(os.path.join(root, fname)) as f:
                    for i, line in enumerate(f):
                        if line.startswith('clueweb09-en00'):
                            mapDocName_Id.append(line)
                            if tokens:
                                yield LabeledSentence(tokens[:], [label])
                                label += 1
                                tokens = []
                        else:
                            tokens = tokens + line.split()
                    yield LabeledSentence(tokens[:], [label])

pth = '/home/flashkar/Desktop/newInput/tokens/'
allDocs = prepareAllDocs('/home/flashkar/Desktop/newInput/tokens/')
for doc in allDocs:
    model = Doc2Vec(allDocs, size=300, window=5, min_count=2, workers=4)
model.save(pth + '/my_model.doc2vec')

This is the error:

Traceback (most recent call last):
  File "/home/flashkar/git/doc2vec_annoy/Doc2Vec_Annoy/KNN/testiterator.py", line 44, in <module>
    model = Doc2Vec(allDocs, size = 300, window = 5, min_count = 2, workers = 4)
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 618, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 523, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 655, in scan_vocab
    for document_no, document in enumerate(documents):
  File "/home/flashkar/git/doc2vec_annoy/Doc2Vec_Annoy/KNN/testiterator.py", line 40, in __iter__
    yield LabeledSentence(tokens[:], tpl[1])
IndexError: list index out of range

user3092781
  • Your 'second code' is on the right track, but: (1) you're still appending every `line` to `mapDocName_Id` - so bringing everything into one in-memory list; (2) it's impossible for `tokens` to ever be non-empty where you're testing it, because it's just been set to `[]` before every loop iteration, so you'll never yield anything; (3) you're now passing a single tuple into LabeledSentence, rather than the two lists it expects; (4) you don't need to loop over `alldocs` yourself, when it's working right, you just pass `alldocs` into Doc2Vec once. – gojomo Feb 22 '17 at 16:51
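
To make those points concrete, here is a minimal sketch of the kind of iterator they suggest (my own sketch, not gojomo's code; it reuses the class name from the question and assumes the LabeledSentence(words, tags) signature of the gensim version used in the question):

import os
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

class prepareAllDocs(object):
    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        label = 0
        for root, dirs, files in os.walk(self.top_dir):
            for fname in files:
                tokens = []
                with open(os.path.join(root, fname)) as f:
                    for line in f:
                        if line.startswith('clueweb09-en00'):
                            # yield the previous document instead of
                            # storing it in an in-memory list
                            if tokens:
                                yield LabeledSentence(tokens, [label])
                                label += 1
                            tokens = []
                        else:
                            tokens += line.split()
                if tokens:  # last document in the file
                    yield LabeledSentence(tokens, [label])
                    label += 1

# pass the iterable to Doc2Vec once; no explicit loop over the documents
alldocs = prepareAllDocs('/home/flashkar/Desktop/newInput/tokens/')
model = Doc2Vec(alldocs, size=300, window=5, min_count=2, workers=4)

Because Doc2Vec iterates over the corpus several times (a vocabulary scan plus the training passes), it needs a re-iterable object like this class; a plain generator function would be exhausted after the first pass.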

1 Answer


You are using a generator function because you don't want to store all of your documents, but you are still storing all of your documents in alldocs. You can just yield LabeledSentence(tokens[:], [tpl[1]]).

What is currently happening is that you are appending to a list and returning the list; that is why you are getting the AttributeError. Additionally, on each iteration you are appending to the list, which means that on each iteration i you are returning document i together with all the documents that came before it!
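
To illustrate the difference the answer describes, here is a toy example of mine (made-up document data, not the question's files; it assumes a gensim version where LabeledSentence is still available):

from gensim.models.doc2vec import LabeledSentence

docs = [(['dove', 'gif', 'clipart'], 0),
        (['google', 'bookmarks', 'yahoo'], 1)]

# list-building version: every document ends up in memory at once
def build_all(pairs):
    alldocs = []
    for words, tag in pairs:
        alldocs.append(LabeledSentence(words, [tag]))
    return alldocs

# generator version: only one LabeledSentence exists at a time
def stream_all(pairs):
    for words, tag in pairs:
        yield LabeledSentence(words, [tag])

print(list(stream_all(docs)))  # same documents, produced lazily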

aberger
  • I updated my code but I got a new error. Also, could you explain more about the second part of your answer? I could not get it. – user3092781 Feb 21 '17 at 18:55
  • I think you should first try to debug your code and come to SO when you really get stuck with a specific issue. To point you in the right direction, you should merge iter_document with __iter__ in prepareAllDocs to make things simpler. You should also check out enumerate in Python (see the short sketch after these comments) so you don't need to read all your data in with readlines() and you don't have to increment label in iter_document. – aberger Feb 21 '17 at 19:09
  • I tried to debug but I could not find the problem. You mean I should read line by line instead of _readlines()_? Why? – user3092781 Feb 21 '17 at 19:36
  • I could not use _enumerate_ because the label of each file is not the same as the index in the following code: `for index, line in enumerate(fname):` – user3092781 Feb 21 '17 at 19:51
  • Right, I missed that, but you can still use it so you don't have to keep indexing a. – aberger Feb 21 '17 at 20:01
  • @aberger I changed my code based on your suggestion but _allDocs_ is empty now. How can I call __iter__? – user3092781 Feb 21 '17 at 20:33
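
As a small illustration of the enumerate/line-by-line suggestion above (my own snippet; the file name is only a placeholder):

# Iterate over the file object directly instead of loading everything
# with readlines(); only the current line is held in memory.
with open('some_tokens_file') as f:
    for i, line in enumerate(f):      # i is the line number, starting at 0
        if line.startswith('clueweb09-en00'):
            print(i, line.strip())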