In the context of an internship project, I have to perform a TF-IDF analysis over a large set of files (~18000). I am trying to use the TfidfVectorizer from sklearn, but I'm facing the following issue: how can I avoid loading all the files into memory at once? According to what I read in other posts, it should be feasible using an iterable, but if I use, for instance, `[open(file) for file in os.listdir(path)]` as the `raw_documents` input to the `fit_transform()` function, I get a 'too many open files' error. Thanks in advance for your suggestions! Cheers! Paul

paupaulaz
  • have you tried the gensim TF-IDF model? – Sreeram TP Jul 19 '18 at 12:29
  • Try `[open(fn).read() for fn in os.listdir(path)]`? This loads all the files into memory at once, but memory is rarely a problem with text data. – phi Jul 19 '18 at 12:36
  • @SreeramTP I am trying to stick to sklearn for now, but I'll give it a try if I don't find a solution, thanks ;) – paupaulaz Jul 19 '18 at 12:40
  • Okay. The gensim model is very memory efficient and fast. You can learn more here: https://radimrehurek.com/gensim/intro.html – Sreeram TP Jul 19 '18 at 12:43
  • @phi the problem with this is that it would load all of the text from the files into memory at once to create that list, but I think the exact same syntax with parentheses instead of brackets might work, since it creates a generator: `(open(fn).read() for fn in os.listdir(path))` (see the sketch after these comments). I am currently trying! Thanks :) – paupaulaz Jul 19 '18 at 12:43
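
A minimal sketch of the generator approach from the comment above (assuming `path` points at the directory of text files; note that file handles opened this way are only closed when garbage-collected):

import os

# Generator expression: each file is opened and read lazily, one at a
# time, only when the consumer (e.g. fit_transform) asks for the next item
docs = (open(os.path.join(path, fn)).read() for fn in os.listdir(path))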

1 Answer


Have you tried the `input='filename'` param of `TfidfVectorizer`? Something like this:

raw_docs_filepaths = [...]  # list containing the filepaths of all the files

tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)

This should work, because with this setting the vectorizer will open a single file at a time while processing it. This can be confirmed by cross-checking the source code here:

def decode(self, doc):
    ...
    ...
    if self.input == 'filename':
        with open(doc, 'rb') as fh:
            doc = fh.read()
    ...
    ...
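
For completeness, a minimal end-to-end sketch of this approach (the directory path and the `.txt` filter are assumptions for illustration):

import os
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_dir = "/path/to/corpus"  # hypothetical directory holding the ~18000 files

# Only the file paths live in memory here, not the file contents
raw_docs_filepaths = [os.path.join(corpus_dir, fn)
                      for fn in os.listdir(corpus_dir)
                      if fn.endswith(".txt")]

# With input='filename', the vectorizer opens and reads each file itself,
# one at a time, during fit_transform
tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_data = tfidf_vectorizer.fit_transform(raw_docs_filepaths)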
Vivek Kumar
  • thanks a lot for taking the time to answer! This was probably exactly what I was looking for! In the meantime, I found a hacky way to do it, by giving a generator rather than a list as input to fit_transform: `file_objects = (open(os.path.join(txt_train_path, file)).read() for file in train_files if file.endswith("txt"))` (a tidier version is sketched below). – paupaulaz Jul 21 '18 at 11:05
  • Need your help with this: https://stackoverflow.com/questions/51633775/store-tf-idf-matrix-and-update-existing-matrix-on-new-articles-in-pandas#51634394 – Learner Aug 01 '18 at 19:50
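
A tidier version of the generator hack from the comment above (names like `txt_train_path` and `train_files` are taken from that comment; wrapping the read in a `with` block closes each file handle promptly, which is what avoids the 'too many open files' error):

import os
from sklearn.feature_extraction.text import TfidfVectorizer

txt_train_path = "/path/to/train"         # hypothetical
train_files = os.listdir(txt_train_path)  # hypothetical

def iter_documents(dirpath, filenames):
    # Yield one document's text at a time; the with-block closes each
    # file before the next one is opened
    for fname in filenames:
        if fname.endswith("txt"):
            with open(os.path.join(dirpath, fname)) as fh:
                yield fh.read()

# fit_transform consumes the generator lazily, so only one document's
# text is held in memory at any moment (plus the growing vocabulary)
tfidf_vectorizer = TfidfVectorizer()
tfidf_data = tfidf_vectorizer.fit_transform(iter_documents(txt_train_path, train_files))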