
I've been using a Python script to tokenize a lot of .txt files and calculate their TF-IDF. My script is as follows:

import nltk
import os
import re
import string

import scipy.io
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

path = 'R'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    # Run the Porter stemmer over every token.
    return [stemmer.stem(item) for item in tokens]

def tokenize(text):
    # Split the text into words with NLTK, then stem each word.
    tokens = nltk.word_tokenize(text)
    return stem_tokens(tokens, stemmer)

# Compile the cleanup regexes once, outside the loop.
remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))  # special characters
remove_num = re.compile(r'\d+')  # digit runs

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = os.path.join(subdir, file)
        with open(file_path, encoding="utf8") as shakes:
            text = shakes.read()
        lowers = text.lower()
        a1 = remove_spl_char_regex.sub(" ", lowers)  # remove special characters
        a2 = remove_num.sub("", a1)  # remove numbers
        token_dict[file] = a2

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())
scipy.io.savemat('arrdata4.mat', mdict={'arr': tfs})

Depending on the size of the files, I run into a MemoryError after about 30 minutes.
Can anyone explain how I can increase the memory Python has access to, or suggest another way to solve this problem?


1 Answer


Python doesn't have a memory limit beyond what the OS imposes.
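
You can check those limits from inside Python; here's a minimal sketch using the standard-library resource module (Unix only):

import resource

# Soft/hard caps on the process's total address space, in bytes.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
for name, limit in (("soft", soft), ("hard", hard)):
    if limit == resource.RLIM_INFINITY:
        print("%s limit: unlimited" % name)
    else:
        print("%s limit: %d bytes" % (name, limit))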

  • Make sure you're not limiting the process's memory usage with ulimit or equivalent.
  • Run top and see if the process uses all available memory.
  • Then you'll either have to decrease the memory your program needs (one way is sketched below), or increase the RAM/swap it gets access to.
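
If it turns out you're genuinely running out of RAM, the cheapest saving in your script is to stop holding every cleaned file in token_dict at once. Below is a minimal sketch of the same pipeline that hands TfidfVectorizer a list of file paths instead, via its input='filename' option, so scikit-learn opens and reads one file at a time; the combined cleanup regex and the arrdata4.npz name are just illustrative:

import os
import re
import string

import nltk
import scipy.sparse
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(text):
    # Same tokenizer as in the question: NLTK word tokens, Porter-stemmed.
    return [stemmer.stem(t) for t in nltk.word_tokenize(text)]

punct_or_digits = re.compile(r'[%s]|\d+' % re.escape(string.punctuation))

def preprocess(doc):
    # Lowercase and strip punctuation/digits, as the original loop did.
    return punct_or_digits.sub(' ', doc.lower())

# Collect file paths only; no file contents are held in memory here.
paths = [os.path.join(subdir, name)
         for subdir, dirs, files in os.walk('R')
         for name in files]

# input='filename' makes the vectorizer open, decode, and process each
# file itself, one document at a time.
tfidf = TfidfVectorizer(input='filename', encoding='utf8',
                        preprocessor=preprocess,
                        tokenizer=tokenize,
                        stop_words='english')
tfs = tfidf.fit_transform(paths)

# fit_transform returns a scipy sparse matrix; keep it sparse on disk.
scipy.sparse.save_npz('arrdata4.npz', tfs)

If you specifically need a MATLAB .mat file, scipy.io.savemat can store the sparse matrix as-is; the important thing is never to densify tfs, since the dense version of a large TF-IDF matrix is usually what exhausts memory.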
Christophe Biocca