
I ran this code in the past and it worked fine. A couple of months later, it keeps causing the kernel to die.

I reinstalled and updated everything conda/Python related, but it doesn't seem to matter. It stalls on the last line, and no error message is printed.

It worked once and has failed 7 of the last 8 times.

corpus = df['reviewText']

import nltk
import numpy as np
import re
nltk.download('stopwords')

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower-case and remove special characters/extra whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)

Happy to hear any suggestions or ideas. If there is some way to display an error, or reason for the kernel dying, please let me know.

CJW
  • Does the code run correctly on sample data? If not, add some sample data to the question to help us investigate the issue. If yes, what's the size of your data? It may be a memory issue. – Qusai Alothman Aug 27 '18 at 11:25
  • I'm starting to realize it's a memory issue. The kernel also dies in other cases when there is a lot to run. This particular dataset has 5 million book reviews, and the code has worked with it before; I'll try a smaller dataset just to check. I'm having more and more kernel trouble lately and would love tips on how to get the most out of the kernel. – CJW Aug 28 '18 at 05:09

1 Answer


This seems to help:

# Get rid of accumulated garbage
import gc
gc.collect()
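
For context, here is a minimal sketch of how that call could be combined with processing the reviews in chunks, so the intermediate arrays stay small and garbage is collected between chunks. The chunking and the chunk_size value are my own assumptions, not part of the original code; corpus and normalize_document come from the question.

import gc
import numpy as np

normalize_corpus = np.vectorize(normalize_document)

chunk_size = 100_000  # assumed value; tune to the available RAM
chunks = []
for start in range(0, len(corpus), chunk_size):
    chunk = corpus.iloc[start:start + chunk_size]  # corpus is the pandas Series from the question
    chunks.append(normalize_corpus(chunk))
    gc.collect()  # free intermediate objects before the next chunk

norm_corpus = np.concatenate(chunks)
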
CJW