
I ran this code in the past and it worked fine. A couple of months later, it keeps causing the kernel to die.

I reinstalled and updated everything conda/Python related, but it doesn't seem to matter. It stalls on the last line, and no error message is printed.

It worked once and has failed 7 of the last 8 times.

corpus = df['reviewText']

import nltk
import numpy as np
import re
nltk.download('stopwords')

wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower-case and remove special characters/extra whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)

Happy to hear any suggestions or ideas. If there is some way to display an error, or reason for the kernel dying, please let me know.

CJW
  • Does the code run correctly on sample data? If not, add some sample data to the question to help us investigate the issue. If yes, what's the size of your data? It may be a memory issue. – Qusai Alothman Aug 27 '18 at 11:25
  • I'm starting to realize it's a memory issue. The kernel also dies in other cases when there is a lot to run. This particular dataset has 5 million book reviews, and the code has worked with it before; I'll try a smaller dataset just to check. I'm having more and more kernel trouble lately and would love tips on how to get the most out of the kernel. – CJW Aug 28 '18 at 05:09

1 Answer


This seems to help:

# Get rid of accumulated garbage
import gc
gc.collect()
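
For context, here is a minimal sketch of how that call could be combined with processing the reviews in chunks, so the intermediate arrays stay small and garbage is collected between chunks. The chunking and the chunk_size value are my own assumptions, not part of the original code; corpus and normalize_document come from the question.

import gc
import numpy as np

normalize_corpus = np.vectorize(normalize_document)

chunk_size = 100_000  # assumed value; tune to the available RAM
chunks = []
for start in range(0, len(corpus), chunk_size):
    chunk = corpus.iloc[start:start + chunk_size]  # corpus is the pandas Series from the question
    chunks.append(normalize_corpus(chunk))
    gc.collect()  # free intermediate objects before the next chunk

norm_corpus = np.concatenate(chunks)
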
CJW