I am trying to tokenize strings (rather long strings from 10-K reports) for a dataset with 56252 observations. However, the kernel keeps crashing, usually about a quarter of the way through the dataset.
I tried:
- Running the .py file without Jupyter, which gives the error:
  zsh: killed
- Simply using line[j].split(' ') instead of word_tokenize(line[j]) (in the code below)
- Reinstalling Python and Jupyter.
Nothing has worked so far, so any feedback would be appreciated.
from nltk.tokenize import word_tokenize
output = [[20120831,20120808,1199,1928175839, 'words section one report', 'words section onea report', 'words section seven report'],[20150621,20141231,1239,1124966666, 'more words fly kite big', 'different words compared section before', 'even more different words']]
item_1 = []
item_1a = []
item_7 = []
count = 0
for line in output:
    try:
        item_entry = []
        item_tokens_to_add = []
        for i in range(0, 4):
            item_entry.append(line[i])
        for j in range(4, 7):
            line_tokens = word_tokenize(line[j])
            item_tokens_to_add.append(line_tokens)
        item_1.append(item_entry + item_tokens_to_add[0])
        item_1a.append(item_entry + item_tokens_to_add[1])
        item_7.append(item_entry + item_tokens_to_add[2])
        count += 1
    except:
        pass
    print(str(count) + '/' + str(len(output)))
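The loop above keeps the token lists for every row in item_1, item_1a and item_7 at the same time. If memory turns out to be the limit (an assumption; nothing in the logs confirms it), one alternative is to write each row's tokens to disk as soon as they are produced. The sketch below is only an illustration: the function name stream_tokens, the JSON Lines output format, and the record keys are hypothetical choices, and it defaults to str.split so that it runs without NLTK. word_tokenize could be passed in as the tokenize argument instead.

```python
import json
from typing import Callable, Iterable, List, Sequence


def stream_tokens(
    rows: Iterable[Sequence],
    out_path: str,
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in for word_tokenize
) -> int:
    """Tokenize columns 4-6 of each row and write one JSON line per row.

    Only the current row's tokens are held in memory; returns the row count.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for row in rows:
            record = {
                "meta": list(row[:4]),        # the four leading ID/date fields
                "item_1": tokenize(row[4]),
                "item_1a": tokenize(row[5]),
                "item_7": tokenize(row[6]),
            }
            fh.write(json.dumps(record) + "\n")
            written += 1
    return written


# Same shape as the sample data in the question.
sample_rows = [
    [20120831, 20120808, 1199, 1928175839,
     'words section one report', 'words section onea report',
     'words section seven report'],
]
```

Calling stream_tokens(output, 'tokens.jsonl') would then replace the three growing lists with a file that can be read back line by line.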
Here is some information from the Jupyter log:
error 12:27:24.797: Disposing session as kernel process died ExitCode: undefined, Reason: /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/traitlets/traitlets.py:2202: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
warn(
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/traitlets/traitlets.py:2157: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '0a146f2b-abdf-428e-b31f-d08fde1c7026' instead of 'b"0a146f2b-abdf-428e-b31f-d08fde1c7026"'.
warn(
I have actually removed all punctuation from the text, so I don't know why I'm getting a message describing "extra quotes". Also, the kernel crashes at a different observation each time.
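Since the crash point moves around and the script outside Jupyter is simply killed, one thing that could be checked is how much memory the token lists take. The following is a rough sketch using the standard-library tracemalloc module. The helper name peak_tokenize_memory is hypothetical, and str.split again stands in for word_tokenize so the example runs on its own.

```python
import tracemalloc


def peak_tokenize_memory(texts, tokenize=str.split):
    """Tokenize every text and report the peak traced Python memory in bytes."""
    tracemalloc.start()
    tokens = [tokenize(t) for t in texts]      # all token lists kept, as in the loop
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak)
    tracemalloc.stop()
    return len(tokens), peak


# Hypothetical usage: measure a small batch, then scale up by hand.
n_texts, peak_bytes = peak_tokenize_memory(['words section one report'] * 1000)
print(n_texts, peak_bytes)
```

Running this on a few hundred real rows and scaling the peak to 56252 rows would give a rough idea of whether the full result can fit in RAM. tracemalloc only sees allocations made through Python's allocator, so the number is an estimate, not the full process size.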