I use the following code to clean my dataset and print all tokens (words).
import re
import spacy

with open(".data.csv", "r", encoding="utf-8") as file:
    text = file.read()

# Replace every character outside the whitelist with a space, then lowercase.
text = re.sub(r"[^a-zA-Z0-9ß\.,!\?-]", " ", text)
text = text.lower()

nlp = spacy.load("de_core_news_sm")
doc = nlp(text)
for token in doc:
    print(token.text)
When I run this code on a small string, it works fine. But when I use a 50-megabyte CSV file, I get this message:
Text of length 62235045 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
When I increase the limit to this size, my computer struggles badly. How can I fix this? Wanting to tokenize this amount of data can't be that unusual.
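One possible direction, as a minimal sketch and not something tested on this file: since only the tokens are needed, the parser and NER mentioned in the error message could be skipped, and the file could be fed to spaCy line by line instead of as one 62-million-character string. The helper names `clean_lines` and `print_tokens` are made up for illustration. The sketch assumes that no single line exceeds `nlp.max_length` and that a blank German pipeline from `spacy.blank("de")` tokenizes closely enough to `de_core_news_sm` for this purpose.

```python
import re
from typing import Iterable, Iterator

# Same character whitelist as the regex in the code above.
_DISALLOWED = re.compile(r"[^a-zA-Z0-9ß\.,!\?-]")


def clean_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty line cleaned and lowercased, one at a time."""
    for line in lines:
        cleaned = _DISALLOWED.sub(" ", line).lower().strip()
        if cleaned:
            yield cleaned


def print_tokens(path: str) -> None:
    """Tokenize a large file line by line without building one huge Doc."""
    import spacy  # imported here so clean_lines works without spaCy installed

    # Assumption: only tokens are needed, so a blank German pipeline
    # (tokenizer only, no parser or NER) is used instead of de_core_news_sm.
    nlp = spacy.blank("de")
    with open(path, "r", encoding="utf-8") as file:
        # nlp.pipe consumes the generator lazily and yields one Doc per line.
        for doc in nlp.pipe(clean_lines(file)):
            for token in doc:
                print(token.text)
```

Calling `print_tokens(".data.csv")` would then print the tokens while only ever holding small documents in memory. If a component of `de_core_news_sm` turns out to be needed after all, the same line-by-line loop should still apply, with `spacy.load` in place of `spacy.blank`.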