
I want to form n-grams using the NLTK Reuters corpus. I tested my n-gram code on a small corpus saved on my local disk as a text file, loaded with:

import nltk

with open('dummytext.txt', encoding='utf8') as f:
    file = f.read()

Now that my n-gram probability code makes sense to me, I want to use the NLTK Reuters corpus, which is huge. When I do the following:

import nltk
from nltk.corpus import reuters
file = reuters.words()

the processing to form unigrams runs seemingly forever.

How do I unpack the NLTK corpus into a string variable so I can form n-grams with NLTK?
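One way to sidestep the slowdown is to never build the corpus as one giant string at all: n-grams can be counted lazily from any iterable of tokens. The sketch below uses only the standard library; with NLTK installed, the same function could presumably be fed `reuters.words()` directly (an assumption, not tested here):

```python
from collections import Counter
from itertools import tee

def ngram_counts(tokens, n=2):
    """Count n-grams from any iterable of tokens without
    materializing the whole corpus in memory as one string."""
    # Create n staggered copies of the token stream...
    iters = tee(iter(tokens), n)
    # ...and advance the i-th copy by i tokens.
    for i, it in enumerate(iters):
        for _ in range(i):
            next(it, None)
    # zip() then yields each sliding window of n tokens.
    return Counter(zip(*iters))

# Small stand-in token list; reuters.words() would slot in here.
tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, n=2)
print(counts[("the", "cat")])  # 1
```

Because the function consumes the tokens as a stream, the cost is one pass over the corpus rather than a join, a regex pass, and a full re-tokenization.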

James Z
  • Why do you think a text file that you'd have to split into words yourself would be faster than a dedicated corpus reader? Loading the words from `reuters` and counting them takes about `1.9` seconds on my machine; if your time is significantly off, I expect your algorithm is suboptimal or the problem you're trying to solve is just a hard one. Anyway, without some code we can't exactly help you figure out what's so slow. – cafce25 Nov 18 '22 at 18:27
  • I am very new to NLP and Python as well, so I don't know if this is the right way to do it. This is what I did: `import nltk from nltk.corpus import reuters reuters_dataset=" ".join(reuters.words()) from nltk import word_tokenize import re corpus=re.sub(r'[^\w\s]',"",reuters_dataset) corpus_new=corpus.lower() wor=nltk.word_tokenize(corpus_new) words=wor[:30000] # Sliced the dataset to reduce processing complexity`. I want to use the whole corpus, not slice it. Let me know if there is a better way. – Swapnil Nov 19 '22 at 22:19
  • Please add the code as an edit to the original question, since in comments it gets horribly mangled. – cafce25 Nov 19 '22 at 22:24
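The approach in the comment above (join everything into one string, strip punctuation with a regex, lowercase, then re-tokenize) does the expensive work twice, since `reuters.words()` is already tokenized. A cheaper pattern is to normalize each token as it streams by; the sketch below is a hedged illustration of that idea, not the poster's actual code:

```python
def normalize(tokens):
    """Lowercase each token and drop pure-punctuation tokens,
    one token at a time, instead of joining the whole corpus
    into a single string and re-tokenizing it."""
    for tok in tokens:
        tok = tok.lower()
        # Keep the token only if it contains at least one
        # alphanumeric character (drops ',', '.', etc.).
        if any(ch.isalnum() for ch in tok):
            yield tok

# reuters.words() is already a token sequence and could be fed in
# directly; a small stand-in list is used here.
raw = ["ASIAN", "EXPORTERS", ",", "FEAR", "."]
print(list(normalize(raw)))  # ['asian', 'exporters', 'fear']
```

Because `normalize` is a generator, it composes with a streaming n-gram counter without ever holding the full corpus as one string.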

0 Answers