
I'm new to Python and need help with language modeling in NLTK.

I'm trying to generate a sentence starting with "he said" using a trigram model, but I get the following error:

Traceback (most recent call last):
  File "C:\Users\PycharmProjects\homework3 3\main.py", line 77, in <module>
    suffix = pick_word(d[prefix])
  File "C:\Users\PycharmProjects\homework3 3\main.py", line 71, in pick_word
    return random.choice(sents)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2288.0_x64__qbz5n2kfra8p0\lib\random.py", line 378, in choice
    return seq[self._randbelow(len(seq))]
IndexError: list index out of range

I don't understand why it complains that the list index is out of range. I expected it to take the Reuters sentences, pick a word at random, and return it as the suffix.
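
For context, the same IndexError can be reproduced without NLTK. This is a minimal sketch, assuming the lookup key is simply missing from the dictionary (the toy table `d` below is hypothetical, not the Reuters data): `defaultdict(list)` hands back a new empty list for a key it has never seen, and `random.choice` raises `IndexError` on an empty sequence.

```python
import random
from collections import defaultdict

# Toy stand-in for the trigram table; keys are (word1, word2) prefixes.
d = defaultdict(list)
d["the", "company"].extend(["said", "said", "reported"])

prefix = ("he", "said")        # never added to d
candidates = d[prefix]         # defaultdict silently creates an empty list
print(candidates)              # []

try:
    random.choice(candidates)  # choosing from an empty sequence
except IndexError as exc:
    error_name = type(exc).__name__
    print("random.choice failed with", error_name)
```

Printing `len(d[prefix])` just before the call to `pick_word` is a quick way to check whether this is what happens in the real script.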

Here's the whole code. Please focus only on the trigram portion, as the rest is incomplete.

# imports
import string
import random

import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters, stopwords
from collections import defaultdict
from nltk import FreqDist, ngrams

# input the reuters sentences
sents = reuters.sents()

# build the removal list: stopwords and punctuation
stop_words = set(stopwords.words('english'))
# use a local name instead of overwriting string.punctuation in the module
punctuation = string.punctuation + '"' + '"' + '-' + '''+''' + '—'
removal_list = list(stop_words) + list(punctuation) + ['lt', 'rt']

# generate unigrams bigrams trigrams
unigram = []
trigram = []
tokenized_text = []

for sentence in sents:
    sentence = list(map(lambda x: x.lower(), sentence))
for word in sentence:
    if word == '.':
        sentence.remove(word)
    else:
        unigram.append(word)

tokenized_text.append(sentence)
trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))

# keep only the n-grams that contain at least one word not in removal_list
def remove_stopwords(x):
    y = []
    for pair in x:
        # True as soon as one word of the n-gram is not a removable word
        if any(word not in removal_list for word in pair):
            y.append(pair)
    return y

trigram = remove_stopwords(trigram)

# generate frequency of n-grams
freq_tri = FreqDist(trigram)

d = defaultdict(list)

# Trigrams: map each (a, b) prefix to its following words, repeated by frequency
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[a, b].extend([c] * freq_tri[a, b, c])

#Next word prediction
s = ''

def pick_word(sents):
    "Chooses a random element."
    return random.choice(sents)

prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])

What am I doing wrong? Am I wrong to assume that I'm passing the Reuters sentences to pick_word so it can choose a word at random?
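
One thing worth checking is the indentation of the sentence loop. The sketch below assumes the indentation shown in the post is what actually runs, and uses hypothetical toy data in place of `reuters.sents()`. When the loop body sits at module level, the loop only rebinds `sentence`, so the code after it sees just the last sentence, and most prefixes never reach the trigram table.

```python
# Hypothetical toy data standing in for reuters.sents().
toy_sents = [["He", "said", "yes"], ["Prices", "rose", "sharply"]]

# Body at module level: the loop only rebinds `sentence`, so the
# append below runs once, with the last sentence.
flat = []
for sentence in toy_sents:
    sentence = [w.lower() for w in sentence]
flat.append(sentence)

# Body nested inside the loop: the append runs for every sentence.
nested = []
for sentence in toy_sents:
    sentence = [w.lower() for w in sentence]
    nested.append(sentence)

print(flat)    # [['prices', 'rose', 'sharply']]
print(nested)  # [['he', 'said', 'yes'], ['prices', 'rose', 'sharply']]
```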

I thought I might be passing the wrong list to pick_word, so I tried tokenized_text instead. I get the same error, so I think my assumption or understanding is wrong, but I'm not sure which part.
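
For comparison, here is a hedged sketch of what a generation loop over a prefix table could look like. It is not the code from the post: the toy corpus, the `table` name, and the `generate` helper are all hypothetical, and trigrams are built with plain `zip` rather than `nltk.ngrams` so the example runs without downloads. It shows two things that the posted loop leaves out: moving the prefix forward after each word, and stopping when a prefix has no known continuation.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus standing in for reuters.sents().
toy_sents = [
    ["he", "said", "the", "company", "will", "expand"],
    ["he", "said", "the", "market", "will", "recover"],
]

# Map each two-word prefix to every word seen right after it.
table = defaultdict(list)
for sent in toy_sents:
    for a, b, c in zip(sent, sent[1:], sent[2:]):
        table[a, b].append(c)

def generate(prefix, max_words=19, seed=None):
    """Extend prefix one word at a time until no continuation is known."""
    rng = random.Random(seed)
    words = list(prefix)
    for _ in range(max_words):
        key = (words[-2], words[-1])     # the prefix moves forward each step
        candidates = table.get(key)      # .get avoids creating empty entries
        if not candidates:               # unseen prefix: stop instead of crashing
            break
        words.append(rng.choice(candidates))
    return " ".join(words)

sentence = generate(("he", "said"), seed=0)
print(sentence)
print(generate(("never", "seen")))       # prints just "never seen"
```

Using `table.get(key)` rather than `table[key]` is a design choice in this sketch: it keeps the lookup from adding empty lists to the table for every unseen prefix.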

  • The traceback shows an error raised at the line `return random.choice(sents)`, but that line does not exist in your code sample. Make sure the code in your question is the same as the code that causes the bug you want help with. And please simplify the code to the shortest possible that still gives you this error. [mcve] – Håken Lid Nov 08 '22 at 18:40
  • Ok. I've simplified the code and made sure it's the same code. – Jesper Ezra Nov 08 '22 at 19:02
