I'm making a Discord bot that posts randomly generated sentences into the chat every few seconds. I'm trying to use the nltk module to make the sentences more coherent, but I'm stuck on an error and can't figure it out.

import asyncio
import random
import discord.ext.commands
import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = ["::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

with open("/root/sample.txt") as f:
    text = f.read()

text_model = POSifiedText(text, state_size=1)

client = discord.Client()
async def background_loop():
    await client.wait_until_ready()
    while not client.is_closed:
        channel = client.get_channel('channelid')
        messages = [(text_model.make_sentence(tries=8, max_overlap_total=10,default_max_overlap_ratio=0.5))]
        await client.send_message(channel, random.choice(messages))
        await asyncio.sleep(10)

client.loop.create_task(background_loop())
client.run("token")

Here's the error from the output log:

Traceback (most recent call last):
  File "/root/untitled/Loop.py", line 21, in <module>
    text_model = POSifiedText(text, state_size=1)
  File "/usr/local/lib/python3.5/dist-packages/markovify/text.py", line 24, in __init__
    runs = list(self.generate_corpus(input_text))
  File "/root/untitled/Loop.py", line 11, in word_split
    words = [": :".join(tag) for tag in nltk.pos_tag(words) ]
  File "/usr/local/lib/python3.5/dist-packages/nltk/tag/__init__.py", line 129, in pos_tag
    return _pos_tag(tokens, tagset, tagger)    
  File "/usr/local/lib/python3.5/dist-packages/nltk/tag/__init__.py", line 97, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "/usr/local/lib/python3.5/dist-packages/nltk/tag/perceptron.py", line 152, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/usr/local/lib/python3.5/dist-packages/nltk/tag/perceptron.py", line 152, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "/usr/local/lib/python3.5/dist-packages/nltk/tag/perceptron.py", line 227, in normalize
    elif word[0].isdigit():
IndexError: string index out of range
Museman
  • If `word[0].isdigit():` throws that error, `word` is sometimes the empty string. – John Coleman Mar 07 '17 at 00:34
  • To add to my previous comment -- there are a lot of function calls in that traceback. If the empty word entered in somewhere downstream (in code that you are simply calling) then it might be difficult to debug. On the other hand, maybe the fix is as simple as filtering empty strings out of `re.split(self.word_split_pattern, sentence)`. – John Coleman Mar 07 '17 at 00:43
  • `words = [w for w in words if len(w) > 0]` before you pass it to `nltk.pos_tag()`. I'm just guessing, but it seems like it can't hurt to try. – John Coleman Mar 07 '17 at 00:45
  • That fixes the error, but makes the generated sentences have no spaces. – Museman Mar 07 '17 at 00:51
  • Fixed the problem. Added some spaces to `words = [" :: ".join(tag) for tag in nltk.pos_tag(words) ]` – Museman Mar 07 '17 at 00:57
  • I'll post it as an answer then, in case other people run into that problem. Since the spacing issue is a separate question which you have already answered, I won't include that in the answer. – John Coleman Mar 07 '17 at 01:13

1 Answer

The fact that `word[0].isdigit()` throws an IndexError implies that `word` is sometimes an empty string. The most likely cause is that your regex split occasionally produces empty strings, for example when a sentence has leading or trailing whitespace.
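You can reproduce where the empty strings come from with the standard library alone; splitting on a whitespace pattern (markovify's default split behavior) leaves empty tokens at the edges:

```python
import re

# A leading or trailing delimiter produces empty strings in the result,
# and nltk.pos_tag then hands those to the tagger, which indexes word[0].
words = re.split(r"\s+", " hello world ")
print(words)  # ['', 'hello', 'world', '']
```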

The solution is to filter them out: after

words = re.split(self.word_split_pattern, sentence)

add the line

words = [w for w in words if len(w) > 0]
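Putting the fix in context, the split/join pair would look like the sketch below. To keep it runnable without downloading the NLTK tagger model, `fake_pos_tag` stands in for `nltk.pos_tag` (it tags everything "NN" just to show the data flow); in the real class you would keep `nltk.pos_tag` and the `self.word_split_pattern` attribute:

```python
import re

# Stand-in for nltk.pos_tag so this sketch runs without NLTK's model data.
def fake_pos_tag(tokens):
    return [(tok, "NN") for tok in tokens]

def word_split(sentence, word_split_pattern=r"\s+"):
    words = re.split(word_split_pattern, sentence)
    words = [w for w in words if len(w) > 0]  # the fix: drop empty strings
    return ["::".join(tag) for tag in fake_pos_tag(words)]

def word_join(words):
    return " ".join(word.split("::")[0] for word in words)

tokens = word_split(" The cat sat. ")
print(tokens)             # ['The::NN', 'cat::NN', 'sat.::NN']
print(word_join(tokens))  # The cat sat.
```

As noted in the comments, if the generated sentences lose their spaces, the separator in the join may need adjusting (the asker settled on `" :: ".join(...)`).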
John Coleman