I have successfully used the concordance() method in NLTK with my own text file that I read in through the Gutenberg Corpus:

    import nltk
    from nltk.text import Text

    bom = open('sentences-with-emoji.txt')
    bom = Text(nltk.corpus.gutenberg.words('/my-own-text-file.txt'))
    bom.concordance('messiah')

I say "through" because, as far as I can tell, the concordance() method only reads words via the specified corpus, which here is Gutenberg. The Gutenberg corpus doesn't contain any emoji. So when I try a different file that does contain emoji, like this:

    import nltk
    from nltk.text import Text

    bom = open('sentences-with-emoji.txt')
    bom = Text(nltk.corpus.gutenberg.words('/sentences-with-emoji.txt'))
    bom.concordance('')

I get the response:

    No matches

Do I have to create an annotated corpus from my /sentences-with-emoji.txt file (using the process described in "Creating a new corpus with NLTK") in order to use the concordance() method with emoji?

matt_07734

1 Answer

nltk.text.Text expects a list of tokens. You don't have to create a new corpus or make the extra round trip through gutenberg.words; it is sufficient to load a raw text file and tokenize it yourself.

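    import nltk
    from nltk.text import Text

    # raw = open('sentences-with-emoji.txt', encoding='utf-8').read()
    raw = 'word  word'
    tokens = nltk.word_tokenize(raw)  # split the raw string into a list of tokens

    text = Text(tokens)
    text.concordance('')

This prints:

    Displaying 1 of 1 matches:
                                      word  word
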
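
As a rough illustration of why the tokenization step matters, the sketch below uses only the standard library. It assumes that a concordance lookup compares the query against whole tokens, so an emoji glued to a neighboring word is not found unless the tokenizer splits it off. The names `TOKEN_RE`, `tokenize`, and `concordance` are hypothetical helpers written for this example; they are not NLTK's implementation, and the regex is a simplification that would break apart multi-code-point emoji (for example, ones with skin-tone modifiers).

```python
import re

# Hypothetical helpers for illustration only; this is not NLTK's implementation.
# A token is either a run of word characters or one non-word, non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(raw):
    """Split text into words and single symbols (punctuation, emoji)."""
    return TOKEN_RE.findall(raw)


def concordance(tokens, target, window=3):
    """Return (left context, match, right context) for each exact token match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


emoji = "\U0001F600"            # written as an escape so the source stays ASCII
raw = "word" + emoji + " word"  # the emoji is glued to the preceding word

glued = concordance(raw.split(), emoji)     # whitespace split keeps it glued
split = concordance(tokenize(raw), emoji)   # the emoji becomes its own token

print(len(glued), len(split))
```

With a plain whitespace split the emoji never appears as a token of its own, so the lookup finds nothing; once the tokenizer separates it, the same lookup finds one match. If a concordance over your own file reports no matches, printing the token list and checking how the emoji was tokenized is a reasonable first step.
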
Jan Trienes
  • So I can replicate what you did above; however, when I try to access an emoji in the file I get `no matches`. Any ideas how to get a concordance of emoji from a file (instead of a variable)? – matt_07734 Dec 11 '17 at 17:10
  • @matt_07734 can you update the question with the way you are trying to load the file? – Jan Trienes Dec 11 '17 at 17:42
  • @matt_07734 you have to tokenize your raw text string first. Afterwards pass this list of tokens to the Text() constructor – Jan Trienes Dec 11 '17 at 18:01
  • Your response works as posted. I was using the wrong text file... Silly mistake on my part. Thanks for the answer though! – matt_07734 Dec 12 '17 at 03:31