37

I'm a Literature grad student, and I've been going through the O'Reilly book on Natural Language Processing (nltk.org/book). It looks incredibly useful. I've played around with all the example texts and example tasks in Chapter 1, like concordances. I now know how many times Moby Dick uses the word "whale." The problem is, I can't figure out how to do these calculations on one of my own texts. I've found information on how to create my own corpora (Ch. 2 of the O'Reilly book), but I don't think that's exactly what I want to do. In other words, I want to be able to do

import nltk 
text1.concordance('yellow')

and get the places where the word 'yellow' is used in my text. At the moment I can do this with the example texts, but not my own.

I'm very new to Python and programming, so this stuff is very exciting but very confusing.

Jonathan

3 Answers

73

Found the answer myself. That's embarrassing. Or awesome.

From Ch. 3:

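import nltk

# Read the whole file, then tokenize it and wrap the tokens in a Text object.
# (The old 'rU' mode was removed in Python 3.11; plain 'r' does the same job.)
with open('my-file.txt', 'r') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)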

Does the trick.
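
For anyone who wants to try this end to end before pointing it at a real file, here is a minimal sketch. The sample sentence and the temporary file are made up for illustration, and I've swapped in wordpunct_tokenize for word_tokenize only because it needs no extra tokenizer data to be downloaded; the rest follows the steps above.

```python
import os
import tempfile

import nltk
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text standing in for your own file.
sample = "The yellow leaves fell early. She wore a yellow scarf."

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf8"
) as tmp:
    tmp.write(sample)
    path = tmp.name

# Same steps as above: read the file, tokenize, wrap in nltk.Text.
with open(path, "r", encoding="utf8") as f:
    raw = f.read()
os.remove(path)

tokens = wordpunct_tokenize(raw)
text = nltk.Text(tokens)

# Print every occurrence of the word with its surrounding context.
text.concordance("yellow")

# Count how often the word occurs.
yellow_count = text.count("yellow")
print(yellow_count)
```

Once text is built this way, it should behave like text1 from the book examples, so the other Chapter 1 methods can be called on it too.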

Jonathan
14

For a structured import of multiple files:

import nltk
from nltk.corpus import PlaintextCorpusReader

# Regex (as a raw string) or list of file names
files = r".*\.txt"

# Read every matching file under the corpus root directory
corpus0 = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(corpus0.words())

see: NLTK 3 book / section 1.9

Raffael
  • I was happy to see this, since the previous method (above) didn't work for me. Alas, another error message. It didn't like the line involving PlaintextCorpusReader: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 50: invalid continuation byte – Rafael_Espericueta Jul 12 '15 at 01:54
  • To resolve the utf8 error, add an encoding: PlaintextCorpusReader(path, '.*', encoding='latin-1') – Alex Nano May 29 '18 at 17:16
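
As a rough illustration of the error discussed in these comments, the sketch below uses a made-up byte string (not taken from anyone's actual file) in which 0xe8 is the Latin-1 byte for "è". Decoding it as UTF-8 fails; decoding it as Latin-1 succeeds. It uses only the standard library.

```python
# Hypothetical bytes as they might sit in a Latin-1 encoded file:
# 0xe8 is the single-byte Latin-1 code for "è".
data = b"caf\xe8 au lait"

# In UTF-8, 0xe8 would have to start a multi-byte sequence,
# so the byte that follows it makes the decode fail.
try:
    data.decode("utf8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err)

# Decoding with the encoding the bytes were written in works.
decoded = data.decode("latin-1")
print(decoded)
```

One caveat: Latin-1 maps every possible byte to some character, so it never raises this error. If the file is really in a different encoding, the error goes away but some characters may come out wrong, so it is worth checking a few accented words afterwards.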
0

If your text file is in utf8 format, try the following variation:

import nltk

# Open the file with an explicit UTF-8 encoding, then tokenize as before
with open('my-file.txt', 'r', encoding='utf8') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

goldenduck