How do I create my own NLTK text from a text file?

Question

I'm a Literature grad student, and I've been going through the O'Reilly book on Natural Language Processing (nltk.org/book). It looks incredibly useful. I've played around with all the example texts and example tasks in Chapter 1, like concordances. I now know how many times Moby Dick uses the word "whale." The problem is, I can't figure out how to do these calculations on one of my own texts. I've found information on how to create my own corpora (Ch. 2 of the O'Reilly book), but I don't think that's exactly what I want to do. In other words, I want to be able to do

import nltk 
text1.concordance('yellow')

and get the places where the word 'yellow' is used in my text. At the moment I can do this with the example texts, but not my own.

I'm very new to Python and programming, so this stuff is very exciting, but very confusing.

This question illustrates some deep problems with the nltk documentation. I sympathize. — Jon Kiparsky, Jan 19 '21 at 01:15

score 73 · Accepted Answer · answered May 06 '12 at 00:22

Found the answer myself. That's embarrassing. Or awesome.

From Ch. 3:

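import nltk

# Plain 'r' already handles all line endings in Python 3; 'rU' was removed in 3.11.
with open('my-file.txt', 'r') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)  # requires the 'punkt' tokenizer data
text = nltk.Text(tokens)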

Does the trick.
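
For anyone who wants to try this end to end, here is a minimal sketch of the same recipe wrapped in a function, followed by the concordance call from the question. The helper name `load_text` and the sample file contents are assumptions made for illustration, not part of the book's example. It uses `wordpunct_tokenize` rather than `word_tokenize` only so that it runs without downloading any tokenizer data.

```python
import tempfile
from pathlib import Path

import nltk
from nltk.tokenize import wordpunct_tokenize


def load_text(path, encoding="utf-8"):
    """Read a plain-text file and wrap its tokens in an nltk.Text."""
    raw = Path(path).read_text(encoding=encoding)
    # wordpunct_tokenize is regex-based and needs no downloaded models;
    # nltk.word_tokenize(raw) can be swapped in once 'punkt' is installed.
    tokens = wordpunct_tokenize(raw)
    return nltk.Text(tokens)


# Hypothetical sample file, created only so the sketch runs on its own.
sample = Path(tempfile.mkdtemp()) / "my-file.txt"
sample.write_text("The yellow wallpaper. A yellow smell.", encoding="utf-8")

text = load_text(sample)
yellow_count = text.count("yellow")  # exact, case-sensitive token matches
text.concordance("yellow")           # prints each occurrence with context
```

Note that `count` matches tokens exactly, so "Yellow" and "yellow" are counted separately, while `concordance` ignores case.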

Jonathan

Excellent. I was just trying to answer this question myself; very glad I stumbled on your (self)answer. -- Another Literature Scholar – cforster Aug 14 '13 at 19:48
For this to work, I first needed to download "punkt": nltk.download('punkt') – Rafael_Espericueta Jul 12 '15 at 02:08
What does the rU do? Found it: f = open('myfile.txt', 'rU') # rU means "read", and handles line endings – ProfVersaggi Oct 09 '15 at 14:25
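
On the 'rU' question: in Python 3 the 'U' flag is no longer needed, and it was removed in Python 3.11, because text mode already translates every line-ending style to '\n' by default. A small sketch that shows this, using a throwaway file whose name and contents are made up for the demonstration:

```python
import tempfile
from pathlib import Path

# Hypothetical file mixing Windows (\r\n), old Mac (\r) and Unix (\n) endings.
path = Path(tempfile.mkdtemp()) / "mixed-endings.txt"
path.write_bytes(b"one\r\ntwo\rthree\n")

# Plain text mode ('r') with the default newline handling normalizes
# all three styles to '\n', which is what 'rU' used to request.
with open(path, "r", encoding="utf-8") as f:
    raw = f.read()

print(repr(raw))
```

So on current Python versions, `open('my-file.txt', 'r')` gives the same line-ending behavior the 'rU' mode used to provide.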

score 14 · Answer 2 · answered Mar 05 '15 at 14:17

For a structured import of multiple files:

import nltk
from nltk.corpus import PlaintextCorpusReader

# Regex (as a raw string) or list of file names
files = r".*\.txt"

corpus0 = PlaintextCorpusReader("/path/", files)
corpus = nltk.Text(corpus0.words())

See: NLTK 3 book, Ch. 2, section 1.9
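
A runnable sketch of the multi-file approach, assuming a small throwaway corpus (the directory, file names and contents below are invented for illustration). It also shows that `words()` accepts a file name, which is handy when you want a separate `Text` per file rather than one over the whole corpus.

```python
import tempfile
from pathlib import Path

import nltk
from nltk.corpus import PlaintextCorpusReader

# Hypothetical two-file corpus, created only so the sketch is runnable.
root = Path(tempfile.mkdtemp())
(root / "chapter1.txt").write_text("A yellow room.", encoding="utf-8")
(root / "chapter2.txt").write_text("Yellow light, yellow dust.", encoding="utf-8")

reader = PlaintextCorpusReader(str(root), r".*\.txt", encoding="utf-8")
file_ids = reader.fileids()  # names of the files that matched the pattern

# One Text over the whole corpus, and one restricted to a single file.
whole = nltk.Text(reader.words())
chapter2 = nltk.Text(reader.words("chapter2.txt"))

whole_count = whole.count("yellow")        # exact, case-sensitive matches
chapter2_count = chapter2.count("yellow")
```

`reader.words()` uses the reader's own regex-based word tokenizer, so this part works without the 'punkt' download mentioned under the accepted answer.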

Raffael

I was happy to see this, since the previous method (above) didn't work for me. Alas, another error message. It didn't like the line involving PlaintextCorpusReader: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 50: invalid continuation byte – Rafael_Espericueta Jul 12 '15 at 01:54
to resolve utf8 error, add encoding: PlaintextCorpusReader(path, '.*', encoding='latin-1') – Alex Nano May 29 '18 at 17:16
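
If you don't know in advance which encoding a file uses, one possible approach is to try UTF-8 first and fall back to Latin-1. This is only a sketch: the helper name `read_text` and the sample bytes are assumptions for illustration. Latin-1 can decode any byte sequence, so the fallback never fails, but it may produce wrong characters if the file is actually in some other encoding.

```python
import tempfile
from pathlib import Path


def read_text(path, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next one
    raise ValueError(f"none of {encodings} could decode {path}")


# Hypothetical Latin-1 file: byte 0xe8 is 'è' there but invalid UTF-8 here,
# which reproduces the kind of error quoted in the comment above.
sample = Path(tempfile.mkdtemp()) / "latin1.txt"
sample.write_bytes(b"cr\xe8me")

raw, used = read_text(sample)
print(used, raw)
```

The resulting string can then be passed to the tokenizer as in the other answers.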

score 0 · Answer 3 · answered Sep 07 '20 at 09:35

If your text file is UTF-8 encoded, try the following variation:

import nltk

with open('my-file.txt', 'r', encoding='utf8') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

goldenduck
