
I am going through this wonderful tutorial.

I downloaded a collection called book:

>>> import nltk
>>> nltk.download()

and imported texts:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811

I can then run commands on these texts:

>>> text1.concordance("monstrous")

How can I run these nltk commands on my own dataset? Are these collections the same as the object book in python?

    Do note that `import nltk` might not be necessary when you only need the `nltk.book` functions. – alvas Jul 19 '13 at 10:15

2 Answers


You're right that it's quite hard to find the documentation for the book.py module, so we have to get our hands dirty and look at the code (see here). Looking at book.py, here's what you need to do to get concordance and all the other fancy stuff from the book module working on your own texts:

Firstly, you have to put your raw texts into NLTK's corpus class; see Creating a new corpus with NLTK for more details.

Secondly, you read the corpus words into NLTK's Text class. Then you can use the functions shown in http://nltk.org/book/ch01.html

import os

from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

# For example, I create two example text files
text1 = '''
This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars.
'''
text2 = '''
One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep.
'''
# Creating the corpus directory and writing the texts into it
corpusdir = './mycorpus/'
os.makedirs(corpusdir, exist_ok=True)
with open(corpusdir + 'text1.txt', 'w') as fout:
    fout.write(text1)
with open(corpusdir + 'text2.txt', 'w') as fout:
    fout.write(text2)

# Read the example corpus into NLTK's corpus class.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')

# Read the NLTK corpus into NLTK's Text class,
# where your book-like concordance search is available
mytext = Text(mycorpus.words())

mytext.concordance('foo')

NOTE: you can use other NLTK CorpusReaders and even specify custom paragraph/sentence/word tokenizers and an encoding, but for now we'll stick to the defaults.
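For instance, here is a minimal sketch of that customization route, reusing the ./mycorpus/ directory created above; the regexp tokenizer and the '.*\.txt' fileid pattern are only illustrative choices, not anything the book module requires:

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

corpusdir = './mycorpus/'

# Restrict the reader to .txt files, split words on alphanumeric runs only,
# and decode the files as UTF-8.
mycorpus = PlaintextCorpusReader(
    corpusdir,
    r'.*\.txt',
    word_tokenizer=RegexpTokenizer(r'\w+'),
    encoding='utf8',
)

print(mycorpus.fileids())
print(mycorpus.words('text1.txt')[:10])

Here word_tokenizer and encoding are ordinary keyword arguments of PlaintextCorpusReader; a custom sentence tokenizer can be passed the same way via sent_tokenizer.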

  • Note for others...The helpful sample code above has some (trivial) errors, so don't use it literally or get caught up in understanding every detail. – Dan Nissenbaum Aug 04 '17 at 22:32
  • Ah there's some really old C like coding style from old Python, let me edit it =) – alvas Aug 04 '17 at 23:51

Text Analysis with NLTK Cheatsheet from blogs.princeton.edu: https://blogs.princeton.edu/etc/files/2014/03/Text-Analysis-with-NLTK-Cheatsheet.pdf

Working with your own texts:

Open a file for reading

file = open('myfile.txt') 

Make sure you are in the correct directory before starting Python - or give the full path specification.

Read the file:

t = file.read() 

Tokenize the text:

import nltk
tokens = nltk.word_tokenize(t)

Convert to NLTK Text object:

text = nltk.Text(tokens)
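
Putting those steps together, a minimal end-to-end sketch might look like the following; myfile.txt is just the placeholder filename from above, 'monstrous' is only an example query, and the punkt download is a one-off step needed by word_tokenize:

import nltk

# One-off download of the tokenizer models that word_tokenize relies on.
nltk.download('punkt')

# Read your own file (assumes myfile.txt is in the current directory).
with open('myfile.txt') as f:
    raw = f.read()

# Tokenize and wrap the tokens in an NLTK Text object.
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)

# The book-style methods are now available on your own data.
text.concordance('monstrous')
text.collocations()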