Questions tagged [nltk]

The Natural Language ToolKit (NLTK) is a Python library for computational linguistics. It is currently available for Python versions 2.7 or 3.2+.

NLTK includes many common natural language processing tools, including a tokenizer, a chunker, a part-of-speech (POS) tagger, a stemmer, a lemmatizer, and various classifiers such as Naive Bayes and decision trees. In addition to these tools, NLTK provides many common corpora, including the Brown Corpus, Reuters, and WordNet. The NLTK corpora collection also includes a few non-English corpora in Portuguese, Polish, and Spanish.
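As a rough illustration of a few of the tools listed above, the sketch below tokenizes a sentence, stems two words, and builds bigrams. It deliberately sticks to components that are rule-based and should not need any downloaded corpus or model data (`wordpunct_tokenize`, `PorterStemmer`, `nltk.util.ngrams`). The sample sentence, word list, and variable names are made up for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

text = "To be or not to be"  # illustrative sample sentence

# Regex-based tokenizer: splits on whitespace and punctuation,
# and needs no downloaded models.
tokens = wordpunct_tokenize(text.lower())

# The Porter stemmer is purely rule-based, so it also needs no corpus data.
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in ["running", "tagged"]]

# ngrams() yields tuples of n consecutive tokens; n=2 gives bigrams.
bigrams = list(ngrams(tokens, 2))

print(tokens)
print(stems)
print(bigrams)
```

Tools that depend on trained models or corpora (for example the POS tagger or WordNet lemmatizer) typically require fetching the relevant resource with the NLTK Downloader first.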

The book "Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper is freely available online under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US Licence. A citable paper, "NLTK: the natural language ToolKit", was first published in 2003 and again in 2006, so that researchers can acknowledge NLTK's contribution to ongoing research in computational linguistics.

NLTK is currently distributed under an Apache version 2.0 licence.

7139 questions
26 votes
7 answers

Unable to install nltk on Mac OS El Capitan

I did sudo pip install -U nltk as suggested by the nltk documentation. However, I am getting the following output: Collecting nltk Downloading nltk-3.0.5.tar.gz (1.0MB) 100% |████████████████████████████████| 1.0MB 516kB/s Collecting…
proutray
26 votes
14 answers

Resource 'corpora/wordnet' not found on Heroku

I'm trying to get NLTK and wordnet working on Heroku. I've already done heroku run python nltk.download() wordnet pip install -r requirements.txt But I get this error: Resource 'corpora/wordnet' not found. Please use the NLTK Downloader to…
user1881006
25 votes
7 answers

Determine if text is in English?

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true: [ "this is some text written in English", "this is…
ocean800
25 votes
3 answers

Generate bigrams with NLTK

I am trying to produce a bigram list of a given sentence. For example, if I type "To be or not to be", I want the program to generate: to be, be or, or not, not to, to be. I tried the following code, but it just gives me…
Nikhil Raghavendra
25 votes
7 answers

NLTK Named Entity recognition to a Python list

I used NLTK's ne_chunk to extract named entities from a text: my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the…
Zlo
25 votes
5 answers

Determining tense of a sentence Python

Following several other posts, [e.g. Detect English verb tenses using NLTK , Identifying verb tenses in python, Python NLTK figure out tense ] I wrote the following code to determine tense of a sentence in Python using POS tagging: from nltk import…
kyrenia
25 votes
4 answers

Python NLTK: Bigrams trigrams fourgrams

I have this example and I want to know how to get this result. I have text, I tokenize it, and then I collect the bigrams, trigrams, and fourgrams like this: import nltk from nltk import word_tokenize from nltk.util import ngrams text = "Hi How are…
M.A.Hassan
25 votes
4 answers

How to navigate a nltk.tree.Tree?

I've chunked a sentence using: grammar = ''' NP: …
Roy Smith
25 votes
4 answers

Tokenization of Arabic words using NLTK

I'm using NLTK word_tokenizer to split a sentence into words. I want to tokenize this sentence: في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء The code I'm writing is: import re import nltk lex = u"…
Hady Elsahar
24 votes
1 answer

pronoun resolution backwards

The usual coreference resolution works in the following way: given "The man likes math. He really does.", it figures out that "he" refers to "the man". There are plenty of tools to do this. However, is there a way to do it backwards? For…
ytrewq
24 votes
1 answer

Combining text stemming and removal of punctuation in NLTK and scikit-learn

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization. Below is an example of the plain usage of the CountVectorizer: from sklearn.feature_extraction.text import CountVectorizer vocab = ['The…
user2489252
24 votes
3 answers

Implementing Bag-of-Words Naive-Bayes classifier in NLTK

I basically have the same question as this guy. The example in the NLTK book for the Naive Bayes classifier considers only whether a word occurs in a document as a feature. It doesn't consider the frequency of the words as the feature to look at…
Ben G
23 votes
3 answers

Are there any classes in NLTK for text normalizing and canonizing?

The bulk of NLTK documentation and examples is devoted to lemmatization and stemming, but is very sparse on such matters of normalization as: converting all letters to lower or upper case, removing punctuation, converting numbers into…
soshial
23 votes
10 answers

Adding words to nltk stoplist

I have some code that removes stop words from my data set. As the stop list doesn't seem to remove a majority of the words I would like it to, I'm looking to add words to this stop list so that it will remove them for this case. The code I'm using…
Alex
23 votes
7 answers

Efficient Context-Free Grammar parser, preferably Python-friendly

I need to parse a small subset of English for one of my projects, described as a context-free grammar with (1-level) feature structures (example), and I need to do it efficiently. Right now I'm using NLTK's parser which produces the right…
Max Shawabkeh