Questions tagged [nltk]

The Natural Language Toolkit (NLTK) is a Python library for computational linguistics. It runs on Python 2.7 and 3.2+.

NLTK provides a wide range of common natural language processing tools, including a tokenizer, a chunker, a part-of-speech (POS) tagger, a stemmer, a lemmatizer, and various classifiers such as Naive Bayes and decision trees. In addition to these tools, NLTK ships with many common corpora, including the Brown Corpus, Reuters, and WordNet. The corpora collection also includes a few non-English corpora in Portuguese, Polish, and Spanish.

The book Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper is freely available online under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 US license. A citable paper, "NLTK: The Natural Language Toolkit", was published in 2003 and again in 2006 so that researchers can acknowledge NLTK's contribution to ongoing research in computational linguistics.

NLTK is distributed under the Apache License, version 2.0.

7139 questions
35 votes, 1 answer

Create a custom Transformer in PySpark ML

I am new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from nltk? Can I extend the default one?
Niko • 385 • 1 • 3 • 8
34 votes, 3 answers

Large scale machine learning - Python or Java?

I am currently embarking on a project that will involve crawling and processing huge amounts of data (hundreds of gigs), and also mining them for extracting structured data, named entity recognition, deduplication, classification etc. I'm familiar…
jeffreyveon • 13,400 • 18 • 79 • 129
34 votes, 6 answers

FreqDist with NLTK

The Python package nltk has the FreqDist function which gives you the frequency of words within a text. I am trying to pass my text as an argument but the result is of the form: [' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y',…
afg102 • 361 • 2 • 4 • 4
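The single-character output in this question has a simple cause: FreqDist counts the elements of whatever sequence it is given, and iterating a plain string yields characters. Tokenizing into words first fixes it. A minimal sketch (str.split() is used in place of nltk.word_tokenize so that no 'punkt' tokenizer data needs downloading):

```python
from nltk import FreqDist

text = "the cat sat on the mat the cat"

# Passing the raw string counts characters, because iterating a
# string yields characters -- hence output like ' ', 'e', 'a', ...
char_fd = FreqDist(text)

# Split into word tokens first to get word frequencies.
word_fd = FreqDist(text.split())

print(word_fd.most_common(3))  # e.g. [('the', 3), ('cat', 2), ...]
```

FreqDist subclasses collections.Counter, so most_common(), indexing, and arithmetic work the same way.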
34 votes, 2 answers

How is the Vader 'compound' polarity score calculated in Python NLTK?

I'm using the Vader SentimentAnalyzer to obtain the polarity scores. I used the probability scores for positive/negative/neutral before, but I just realized the "compound" score, ranging from -1 (most neg) to 1 (most pos) would provide a single…
alicecongcong • 379 • 2 • 4 • 4
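In short: VADER sums the valence scores of lexicon-matched words (adjusted by rules for negation, punctuation, intensifiers, etc.) and then squashes that sum into (-1, 1). The normalization step can be sketched in pure Python; the formula x / sqrt(x² + α) with α = 15 mirrors what VADER applies to produce the 'compound' score:

```python
import math

def vader_normalize(score, alpha=15):
    """Map an unbounded valence sum into (-1, 1).

    Mirrors the normalization VADER applies to produce the
    'compound' score: x / sqrt(x^2 + alpha), with alpha = 15.
    """
    return score / math.sqrt(score * score + alpha)

# A summed valence of 0 stays neutral; large sums saturate toward +/-1.
print(vader_normalize(0))     # 0.0
print(vader_normalize(4))     # ~0.718
print(vader_normalize(-100))  # close to -1
```

Note the compound score is computed from the raw valence sum, not from the pos/neg/neu proportions, which is why it is reported separately.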
34 votes, 3 answers

Classifying Documents into Categories

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to…
erikcw • 10,787 • 15 • 58 • 75
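One common baseline for this kind of task is NLTK's Naive Bayes classifier, which trains on (feature-dict, label) pairs. A toy sketch with hypothetical category names and a bag-of-words feature extractor (real features would be extracted from the documents in the database):

```python
from nltk.classify import NaiveBayesClassifier

# Toy feature extractor: bag-of-words presence features.
def features(text):
    return {word: True for word in text.lower().split()}

# Tiny stand-in for the tagged documents; labels are hypothetical.
train = [
    (features("stock market shares fell"), "finance"),
    (features("quarterly earnings and shares"), "finance"),
    (features("team wins the championship game"), "sports"),
    (features("player scores in the final game"), "sports"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("shares and earnings rose")))  # finance
```

At 300k documents and 150 categories, a scalable vectorizer plus a linear model is the more usual route, but this shows the NLTK API shape.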
33 votes, 7 answers

NLTK vs Stanford NLP

I have recently started to use the NLTK toolkit for creating a few solutions using Python. I hear a lot of community activity regarding using Stanford NLP. Can anyone tell me the difference between NLTK and Stanford NLP? Are they two different libraries?…
RData • 959 • 1 • 13 • 33
33 votes, 4 answers

What do NN, VBD, IN, DT, NNS, RB mean in NLTK?

When I chunk text, I get lots of codes in the output like NN, VBD, IN, DT, NNS, RB. Is there a list documented somewhere that tells me the meaning of these? I have tried googling "nltk chunk code", "nltk chunk grammar", and "nltk chunk tokens". But I am not…
Knows Not Much • 30,395 • 60 • 197 • 373
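These codes are Penn Treebank part-of-speech tags, which NLTK's default tagger emits. The documented list ships with NLTK itself via nltk.help.upenn_tagset() (after downloading the 'tagsets' resource with nltk.download('tagsets')); the tags from the question decode as follows:

```python
# Meanings of the Penn Treebank tags mentioned in the question.
PENN_TAGS = {
    "NN":  "noun, singular or mass",
    "VBD": "verb, past tense",
    "IN":  "preposition or subordinating conjunction",
    "DT":  "determiner",
    "NNS": "noun, plural",
    "RB":  "adverb",
}

for tag, meaning in PENN_TAGS.items():
    print(f"{tag}: {meaning}")
```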
33 votes, 8 answers

Python can't find module NLTK

I followed these instructions http://www.nltk.org/install.html to install the nltk module on my Mac (10.6). I have installed Python 2.7, but when I open IDLE and type import nltk it gives me this error: Traceback (most recent call last): File…
Foxsquirrel • 373 • 1 • 3 • 8
33 votes, 10 answers

Forming Bigrams of words in list of sentences with Python

I have a list of sentences: text = ['cant railway station','citadel hotel',' police stn']. I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I…
Hypothetical Ninja • 3,920 • 13 • 49 • 75
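The "pair of sentences" symptom comes from calling bigrams on the sentence list itself: its elements are whole strings, so they get paired. Splitting each sentence into words first, and collecting bigrams per sentence so no pair spans two sentences, gives the intended result:

```python
from nltk import bigrams

text = ['cant railway station', 'citadel hotel', 'police stn']

# bigrams(text) would pair whole sentences. Split each sentence
# into words first and collect bigrams sentence by sentence.
result = []
for sentence in text:
    result.extend(bigrams(sentence.split()))

print(result)
# [('cant', 'railway'), ('railway', 'station'),
#  ('citadel', 'hotel'), ('police', 'stn')]
```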
33 votes, 8 answers

Computing N Grams using Python

I needed to compute the Unigrams, BiGrams and Trigrams for a text file containing text like: "Cystic fibrosis affects 30,000 children and young adults in the US alone Inhaling the mists of salt water can reduce the pus and infection that fills the…
gran_profaci • 8,087 • 15 • 66 • 99
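For this, nltk.ngrams generalizes bigrams and trigrams: pass a token sequence and the order n. A minimal sketch on a snippet of the question's text (whitespace split is used for brevity; a real pipeline would tokenize properly):

```python
from nltk import ngrams

tokens = "Cystic fibrosis affects 30,000 children".split()

# ngrams() yields a generator of n-tuples; materialize with list().
unigrams = list(ngrams(tokens, 1))
bigram_list = list(ngrams(tokens, 2))
trigram_list = list(ngrams(tokens, 3))

print(trigram_list[0])  # ('Cystic', 'fibrosis', 'affects')
```

A sequence of N tokens yields N - n + 1 n-grams, so here: 5 unigrams, 4 bigrams, 3 trigrams.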
32 votes, 7 answers

What is NLTK POS tagger asking me to download?

I just started using a part-of-speech tagger, and I am facing many problems. I started POS tagging with the following: import nltk text=nltk.word_tokenize("We are going out.Just you and me.") When I want to print 'text', the following…
Pearl • 759 • 1 • 6 • 7
32 votes, 7 answers

Change the nltk.download() directory from the default ~/nltk_data

I was trying to download/update Python nltk packages on a computing server and it returned this error: [Errno 122] Disk quota exceeded. Specifically: [nltk_data] Downloading package stop words to /home/sh2264/nltk_data... [nltk_data] Error…
shenglih • 879 • 2 • 8 • 18
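Two knobs control this: nltk.download() accepts a download_dir argument, and loaders search the directories in nltk.data.path (which is seeded from the NLTK_DATA environment variable). A sketch, with a hypothetical target directory and the actual download call left commented out to avoid a network fetch:

```python
import os
import nltk

# Hypothetical directory with free quota.
target = os.path.expanduser("~/scratch/nltk_data")

# Option 1: download into the alternate directory explicitly:
# nltk.download('stopwords', download_dir=target)

# Option 2: make NLTK's loaders search that directory too.
nltk.data.path.append(target)

# Exporting NLTK_DATA before Python starts has the same effect
# as appending to nltk.data.path here.
print(target in nltk.data.path)  # True
```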
31 votes, 2 answers

object of type 'generator' has no len()

I have just started to learn python. I want to write a program in NLTK that breaks a text into unigrams, bigrams. For example if the input text is... "I am feeling sad and disappointed due to errors" ... my function should generate text like: I…
Vishal Kharde • 1,553 • 3 • 16 • 34
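The error in this question's title arises because NLTK's n-gram helpers return generators, which support iteration but not len() or indexing. Materializing the generator once with list() resolves it:

```python
from nltk import bigrams

tokens = "I am feeling sad and disappointed due to errors".split()

pairs = bigrams(tokens)  # a generator: len(pairs) raises TypeError
pairs = list(pairs)      # materialize once to use len() or indexing

print(len(pairs))  # 8
print(pairs[0])    # ('I', 'am')
```

Note a generator is exhausted after one pass, so convert it to a list before reusing the results.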
31 votes, 2 answers

How do I test whether an nltk resource is already installed on the machine running my code?

I just started my first NLTK project and am confused about the proper setup. I need several resources like the Punkt Tokenizer and the maxent pos tagger. I myself downloaded them using the GUI nltk.download(). For my collaborators I of course want…
Zakum • 2,157 • 2 • 22 • 30
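The usual answer here is nltk.data.find(), which raises LookupError when a resource is not installed, so a setup script can download only what is missing. A sketch of that check:

```python
import nltk

def ensure_resource(path, package):
    """Download an NLTK resource only if it is not already installed.

    `path` is what nltk.data.find() expects (e.g. 'tokenizers/punkt');
    `package` is the name passed to nltk.download().
    """
    try:
        nltk.data.find(path)
        return True   # already present
    except LookupError:
        nltk.download(package, quiet=True)
        return False  # had to fetch it

# Demonstration without a network call: a resource that certainly
# does not exist raises LookupError.
try:
    nltk.data.find('tokenizers/no_such_resource')
    found = True
except LookupError:
    found = False
print(found)  # False
```

Collaborators can then run ensure_resource('tokenizers/punkt', 'punkt') at startup instead of using the GUI downloader.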
31 votes, 3 answers

Topic distribution: how do we see which document belongs to which topic after doing LDA in Python?

I am able to run the LDA code from gensim and got the top 10 topics with their respective keywords. Now I would like to go a step further to see how accurate the LDA algo is by seeing which document they cluster into each topic. Is this possible in…
jxn • 7,685 • 28 • 90 • 172