Questions tagged [nltk]

The Natural Language Toolkit (NLTK) is a Python library for computational linguistics. It is currently available for Python versions 2.7 or 3.2+.

NLTK includes many common natural language processing tools, including a tokenizer, a chunker, a part-of-speech (POS) tagger, a stemmer, a lemmatizer, and various classifiers such as Naive Bayes and decision trees. In addition to these tools, NLTK provides access to many common corpora, including the Brown Corpus, Reuters, and WordNet. The NLTK corpora collection also includes a few non-English corpora in Portuguese, Polish, and Spanish.
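As a rough illustration of a few of the tools listed above, here is a minimal sketch that tokenizes a sentence, stems the tokens, and counts token frequencies. The sample text is made up for this example, and the components (`wordpunct_tokenize`, `PorterStemmer`, `FreqDist`) were picked on the assumption that they work without downloading any extra corpora or models; the taggers, lemmatizer, and corpora generally need data installed first.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample text, used only for illustration.
text = "NLTK includes a tokenizer, a stemmer, and classifiers."

# Regex-based tokenizer: splits the text into runs of word characters
# and runs of punctuation, so no trained model is required.
tokens = wordpunct_tokenize(text)

# Porter stemmer: strips common English suffixes from each token.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Frequency distribution over the raw tokens.
counts = FreqDist(tokens)

print(tokens)
print(stems)
print(counts.most_common(3))
```

Stemming is purely rule-based, so the stems are not always dictionary words; a lemmatizer such as the WordNet one is the usual alternative when real words are needed, at the cost of requiring the WordNet data.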

The book "Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit" by Steven Bird, Ewan Klein, and Edward Loper is freely available online under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US Licence. A citable paper, "NLTK: The Natural Language Toolkit", was first published in 2003 and again in 2006 so that researchers can acknowledge NLTK's contribution to ongoing research in computational linguistics.

NLTK is currently distributed under the Apache 2.0 licence.

7139 questions
65
votes
12 answers

Spell Checker for Python

I'm fairly new to Python and NLTK. I am busy with an application that can perform spell checks (replaces an incorrectly spelled word with the correct one). I'm currently using the Enchant library on Python 2.7, PyEnchant and the NLTK library. The…
Mike Barnes
  • 4,217
  • 18
  • 40
  • 64
61
votes
4 answers

str.translate gives TypeError - Translate takes one argument (2 given), worked in Python 2

I have the following code import nltk, os, json, csv, string, cPickle from scipy.stats import scoreatpercentile lmtzr = nltk.stem.wordnet.WordNetLemmatizer() def sanitize(wordList): answer = [word.translate(None, string.punctuation) for word in…
carebear
  • 751
  • 2
  • 8
  • 16
57
votes
4 answers

Programmatically install NLTK corpora / models, i.e. without the GUI downloader?

My project uses the NLTK. How can I list the project's corpus & model requirements so they can be automatically installed? I don't want to click through the nltk.download() GUI, installing packages one by one. Also, any way to freeze that same list…
Bluu
  • 5,226
  • 4
  • 29
  • 34
54
votes
4 answers

Counting the Frequency of words in a pandas data frame

I have a table like below: URN Firm_Name 0 104472 R.X. Yah & Co 1 104873 Big Building Society 2 109986 St James's Society 3 114058 The Kensington Society Ltd 4 113438 MMV Oil…
J Reza
  • 579
  • 1
  • 4
  • 5
54
votes
7 answers

Improving the extraction of human names with nltk

I am trying to extract human names from text. Does anyone have a method that they would recommend? This is what I tried (code is below): I am using nltk to find everything marked as a person and then generating a list of all the NNP parts of that…
e h
  • 8,435
  • 7
  • 40
  • 58
52
votes
14 answers

NLTK Lookup Error

While running a Python script using NLTK I got this: Traceback (most recent call last): File "cpicklesave.py", line 56, in pos = nltk.pos_tag(words) File "/usr/lib/python2.7/site-packages/nltk/tag/__init__.py", line 110, in pos_tag …
Shiv Shankar
  • 1,007
  • 2
  • 8
  • 13
52
votes
2 answers

BeautifulSoup4 get_text still has javascript

I'm trying to remove all the html/javascript using bs4, however, it doesn't get rid of javascript. I still see it there with the text. How can I get around this? I tried using nltk which works fine however, clean_html and clean_url will be removed…
KVISH
  • 12,923
  • 17
  • 86
  • 162
51
votes
5 answers

tag generation from a text content

I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools. Additionally, I would be grateful if you could point to any Python based solution / library…
Hellnar
  • 62,315
  • 79
  • 204
  • 279
50
votes
3 answers

Tokenize a paragraph into sentence and then into words in NLTK

I am trying to input an entire paragraph into my word processor to be split into sentences first and then into words. I tried the following code but it does not work, #text is the paragraph input sent_text = sent_tokenize(text) …
Nikhil Raghavendra
  • 1,570
  • 5
  • 18
  • 25
50
votes
3 answers

Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

I am working on keyword extraction problem. Consider the very general case from sklearn.feature_extraction.text import TfidfVectorizer tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english') t = """Two Travellers, walking in the noonday…
AbtPst
  • 7,778
  • 17
  • 91
  • 172
50
votes
6 answers

NLTK Named Entity Recognition with Custom Data

I'm trying to extract named entities from my text using NLTK. I find that NLTK NER is not very accurate for my purpose and I want to add some more tags of my own as well. I've been trying to find a way to train my own NER, but I don't seem to be…
user1502248
  • 501
  • 1
  • 4
  • 3
49
votes
3 answers

Save Naive Bayes Trained Classifier in NLTK

I'm slightly confused in regard to how I save a trained classifier. As in, re-training a classifier each time I want to use it is obviously really bad and slow, how do I save it and then load it again when I need it? Code is below, thanks in advance…
user179169
48
votes
5 answers

How to create a word cloud from a corpus in Python?

From Creating a subset of words from a corpus in R, the answerer can easily convert a term-document matrix into a word cloud. Is there a similar function from Python libraries that takes either a raw word textfile or NLTK corpus or Gensim…
alvas
  • 115,346
  • 109
  • 446
  • 738
46
votes
4 answers

Using NLTK and WordNet; how do I convert simple tense verb into its present, past or past participle form?

Using NLTK and WordNet, how do I convert simple tense verb into its present, past or past participle form? For example: I want to write a function which would give me verb in expected form as follows. v = 'go' present = present_tense(v) print…
Software Enthusiastic
  • 25,147
  • 16
  • 58
  • 68
46
votes
5 answers

Docker NLTK Download

I am building a docker container using the following Dockerfile: FROM ubuntu:14.04 RUN apt-get update RUN apt-get install -y python python-dev python-pip ADD . /app RUN apt-get install -y python-scipy RUN pip install -r…
GNMO11
  • 2,099
  • 4
  • 19
  • 28