Questions tagged [nltk]

The Natural Language Toolkit is a Python library for computational linguistics.

The Natural Language Toolkit (NLTK) is a Python library for computational linguistics. It is currently available for Python versions 2.7 and 3.2+.

NLTK includes a great number of common natural language processing tools, including a tokenizer, a chunker, a part-of-speech (POS) tagger, a stemmer, a lemmatizer, and various classifiers such as Naive Bayes and decision trees. In addition to these tools, NLTK bundles many common corpora, including the Brown Corpus, Reuters, and WordNet. The corpora collection also includes a few non-English corpora in Portuguese, Polish, and Spanish.
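
A minimal sketch of that toolchain, assuming the required models and corpora have been fetched with nltk.download:

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # One-time resource downloads
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('wordnet')

    text = "The striped bats were hanging on their feet."
    tokens = nltk.word_tokenize(text)                          # tokenizer
    print(nltk.pos_tag(tokens))                                # POS tagger
    print([PorterStemmer().stem(t) for t in tokens])           # stemmer
    print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # lemmatizer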

The book Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper is freely available online under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US license. A citable paper, "NLTK: The Natural Language Toolkit", was first published in 2003 and again in 2006, allowing researchers to acknowledge NLTK's contribution to ongoing research in computational linguistics.

NLTK is currently distributed under the Apache License, version 2.0.

7139 questions
2
votes
1 answer

How to use Python to assemble properly formulated sentences from random lines of text

Let's say I have a database that contains 200,000 lines of poetry, and I want to randomly combine those lines in ways that generate grammatically correct and legible 3-line poems. Is there a way to do that? I'm currently experimenting with…
RobB
  • 43
  • 4
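
A sketch of one naive approach to the question above, assuming POS-tag heuristics are acceptable: sample random lines and reject combinations whose lines end in a tag that leaves a clause dangling. This filters out some ungrammatical output but guarantees nothing close to real grammaticality:

    import random
    import nltk

    DANGLING = {'IN', 'DT', 'CC', 'TO', 'PRP$'}   # tags that leave a clause hanging

    def looks_complete(line):
        tagged = nltk.pos_tag(nltk.word_tokenize(line))
        return bool(tagged) and tagged[-1][1] not in DANGLING

    def random_poem(lines, n=3, tries=200):
        for _ in range(tries):
            candidate = random.sample(lines, n)
            if all(looks_complete(l) for l in candidate):
                return '\n'.join(candidate)
        return None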
2
votes
1 answer

Find all the variations (or tenses) of a word in Python

I would like to know how you would find all the variations of a word, or the words that are related or very similar to the original word, in Python. An example of the sort of thing I am looking for is like this: word = "summary" # any…
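
For the question above, WordNet's derivationally related forms give morphological relatives of a word; true inflections such as tenses would need a dedicated morphology library (LemmInflect, pattern). A sketch of the WordNet part:

    from nltk.corpus import wordnet as wn

    word = "summary"
    related = {word}
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            related.add(lemma.name())
            for form in lemma.derivationally_related_forms():
                related.add(form.name())   # e.g. 'summarize', 'summarization'
    print(related)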
2
votes
1 answer

In NLTK WordNet, wn.synsets.definition(lang="lang") shows English and Japanese, but not other languages

wn.synsets.definition(lang="lang") shows English and Japanese results, but not other languages. wn.synset('word').lemma_names shows the other languages too, though. Do I need an extra download? Or is there a difference between languages? The documents…
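
As the question observes, the Open Multilingual Wordnet maps lemmas from many languages onto the English synsets, but translated glosses ship only for a few of them, so no extra download will add, say, French definitions. A sketch showing the asymmetry:

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download('wordnet')
    nltk.download('omw-1.4')                 # Open Multilingual Wordnet data

    syn = wn.synset('dog.n.01')
    print(syn.lemma_names(lang='fra'))       # lemmas exist for many languages
    print(syn.definition())                  # glosses ship only for a few
                                             # languages (notably English
                                             # and Japanese)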
2
votes
1 answer

How to efficiently build ngrams based on categories in a dataframe

Problem I have a dataframe that consists of text which belongs to a category. I now want to get the most commonly used n-grams (bigrams in the example) per category. I managed to do this, but the code for this is way too long in my opinion. Sample…
Elodin
  • 386
  • 1
  • 10
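
A compact version of what the question above describes, using pandas groupby with nltk.bigrams (the column names are assumptions):

    from collections import Counter
    import pandas as pd
    import nltk

    df = pd.DataFrame({
        'category': ['a', 'a', 'b'],
        'text': ['the cat sat', 'the cat ran', 'dogs bark loudly'],
    })

    def top_bigrams(texts, n=5):
        counts = Counter()
        for text in texts:
            counts.update(nltk.bigrams(nltk.word_tokenize(text.lower())))
        return counts.most_common(n)

    print(df.groupby('category')['text'].apply(top_bigrams))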
2
votes
1 answer

How to extract noun-based compound words from a sentence using Python?

I'm using nltk via the following code to extract nouns from a sentence: words = nltk.word_tokenize(sentence) tags = nltk.pos_tag(words) And then I choose the words tagged with the NN and NNP Part of Speech (PoS) tags. However, it only extracts…
Alan K
  • 187
  • 2
  • 15
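
For the question above, a chunk grammar over consecutive noun tags catches multi-word compounds that filtering single NN/NNP tags misses. A sketch:

    import nltk

    sentence = "The quick brown fox jumped over the police station fence."
    parser = nltk.RegexpParser("NP: {<NN.*>+}")   # runs of noun tags
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))

    compounds = [' '.join(w for w, t in st.leaves())
                 for st in tree.subtrees(filter=lambda t: t.label() == 'NP')
                 if len(st) > 1]                  # multi-word runs only
    print(compounds)                              # e.g. ['police station fence']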
2
votes
1 answer

WordNet taxonomy construction

I'd like to build a minimum encompassing taxonomic tree for a given set of wordnet synsets. For a set of 2 synsets the tree would be one where they are both children nodes of their lowest common hypernym. For the following set: [{'name':…
Iyar Lin
  • 581
  • 4
  • 13
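
A sketch for the question above, under the simplifying assumption that following each synset's first hypernym reaches the common ancestor (WordNet allows multiple hypernyms, so a robust version has to handle branching):

    from functools import reduce
    from nltk.corpus import wordnet as wn

    synsets = [wn.synset('dog.n.01'), wn.synset('cat.n.01')]

    # Root of the minimal tree: fold the set with lowest_common_hypernyms
    root = reduce(lambda a, b: a.lowest_common_hypernyms(b)[0], synsets)

    edges = set()
    for s in synsets:
        node = s
        while node != root:                    # climb the first hypernym chain
            parent = node.hypernyms()[0]
            edges.add((parent.name(), node.name()))
            node = parent
    print(root.name(), edges)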
2
votes
1 answer

How to pass variables/functions from JavaScript to Python and vice versa?

I am creating a website in HTML, CSS and JavaScript where I require an AI powered chatbot. I have the required python file which consists of the logic for the chatbot (AI, NLTK). Now, in the python file, I have a function named "response()" which…
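
For the question above, the usual pattern is to expose the Python function over HTTP and call it from the browser. A minimal Flask sketch, where chatbot.response is the hypothetical function from the question:

    from flask import Flask, request, jsonify
    from chatbot import response            # hypothetical module from the question

    app = Flask(__name__)

    @app.route('/chat', methods=['POST'])
    def chat():
        message = request.get_json().get('message', '')
        return jsonify({'reply': response(message)})

    if __name__ == '__main__':
        app.run(port=5000)

    # Browser side (JavaScript), for reference:
    # fetch('/chat', {method: 'POST',
    #        headers: {'Content-Type': 'application/json'},
    #        body: JSON.stringify({message: 'hi'})})
    #   .then(r => r.json()).then(d => console.log(d.reply));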
2
votes
1 answer

How can I separate a superscript from its root word using Python?

I am working on a project that requires separating a superscript from its root word so that it can be tokenized as a separate token. If I tokenize "This is a sentence about testString™" the results will be ["this", "is", "a", "sentence",…
alex
  • 31
  • 4
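
One way to get the behaviour the question above asks for: insert a space before trademark- and superscript-class characters so the tokenizer emits them as their own tokens. The character set is an assumption to extend as needed:

    import unicodedata
    import nltk

    MARKS = {'\u2122', '\u00ae'}             # ™, ® (extend as needed)

    def split_marks(text):
        out = []
        for ch in text:
            if ch in MARKS or 'SUPERSCRIPT' in unicodedata.name(ch, ''):
                out.append(' ')              # detach from the root word
            out.append(ch)
        return ''.join(out)

    print(nltk.word_tokenize(split_marks("This is a sentence about testString™")))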
2
votes
0 answers

sklearn.linear_model LogisticRegression classifier training

I am trying to use LogisticRegression classifier for the use case below. corpus = [ { classifier: "appt-count", text: "How many appointments I have for today?" }, { classifier:…
user1578872
  • 7,808
  • 29
  • 108
  • 206
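
The standard scikit-learn recipe for this use case is to vectorize the text and fit the classifier in a single pipeline. A sketch with made-up examples shaped like the question's corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["How many appointments I have for today?",
             "Cancel my appointment for tomorrow",
             "How many meetings are on my calendar?",
             "Please cancel the meeting at noon"]
    labels = ["appt-count", "appt-cancel", "appt-count", "appt-cancel"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["How many appointments tomorrow?"]))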
2
votes
2 answers

Error in NLTK tokenisation (words with two consecutive identical letters get split)

I am facing problems while using nltk.tokenize.words_tokenize in my code. My code is as follows: def clean_str_and_tokenise(line): ''' STEP 1: Remove punctuation marks from the input string and convert the entire string to…
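
One way to avoid this class of bug is to tokenise first and filter punctuation afterwards, rather than editing the raw string before tokenisation. A sketch:

    import string
    import nltk

    def clean_and_tokenise(line):
        tokens = nltk.word_tokenize(line.lower())          # tokenise first
        return [t for t in tokens if t not in string.punctuation]

    print(clean_and_tokenise("Hello, ball!!"))             # ['hello', 'ball']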
2
votes
1 answer

Lemmatizer/PoS-tagger for Italian in Python

I'm searching for a lemmatizer/PoS-tagger for the Italian language that works in Python. I tried spaCy; it works, but it's not very precise: especially for verbs it often returns the wrong lemma. NLTK only has English. I'm searching…
sunhearth
  • 93
  • 1
  • 9
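
One alternative worth trying for the question above is Stanza, which ships an Italian tagger and lemmatizer (it is not an NLTK component, and its accuracy should still be checked against your texts). A sketch:

    import stanza

    stanza.download('it')                    # one-time model download
    nlp = stanza.Pipeline('it', processors='tokenize,pos,lemma')

    for sent in nlp("Ho mangiato una mela.").sentences:
        for word in sent.words:
            print(word.text, word.upos, word.lemma)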
2
votes
1 answer

nltk.lemmatizer doesn't work even for a simple input text

Sorry guys, I'm new to NLP and I'm trying to apply the NLTK lemmatizer to the whole input text; however, it doesn't seem to work even for a simple sentence. from nltk.corpus import stopwords from nltk.tokenize import word_tokenize from nltk.tokenize import…
jsacharz
  • 21
  • 1
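
The usual culprit in cases like the question above is that WordNetLemmatizer.lemmatize treats every word as a noun unless told otherwise, so verbs pass through unchanged. Passing a POS derived from nltk.pos_tag fixes that; a sketch:

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def to_wordnet_pos(treebank_tag):
        # Map Penn Treebank tags onto WordNet's four POS codes
        return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN,
                'R': wordnet.ADV}.get(treebank_tag[0], wordnet.NOUN)

    lemmatizer = WordNetLemmatizer()
    text = "The cats were running quickly"
    print([lemmatizer.lemmatize(word, to_wordnet_pos(tag))
           for word, tag in nltk.pos_tag(nltk.word_tokenize(text))])
    # -> ['The', 'cat', 'be', 'run', 'quickly']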
2
votes
1 answer

WordNet not returning pertainym for "South Korean" even though pertainym exists - Python

I'm trying to do a pertainym search for "South Korean": input = "South Korean.a.01.South Korean" lemma = wn.lemma(input) According to the Princeton WordNet page, this should return "South Korea"... yet in my code I'm getting the error message that…
KCpremo
  • 55
  • 7
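
Multi-word WordNet lemma names use underscores rather than spaces, which is the likely cause of the lookup error in the question above. A sketch that sidesteps the hand-built lemma key (the synset key is an assumption and may vary by WordNet version):

    from nltk.corpus import wordnet as wn

    syn = wn.synset('south_korean.a.01')     # underscore, not a space
    for lemma in syn.lemmas():
        print(lemma.name(), lemma.pertainyms())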
2
votes
2 answers

Remove a certain word from a list of sentences

Is there a way to remove a certain word from a list of sentences if that word appears after a list of words? For example, I want to remove the word "and" if "and" appears after a list of words ([ "red", "blue", "green"]). I know how to remove a word…
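
A plain-Python sketch for the question above, with the trigger list taken from the question:

    TRIGGERS = {"red", "blue", "green"}

    def drop_and_after(sentence):
        words, out = sentence.split(), []
        for i, w in enumerate(words):
            # skip "and" when it directly follows a trigger word
            if w.lower() == "and" and i > 0 and words[i - 1].lower() in TRIGGERS:
                continue
            out.append(w)
        return ' '.join(out)

    print(drop_and_after("I like red and yellow and blue"))
    # -> 'I like red yellow and blue'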
2
votes
0 answers

nltk PunktSentenceTokenizer: tokenize sentences without whitespace in between

Is it possible to make the NLTK PunktSentenceTokenizer split sentences that do not have whitespace between each other? from nltk.tokenize.punkt import PunktSentenceTokenizer sent_tokenizer =…
revy
  • 3,945
  • 7
  • 40
  • 85
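
Punkt assumes whitespace between sentences, so a common workaround for the question above is to re-insert a space wherever a sentence terminator is glued to a capital letter before tokenizing. A sketch (the regex is a heuristic and will over-split after abbreviations):

    import re
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    text = "First sentence.Second sentence!Third one?"
    patched = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)

    print(PunktSentenceTokenizer().tokenize(patched))
    # -> ['First sentence.', 'Second sentence!', 'Third one?']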