Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, consider posting on Data Science or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
6
votes
1 answer

Difference between Python's collections.Counter and nltk.probability.FreqDist

I want to calculate the term-frequencies of words in a text corpus. I've been using NLTK's word_tokenize followed by probability.FreqDist for some time to get this done. The word_tokenize returns a list, which is converted to a frequency…
Prateek Dewan
  • 1,587
  • 3
  • 16
  • 29
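
The counting itself behaves the same either way: FreqDist subclasses collections.Counter in modern NLTK, adding probability and plotting helpers on top. A minimal stdlib-only sketch of the term-frequency step (the token list below is a stand-in for word_tokenize output):

```python
from collections import Counter

# Term frequencies with the standard library alone. nltk.probability.FreqDist
# subclasses Counter, so the counting behaviour is identical; FreqDist just
# layers convenience methods (plot, freq, hapaxes) on top.
tokens = "the quick brown fox jumps over the lazy dog the".split()
freq = Counter(tokens)

print(freq.most_common(1))  # most frequent term first
```

For plain frequency counts either class works; reach for FreqDist only when you need its extra methods.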
6
votes
3 answers

Extract inconsistently formatted date from string (date parsing, NLP)

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that,…
El Yobo
  • 14,823
  • 5
  • 60
  • 78
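
For formats this inconsistent, dateutil's fuzzy parser is a common choice; a stdlib-only fallback is to try a list of candidate strptime formats in order. The format list and its ordering below are assumptions to adapt to the actual filenames:

```python
from datetime import datetime

# Candidate formats tried in order; ambiguous fragments ("08-06" could be
# day-month or month-year) resolve to whichever format matches first, so
# order this list to suit your data.
CANDIDATE_FORMATS = ["%b%y", "%b%Y", "%B %Y", "%d-%m-%y", "%m-%y", "%d%m%y", "%Y"]

def parse_embedded_date(fragment):
    """Return a datetime for the first matching format, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(fragment, fmt)
        except ValueError:
            continue
    return None

print(parse_embedded_date("Aug06"))   # matched by %b%y
print(parse_embedded_date("011004"))  # matched by %d%m%y
```

Incomplete dates ("2006") parse with the missing fields defaulted to January 1, which may or may not be acceptable downstream.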
6
votes
3 answers

NLTK - WordNet: list of long words

I would like to find words in WordNet that are at least 18 characters long. I tried the following code: from nltk.corpus import wordnet as wn sorted(w for w in wn.synset().name() if len(w)>18) I get the following error message: sorted(w for w in…
Cornelius
  • 63
  • 1
  • 4
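
The error comes from calling wn.synset() with no arguments; the usual fix is to iterate the lemma names instead, e.g. sorted(w for w in wn.all_lemma_names() if len(w) >= 18). The filtering itself is plain Python; this sketch runs it over a stand-in word list so it works without the WordNet download:

```python
# Stand-in for wn.all_lemma_names(); with NLTK installed you would iterate
# the WordNet corpus reader directly instead of this hand-picked list.
words = ["cat", "uncharacteristically", "antidisestablishmentarianism", "dog"]

# Keep only words of at least 18 characters, sorted alphabetically.
long_words = sorted(w for w in words if len(w) >= 18)
print(long_words)
```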
6
votes
3 answers

concordance for a phrase using NLTK in Python

Is it possible to get concordance for a phrase in NLTK? import nltk from nltk.corpus import PlaintextCorpusReader corpus_loc = "c://temp//text//" files = ".*\.txt" read_corpus = PlaintextCorpusReader(corpus_loc, files) corpus =…
Naresh MG
  • 633
  • 2
  • 11
  • 19
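
nltk.Text.concordance() indexes single tokens, so a phrase search needs a manual sliding-window scan over the token list. A stdlib sketch (the window size is an arbitrary choice):

```python
def phrase_concordance(tokens, phrase, window=3):
    """Return context strings for every occurrence of a multi-word phrase."""
    phrase_toks = phrase.split()
    n = len(phrase_toks)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_toks:
            left = tokens[max(0, i - window):i]          # up to `window` tokens before
            right = tokens[i + n:i + n + window]         # up to `window` tokens after
            hits.append(" ".join(left + ["[" + phrase + "]"] + right))
    return hits

tokens = "the quick brown fox jumps over the lazy dog".split()
print(phrase_concordance(tokens, "lazy dog"))
```

The same function works on the token stream of a PlaintextCorpusReader via corpus.words().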
6
votes
2 answers

Training Tagger with Custom Tags in NLTK

I have a document with tagged data in the format Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model based on a set of these type of…
Hamman Samuel
  • 2,350
  • 4
  • 30
  • 41
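
Before any tagger can be trained, the bracketed annotations need to become (token, tag) pairs. A regex sketch that converts them to BIO-style tags, assuming tags are uppercase and annotation spans never nest:

```python
import re

# Matches either a bracketed annotation "[TAG some words]" or a plain token.
# Assumes uppercase tag names and no nested brackets.
ANNOT = re.compile(r"\[([A-Z_]+) ([^\]]+)\]|(\S+)")

def to_token_tags(text):
    """Convert '[KEYWORD phone number]' style markup into BIO (token, tag) pairs."""
    pairs = []
    for tag, span, plain in ANNOT.findall(text):
        if plain:
            pairs.append((plain, "O"))
        else:
            toks = span.split()
            pairs.append((toks[0], "B-" + tag))
            pairs.extend((t, "I-" + tag) for t in toks[1:])
    return pairs

print(to_token_tags("my [KEYWORD phone number] is [PHONE 7802708523]"))
```

The resulting sequences can feed nltk's trainable taggers or a CRF toolkit directly.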
6
votes
1 answer

scikit weighted f1 score calculation and usage

I have a question regarding weighted average in sklearn.metrics.f1_score sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted', sample_weight=None) Calculate metrics for each label, and find their average, weighted…
com
  • 2,606
  • 6
  • 29
  • 44
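
With average='weighted', sklearn computes a per-label F1 and averages the scores using each label's support (its count in y_true) as the weight. A stdlib re-implementation makes the arithmetic explicit:

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1, mirroring average='weighted'."""
    labels = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    score = 0.0
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == lab)
        pred = sum(1 for p in y_pred if p == lab)   # predicted count
        true = sum(1 for t in y_true if t == lab)   # support
        prec = tp / pred if pred else 0.0
        rec = tp / true if true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (true / total) * f1                # weight by support
    return score

print(weighted_f1([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

Because labels with more true instances dominate the average, weighted F1 can exceed accuracy and is not between precision and recall the way macro F1 is.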
6
votes
1 answer

Using scikit-learn to train an NLP log-linear model for NER

I wonder how to use sklearn.linear_model.LogisticRegression to train an NLP log-linear model for named-entity recognition (NER). A typical log-linear model defines a conditional probability as follows: with: x: the current word y: the class of…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
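
sklearn leaves feature extraction to you: each token becomes a dict of indicator features, which DictVectorizer can one-hot encode before LogisticRegression fits the log-linear weights. A sketch of the feature function (the particular features chosen here are illustrative, not a fixed recipe):

```python
def token_features(tokens, i):
    """Indicator features for token i, in the dict form DictVectorizer expects."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),                                  # capitalization cue
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",        # left context
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "suffix3": w[-3:].lower(),                                # crude morphology
    }

tokens = "John lives in Berlin".split()
print(token_features(tokens, 3))
```

From there the usual pipeline is DictVectorizer followed by LogisticRegression, one feature dict per token and one class label (entity tag) per token.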
6
votes
1 answer

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect =…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
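
CountVectorizer's default token_pattern discards punctuation during tokenization; passing a custom pattern such as token_pattern=r"\w+|[^\w\s]" keeps each punctuation mark as its own token. The same regex can be tried standalone:

```python
import re

# \w+ grabs word characters; [^\w\s] grabs any single non-word, non-space
# character, so each punctuation mark becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

text = "I really like python, it's pretty awesome."
print(TOKEN_RE.findall(text))
```

With that pattern, CountVectorizer(ngram_range=(4, 4), token_pattern=r"\w+|[^\w\s]") builds 4-grams in which commas, apostrophes, and periods count as tokens.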
6
votes
1 answer

Sentence tokenization for texts that contains quotes

Code: from nltk.tokenize import sent_tokenize pprint(sent_tokenize(unidecode(text))) Output: [After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat…
Abhishek Bhatia
  • 9,404
  • 26
  • 87
  • 142
6
votes
1 answer

How to Traverse an NLTK Tree object?

Given a bracketed parse, I could convert it into a Tree object in NLTK as such: >>> from nltk.tree import Tree >>> s = '(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))' >>>…
alvas
  • 115,346
  • 109
  • 446
  • 738
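
nltk.tree.Tree subclasses list, so traversal is ordinary recursion: iterate a node's children, recursing when a child is itself a Tree and stopping at string leaves (with a real Tree you would test isinstance(child, Tree) and read tree.label()). This sketch mirrors that shape with (label, children) tuples so it runs without NLTK:

```python
def traverse(node, depth=0):
    """Depth-first traversal; returns one indented line per node or leaf."""
    if isinstance(node, str):             # leaf token
        return ["  " * depth + node]
    label, children = node                # interior node: (label, children)
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(traverse(child, depth + 1))
    return lines

tree = ("S", [("NP", ["Europe"]), ("VP", [("VBZ", ["is"]), ("PP", [("IN", ["in"])])])])
print("\n".join(traverse(tree)))
```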
6
votes
2 answers

What can I do to speed up Stanford CoreNLP (dcoref/ner)?

I'm processing a large amount of documents using Stanford's CoreNLP library alongside the Stanford CoreNLP Python Wrapper. I'm using the following annotators: tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref along with the…
Ayrton Massey
  • 471
  • 3
  • 13
6
votes
6 answers

how to create exclamations for a particular sentence

I would like to create exclamations for a particular sentence using a Java API. e.g. It's surprising == Isn't it surprising! e.g. It's cold == Isn't it cold! Are there any vendors or tools which help you generate exclamations, provided you give…
user339108
  • 12,613
  • 33
  • 81
  • 112
6
votes
2 answers

Language detection API/Library

Is there a service/library (free or paid) that takes a piece of text and returns its language? I need to go over a million blog posts and determine their languages.
J. Dorian
  • 109
  • 3
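
Libraries such as langdetect or langid handle this at scale. To make the underlying idea concrete, a crude stdlib baseline scores text against per-language stopword profiles; the tiny word sets below are illustrative stand-ins, not real profiles:

```python
# Minimal per-language stopword profiles (illustrative only; real detectors
# use character n-gram statistics over much larger training data).
PROFILES = {
    "en": {"the", "and", "is", "of", "to", "it"},
    "fr": {"le", "la", "et", "est", "de", "il"},
    "de": {"der", "die", "und", "ist", "von", "es"},
}

def guess_language(text):
    """Pick the language whose stopword set overlaps the text the most."""
    words = set(text.lower().split())
    return max(PROFILES, key=lambda lang: len(words & PROFILES[lang]))

print(guess_language("the cat is on the mat"))
print(guess_language("le chat est sur le tapis"))
```

For a million blog posts, a proper library (or an offline model like fastText's language identifier) will be both faster and far more accurate than any stopword heuristic.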
6
votes
6 answers

Detecting similar paragraphs in two documents

I am trying to find similar paragraphs in 2 documents. Each document has many paragraphs of multiple lines of text. The text in paragraphs has some changes. The words can be inserted or deleted or misspelled. For example Doc1.Para This is one line…
uzair_syed
  • 313
  • 3
  • 16
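
difflib.SequenceMatcher from the standard library tolerates insertions, deletions, and small misspellings, so an all-pairs comparison is a reasonable baseline; the 0.7 threshold below is an assumption to tune on real documents:

```python
from difflib import SequenceMatcher

def similar_paragraphs(doc1_paras, doc2_paras, threshold=0.7):
    """Return (i, j, ratio) for paragraph pairs whose similarity >= threshold."""
    matches = []
    for i, p1 in enumerate(doc1_paras):
        for j, p2 in enumerate(doc2_paras):
            ratio = SequenceMatcher(None, p1, p2).ratio()
            if ratio >= threshold:
                matches.append((i, j, round(ratio, 2)))
    return matches

doc1 = ["This is one line of text.", "A totally different paragraph."]
doc2 = ["This is one lin of text!", "Nothing alike here at all."]
print(similar_paragraphs(doc1, doc2))
```

All-pairs comparison is O(n*m); for large documents, a first pass with cheaper signals (shingling, TF-IDF cosine) can prune candidate pairs before the character-level check.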
6
votes
1 answer

Only ignore stop words for ngram_range=1

I am using CountVectorizer from sklearn...looking to provide a list of stop words and apply the count vectorizer for ngram_range of (1,3). From what I can tell, if a word - say "me" - is in the list of stop words, then it doesn't get seen for…
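
This is expected: CountVectorizer removes stop words from the token stream before n-grams are assembled, so a stopped word vanishes from every n-gram length. A stdlib sketch of that pipeline order (to keep stop-word filtering for unigrams only, one workaround is two vectorizers, with and without stop_words, combined via FeatureUnion):

```python
def ngrams_after_stopwords(text, stop_words, n):
    """Mimic CountVectorizer's order: filter stop words first, then build n-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# With "me" stopped, the bigrams bridge the gap: "let know" instead of
# "let me" / "me know".
print(ngrams_after_stopwords("let me know when", {"me"}, 2))
```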