Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches.

NOTE: If your question does not directly concern implementation, consider posting it on Data Science or Artificial Intelligence instead; otherwise it is probably off-topic here. Please choose one site only and do not cross-post; see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

NLP tasks

Text pre-processing
Coreference resolution
Dependency parsing [parse-tree]
Document summarization [summarization]
Named entity recognition (NER) [named-entity-recognition]
Information extraction (IE) [information-retrieval] [information-extraction]
Language modeling
Part-of-speech (POS) tagging [part-of-speech]
Morphological analysis and wordform generation
Phrase-structure (constituency) parsing [parse-tree]
Machine translation (MT) [machine-translation]
Question answering (QA) [nlp-question-answering]
Sentiment analysis [sentiment-analysis]
Semantic parsing [semantic-analysis]
Text categorization [text-classification] [document-classification]
Textual entailment detection
Topic modeling [topic-modeling]
Word sense disambiguation (WSD) [word-sense-disambiguation]

Beginner books on Natural Language Processing

Popular software packages

General purpose toolkits
- Natural Language Toolkit (NLTK) (Python) [nltk]
- OpenNLP (Java) [opennlp]
- Sharp NLP (.NET) [sharpnlp]
- ClearNLP (Java) [clearnlp]
- Mate (Java)
- Stanford CoreNLP (Java) [stanford-nlp]
- Treat (Ruby)
- Mallet (Java) [mallet]
- spaCy (Python) [spacy]
- Pattern (Python) [python-pattern]

Phrase structure parsers
- Stanford Parser (Java) [stanford-nlp]
- Berkeley Parser (Java)
- BLLIP (Charniak-Johnson) Parser (C++, Python) [charniak-parser]

Dependency parsers
- Stanford Dependencies (packaged with Stanford parser) (Java) [stanford-nlp]
- MaltParser (Java)
- MSTParser (Java)
- UDPipe

Proofreading software
- LanguageTool (Java) [languagetool]

20185 questions

1 answer

scikit-learn, add features to a vectorized set of documents

I am starting with scikit-learn and I am trying to transform a set of documents into a format on which I could apply clustering and classification. I have seen the details about the vectorization methods, and the tfidf transformations to load the…

python machine-learning nlp scikit-learn

asked Mar 06 '13 at 20:47

Mortimer

3 answers

R remove stopwords from a character vector using %in%

I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword…

r nlp subset tm stop-words

asked Mar 06 '13 at 17:15

screechOwl

3 answers

How to calculate readability in R with the tm package

Is there a pre-built function for this in the tm library, or one that plays nicely with it? My current corpus is loaded into tm, something like as follows: s1 <- "This is a long, informative document with real words and sentence structure: …

r nlp tm

asked Feb 13 '13 at 16:31

Mittenchops

1 answer

detect allusions (e.g. very fuzzy matches) in language of inaugural addresses

I'm trying to develop a Python script to examine every sentence in Barack Obama's second inaugural address and find similar sentences in past inaugurals. I've developed a very crude fuzzy match, and I'm hoping to improve it. I start by reducing all…

python text nlp nltk

asked Jan 23 '13 at 23:29

Chris Wilson
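
One way to picture the crude fuzzy match this question describes is a character-level similarity ratio over normalized sentences. The sketch below uses Python's standard `difflib.SequenceMatcher`; the helper names `normalize` and `most_similar` and the sample sentences are hypothetical, and nothing here is taken from the asker's script.

```python
import difflib
import re


def normalize(sentence):
    """Lowercase and drop punctuation so matching ignores surface noise."""
    return " ".join(re.findall(r"[a-z']+", sentence.lower()))


def most_similar(query, candidates, top_n=3):
    """Return up to top_n (score, candidate) pairs, best match first."""
    q = normalize(query)
    scored = [
        (difflib.SequenceMatcher(None, q, normalize(c)).ratio(), c)
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]


# Illustrative sentences only, not real inaugural text.
past = [
    "the cat sat on the mat",
    "a dog barked at the moon",
    "we must act together as one people",
]
results = most_similar("We must act, together, as one people.", past)
```

A ratio of this kind rewards shared character runs, so it catches near-quotations but not paraphrases; word-level or embedding-based measures would be a separate technique.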

2 answers

NLTK makes it easy to compute bigrams of words. What about letters?

I've seen tons of documentation all over the web about how the python NLTK makes it easy to compute bigrams of words. What about letters? What I want to do is plug in a dictionary and have it tell me the relative frequencies of different letter…

python nlp nltk n-gram

asked Jan 05 '13 at 04:33

isthmuses
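
As a rough illustration of what this question asks for, letter bigrams can be counted with the standard library alone. This is a minimal sketch, assuming the "dictionary" is a plain list of words; the function name `letter_bigram_frequencies` is made up for the example and it does not use NLTK.

```python
from collections import Counter


def letter_bigram_frequencies(words):
    """Map each two-letter sequence to its relative frequency across words."""
    counts = Counter()
    for word in words:
        w = word.lower()
        # Slide a window of width 2 over the word; skip pairs with non-letters.
        counts.update(
            w[i:i + 2] for i in range(len(w) - 1) if w[i:i + 2].isalpha()
        )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


freqs = letter_bigram_frequencies(["hello", "help"])
```

Bigrams are counted within words only, so no pair spans a word boundary.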

1 answer

How to efficiently compute similarity between documents in a stream of documents

I gather text documents (in Node.js) where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of…

node.js stream nlp cosine-similarity term-document-matrix

asked Dec 21 '12 at 08:17

Alexandre Kaspar
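
The idea behind this question can be sketched as follows: keep a term-count vector and its norm for every document seen so far, and compare each new arrival against them. The sketch is in Python rather than Node.js, the class name `StreamingSimilarity` is hypothetical, and it uses raw term counts with a linear scan over earlier documents, which is a simplification rather than an efficient production design.

```python
import math
from collections import Counter


class StreamingSimilarity:
    """Stores term-count vectors and compares each new document to old ones."""

    def __init__(self):
        self.docs = []  # list of (term counts, Euclidean norm)

    def add(self, words):
        """Add a document (list of words); return cosine similarities to
        every earlier document, in arrival order."""
        vec = Counter(words)
        norm = math.sqrt(sum(c * c for c in vec.values()))
        sims = []
        for other, other_norm in self.docs:
            # Iterate over the smaller vector to keep the dot product cheap.
            small, large = (vec, other) if len(vec) <= len(other) else (other, vec)
            dot = sum(c * large[t] for t, c in small.items() if t in large)
            sims.append(dot / (norm * other_norm) if norm and other_norm else 0.0)
        self.docs.append((vec, norm))
        return sims


stream = StreamingSimilarity()
first = stream.add(["a", "b"])
second = stream.add(["a", "b"])
third = stream.add(["c"])
```

Caching the norm means each comparison only costs one sparse dot product; an inverted index from term to documents would avoid touching documents that share no terms.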

3 answers

Square brackets applied to "self" in Python

I've come across some code where square brackets are used on "self". I'm not familiar with this notation and as I'm trying to get my head around source code not written by me, it makes it difficult to understand what sort of object is being dealt…

python nlp

asked Dec 17 '12 at 21:58

user1002973
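
For readers unfamiliar with the notation in this question: in Python, `self[key]` inside a method is ordinary indexing and calls the object's own `__getitem__` (and `self[key] = value` calls `__setitem__`). The small class below is a hypothetical illustration, not the code the asker was reading.

```python
class Vocabulary:
    """Maps words to integer ids and supports square-bracket access."""

    def __init__(self):
        self._ids = {}

    def __getitem__(self, word):
        # Invoked by vocab[word] and by self[word] inside methods.
        return self._ids[word]

    def __setitem__(self, word, idx):
        # Invoked by vocab[word] = idx and by self[word] = idx.
        self._ids[word] = idx

    def add(self, word):
        """Assign the next free id to an unseen word and return its id."""
        if word not in self._ids:
            self[word] = len(self._ids)  # calls __setitem__
        return self[word]  # calls __getitem__


vocab = Vocabulary()
cat_id = vocab.add("cat")
dog_id = vocab.add("dog")
```

The same pattern appears when a class subclasses `dict` or `list`: `self[...]` then uses the inherited indexing behaviour.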

3 answers

List of English verbs and their tenses, various forms, etc.

Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)? I imagine this will be useful for NLP systems, but there doesn't seem to be a listing…

nlp

asked Dec 13 '12 at 05:33

kamziro

3 answers

Which is better: GATE or RapidMiner?

I've started to write a simple sentiment analysis tool. Currently I am looking at GATE and RapidMiner, but as a beginner I am not able to concentrate on both. Could someone please tell me which one will be better in terms of usage, learning curve,…

nlp

asked Sep 01 '09 at 07:29

siva

1 answer

Korean language tokenizer

What is the best existing tokenizer for processing Korean? I have tried CJKTokenizer in Solr 4.0. It does tokenize the text, but its accuracy is very low.

localization solr nlp tokenize

asked Nov 20 '12 at 04:25

gangatharan

1 answer

Counting with scipy.sparse

I am using the Python sklearn libraries. I have 150,000+ sentences. I need an array-like object, where each row is for a sentence, each column corresponds to a word, and each element is the number of words in that sentence. For example: If the two…

python nlp scipy sparse-matrix scikit-learn

asked Nov 08 '12 at 17:18

Paul Draper
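
The row-per-sentence, column-per-word layout this question describes can be sketched with the standard library by collecting only the nonzero counts as (row, column, value) triplets. The function name `count_matrix_coo` is hypothetical, tokenization is a plain whitespace split, and the sketch deliberately avoids scipy and sklearn so it runs anywhere.

```python
from collections import Counter


def count_matrix_coo(sentences):
    """Return (vocab, rows, cols, data) describing a sparse count matrix."""
    vocab = {}  # word -> column index, assigned in order of first appearance
    rows, cols, data = [], [], []
    for r, sentence in enumerate(sentences):
        for word, n in Counter(sentence.lower().split()).items():
            c = vocab.setdefault(word, len(vocab))
            rows.append(r)
            cols.append(c)
            data.append(n)
    return vocab, rows, cols, data


vocab, rows, cols, data = count_matrix_coo(["the cat saw the dog", "the dog"])
# With scipy installed, these triplets are the form accepted by
# scipy.sparse.coo_matrix((data, (rows, cols)), shape=(2, len(vocab))).
```

Only nonzero entries are stored, which is what keeps 150,000+ rows manageable when most words are absent from most sentences.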

1 answer

How can I generate parse trees of English sentences on iOS?

I would like to generate constituency-based parsed trees of English sentences within an iOS application. http://en.wikipedia.org/wiki/Parse_tree My current options appear to be: Write my own tree generation on top of POS tagging from…

ios nlp linguistics

asked Nov 07 '12 at 17:16

Giles

2 answers

Horizontal Markovization

I have to implement horizontal markovization (NLP concept) and I'm having a little trouble understanding what the trees will look like. I've been reading the Klein and Manning paper, but they don't explain what the trees with horizontal…

parsing tree nlp context-free-grammar

asked Oct 14 '12 at 16:54

Josh Bradley

6 answers

Processing English Statements

Any recommendations for languages/libraries to convert a sentence like: "X bumped Y, who in turn kicked Z." to X: Bumped Y: Was bumped, kicked Z

nlp

asked Aug 12 '09 at 13:56

lecter

5 answers

Disease named entity recognition

I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here: Primary pulmonary hypertension is a progressive disease in which widespread…

machine-learning nlp medical named-entity-recognition

asked Sep 25 '12 at 08:15

alex
