Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Data Science or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
6
votes
1 answer

scikit-learn, add features to a vectorized set of documents

I am starting with scikit-learn and I am trying to transform a set of documents into a format on which I could apply clustering and classification. I have seen the details about the vectorization methods, and the tfidf transformations to load the…
Mortimer
  • 2,966
  • 23
  • 24
6
votes
3 answers

R remove stopwords from a character vector using %in%

I have a data frame with strings that I'd like to remove stop words from. I'm trying to avoid using the tm package as it's a large data set and tm seems to run a bit slowly. I am using the tm stopword…
screechOwl
  • 27,310
  • 61
  • 158
  • 267
6
votes
3 answers

How to calculate readability in R with the tm package

Is there a pre-built function for this in the tm library, or one that plays nicely with it? My current corpus is loaded into tm, something like as follows: s1 <- "This is a long, informative document with real words and sentence structure: …
Mittenchops
  • 18,633
  • 33
  • 128
  • 246
6
votes
1 answer

detect allusions (e.g. very fuzzy matches) in language of inaugural addresses

I'm trying to develop a Python script to examine every sentence in Barack Obama's second inaugural address and find similar sentences in past inaugurals. I've developed a very crude fuzzy match, and I'm hoping to improve it. I start by reducing all…
Chris Wilson
  • 6,599
  • 8
  • 35
  • 71
6
votes
2 answers

NLTK makes it easy to compute bigrams of words. What about letters?

I've seen tons of documentation all over the web about how the python NLTK makes it easy to compute bigrams of words. What about letters? What I want to do is plug in a dictionary and have it tell me the relative frequencies of different letter…
isthmuses
  • 1,316
  • 1
  • 17
  • 27
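As a rough illustration of the letter-bigram counting this question asks about, here is a minimal sketch that uses only the Python standard library rather than NLTK. The function name `letter_bigram_frequencies` and the sample word list are hypothetical, and the sketch assumes bigrams should not span word boundaries.

```python
from collections import Counter


def letter_bigram_frequencies(words):
    """Return the relative frequency of each adjacent letter pair.

    Assumption: bigrams are counted inside each word only, never across words.
    """
    counts = Counter()
    for word in words:
        letters = word.lower()
        # zip pairs every letter with its successor, e.g. "ban" -> "ba", "an"
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


# Hypothetical word list standing in for the dictionary mentioned in the question
freqs = letter_bigram_frequencies(["banana", "bandana"])
```

The same counts could presumably be produced with NLTK by treating each word as a sequence of characters, but the standard-library version keeps the example free of dependencies.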
6
votes
1 answer

How to efficiently compute similarity between documents in a stream of documents

I gather text documents (in Node.js) where one document i is represented as a list of words. What is an efficient way to compute the similarity between these documents, taking into account that new documents are coming as a sort of stream of…
6
votes
3 answers

Square brackets applied to "self" in Python

I've come across some code where square brackets are used on "self". I'm not familiar with this notation and as I'm trying to get my head around source code not written by me, it makes it difficult to understand what sort of object is being dealt…
user1002973
  • 2,088
  • 6
  • 22
  • 31
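For readers puzzled by the same notation: in Python, `self[key]` is ordinary subscription, so it works whenever the object's class defines `__getitem__` (and `__setitem__` for assignment). The sketch below is a hypothetical `Vocabulary` class, not the code from the question, and only illustrates the mechanism.

```python
class Vocabulary:
    """Hypothetical container mapping words to integer ids."""

    def __init__(self):
        self._ids = {}

    def __getitem__(self, word):
        # Invoked for obj[word], including self[word] inside a method
        return self._ids[word]

    def __setitem__(self, word, idx):
        # Invoked for obj[word] = idx, including self[word] = idx
        self._ids[word] = idx

    def add(self, word):
        if word not in self._ids:
            self[word] = len(self._ids)  # equivalent to self.__setitem__(word, ...)
        return self[word]                # equivalent to self.__getitem__(word)


vocab = Vocabulary()
first_id = vocab.add("cat")
second_id = vocab.add("dog")
```

The same syntax also appears when a class subclasses `dict` or `list`, in which case `self[...]` simply indexes the object itself.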
6
votes
3 answers

List of English verbs and their tenses, various forms, etc.

Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)? I imagine this will be useful for NLP systems, but there doesn't seem to be a listing…
kamziro
  • 7,882
  • 9
  • 55
  • 78
6
votes
3 answers

which is better... GATE or RapidMiner

I've started to write a simple sentiment analysis tool. Currently I am looking at GATE and RapidMiner but, being a beginner, I am not able to concentrate on both. Could someone please tell me which one will be better in terms of usage, learning curve,…
siva
  • 1,105
  • 4
  • 19
  • 38
6
votes
1 answer

korean language tokenizer

What is the best existing tokenizer for processing the Korean language? I have tried CJKTokenizer in Solr 4.0. It does tokenize the text, but the accuracy is very low.
gangatharan
  • 781
  • 1
  • 12
  • 28
6
votes
1 answer

Counting with scipy.sparse

I am using the Python sklearn libraries. I have 150,000+ sentences. I need an array-like object, where each row is for a sentence, each column corresponds to a word, and each element is the number of times that word occurs in that sentence. For example: If the two…
Paul Draper
  • 78,542
  • 46
  • 206
  • 285
6
votes
1 answer

How can I generate parse trees of English sentences on iOS?

I would like to generate constituency-based parsed trees of English sentences within an iOS application. http://en.wikipedia.org/wiki/Parse_tree My current options appear to be: Write my own tree generation on top of POS tagging from…
Giles
  • 1,428
  • 11
  • 21
6
votes
2 answers

Horizontal Markovization

I have to implement horizontal markovization (NLP concept) and I'm having a little trouble understanding what the trees will look like. I've been reading the Klein and Manning paper, but they don't explain what the trees with horizontal…
Josh Bradley
  • 4,630
  • 13
  • 54
  • 79
6
votes
6 answers

Processing English Statements

Any recommendations for languages/libraries to convert a sentence like: "X bumped Y, who in turn kicked Z." to X: Bumped Y: Was bumped, kicked Z
lecter
6
votes
5 answers

Disease named entity recognition

I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here: Primary pulmonary hypertension is a progressive disease in which widespread…
alex
  • 833
  • 4
  • 12
  • 21