Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, consider posting on Data Science or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
6
votes
1 answer

Difference between Python's collections.Counter and nltk.probability.FreqDist

I want to calculate the term-frequencies of words in a text corpus. I've been using NLTK's word_tokenize followed by probability.FreqDist for some time to get this done. The word_tokenize returns a list, which is converted to a frequency…
Prateek Dewan
  • 1,587
  • 3
  • 16
  • 29
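
The counting itself behaves the same either way: FreqDist subclasses collections.Counter in modern NLTK, adding probability and plotting helpers on top. A minimal stdlib-only sketch of the term-frequency step (the token list below is a stand-in for word_tokenize output):

```python
from collections import Counter

# Term frequencies with the standard library alone. nltk.probability.FreqDist
# subclasses Counter, so the counting behaviour is identical; FreqDist just
# layers convenience methods (plot, freq, hapaxes) on top.
tokens = "the quick brown fox jumps over the lazy dog the".split()
freq = Counter(tokens)

print(freq.most_common(1))  # most frequent term first
```

For plain frequency counts either class works; reach for FreqDist only when you need its extra methods.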
6
votes
3 answers

Extract inconsistently formatted date from string (date parsing, NLP)

I have a large list of files, some of which have dates embedded in the filename. The format of the dates is inconsistent and often incomplete, e.g. "Aug06", "Aug2006", "August 2006", "08-06", "01-08-06", "2006", "011004" etc. In addition to that,…
El Yobo
  • 14,823
  • 5
  • 60
  • 78
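
For formats this inconsistent, dateutil's fuzzy parser is a common choice; a stdlib-only fallback is to try a list of candidate strptime formats in order. The format list and its ordering below are assumptions to adapt to the actual filenames:

```python
from datetime import datetime

# Candidate formats tried in order; ambiguous fragments ("08-06" could be
# day-month or month-year) resolve to whichever format matches first, so
# order this list to suit your data.
CANDIDATE_FORMATS = ["%b%y", "%b%Y", "%B %Y", "%d-%m-%y", "%m-%y", "%d%m%y", "%Y"]

def parse_embedded_date(fragment):
    """Return a datetime for the first matching format, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(fragment, fmt)
        except ValueError:
            continue
    return None

print(parse_embedded_date("Aug06"))   # matched by %b%y
print(parse_embedded_date("011004"))  # matched by %d%m%y
```

Incomplete dates ("2006") parse with the missing fields defaulted to January 1, which may or may not be acceptable downstream.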
6
votes
3 answers

NLTK - WordNet: list of long words

I would like to find words in WordNet that are at least 18 characters long. I tried the following code: from nltk.corpus import wordnet as wn sorted(w for w in wn.synset().name() if len(w)>18) I get the following error message: sorted(w for w in…
Cornelius
  • 63
  • 1
  • 4
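
The error comes from calling wn.synset() with no arguments; the usual fix is to iterate the lemma names instead, e.g. sorted(w for w in wn.all_lemma_names() if len(w) >= 18). The filtering itself is plain Python; this sketch runs it over a stand-in word list so it works without the WordNet download:

```python
# Stand-in for wn.all_lemma_names(); with NLTK installed you would iterate
# the WordNet corpus reader directly instead of this hand-picked list.
words = ["cat", "uncharacteristically", "antidisestablishmentarianism", "dog"]

# Keep only words of at least 18 characters, sorted alphabetically.
long_words = sorted(w for w in words if len(w) >= 18)
print(long_words)
```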
6
votes
3 answers

concordance for a phrase using NLTK in Python

Is it possible to get concordance for a phrase in NLTK? import nltk from nltk.corpus import PlaintextCorpusReader corpus_loc = "c://temp//text//" files = ".*\.txt" read_corpus = PlaintextCorpusReader(corpus_loc, files) corpus =…
Naresh MG
  • 633
  • 2
  • 11
  • 19
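
nltk.Text.concordance() indexes single tokens, so a phrase search needs a manual sliding-window scan over the token list. A stdlib sketch (the window size is an arbitrary choice):

```python
def phrase_concordance(tokens, phrase, window=3):
    """Return context strings for every occurrence of a multi-word phrase."""
    phrase_toks = phrase.split()
    n = len(phrase_toks)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_toks:
            left = tokens[max(0, i - window):i]          # up to `window` tokens before
            right = tokens[i + n:i + n + window]         # up to `window` tokens after
            hits.append(" ".join(left + ["[" + phrase + "]"] + right))
    return hits

tokens = "the quick brown fox jumps over the lazy dog".split()
print(phrase_concordance(tokens, "lazy dog"))
```

The same function works on the token stream of a PlaintextCorpusReader via corpus.words().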
6
votes
2 answers

Training Tagger with Custom Tags in NLTK

I have a document with tagged data in the format Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model based on a set of these type of…
Hamman Samuel
  • 2,350
  • 4
  • 30
  • 41
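
Before any tagger can be trained, the bracketed annotations need to become (token, tag) pairs. A regex sketch that converts them to BIO-style tags, assuming tags are uppercase and annotation spans never nest:

```python
import re

# Matches either a bracketed annotation "[TAG some words]" or a plain token.
# Assumes uppercase tag names and no nested brackets.
ANNOT = re.compile(r"\[([A-Z_]+) ([^\]]+)\]|(\S+)")

def to_token_tags(text):
    """Convert '[KEYWORD phone number]' style markup into BIO (token, tag) pairs."""
    pairs = []
    for tag, span, plain in ANNOT.findall(text):
        if plain:
            pairs.append((plain, "O"))
        else:
            toks = span.split()
            pairs.append((toks[0], "B-" + tag))
            pairs.extend((t, "I-" + tag) for t in toks[1:])
    return pairs

print(to_token_tags("my [KEYWORD phone number] is [PHONE 7802708523]"))
```

The resulting sequences can feed nltk's trainable taggers or a CRF toolkit directly.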
6
votes
1 answer

scikit weighted f1 score calculation and usage

I have a question regarding weighted average in sklearn.metrics.f1_score sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='weighted', sample_weight=None) Calculate metrics for each label, and find their average, weighted…
com
  • 2,606
  • 6
  • 29
  • 44
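
With average='weighted', sklearn computes a per-label F1 and averages the scores using each label's support (its count in y_true) as the weight. A stdlib re-implementation makes the arithmetic explicit:

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-label F1, mirroring average='weighted'."""
    labels = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    score = 0.0
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == lab)
        pred = sum(1 for p in y_pred if p == lab)   # predicted count
        true = sum(1 for t in y_true if t == lab)   # support
        prec = tp / pred if pred else 0.0
        rec = tp / true if true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (true / total) * f1                # weight by support
    return score

print(weighted_f1([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

Because labels with more true instances dominate the average, weighted F1 can exceed accuracy and is not between precision and recall the way macro F1 is.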
6
votes
1 answer

Using scikit-learn to train an NLP log-linear model for NER

I wonder how to use sklearn.linear_model.LogisticRegression to train an NLP log-linear model for named-entity recognition (NER). A typical log-linear model defines a conditional probability as follows: with: x: the current word y: the class of…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
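
sklearn leaves feature extraction to you: each token becomes a dict of indicator features, which DictVectorizer can one-hot encode before LogisticRegression fits the log-linear weights. A sketch of the feature function (the particular features chosen here are illustrative, not a fixed recipe):

```python
def token_features(tokens, i):
    """Indicator features for token i, in the dict form DictVectorizer expects."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),                                  # capitalization cue
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",        # left context
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "suffix3": w[-3:].lower(),                                # crude morphology
    }

tokens = "John lives in Berlin".split()
print(token_features(tokens, 3))
```

From there the usual pipeline is DictVectorizer followed by LogisticRegression, one feature dict per token and one class label (entity tag) per token.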
6
votes
1 answer

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect =…
Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
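
CountVectorizer's default token_pattern discards punctuation during tokenization; passing a custom pattern such as token_pattern=r"\w+|[^\w\s]" keeps each punctuation mark as its own token. The same regex can be tried standalone:

```python
import re

# \w+ grabs word characters; [^\w\s] grabs any single non-word, non-space
# character, so each punctuation mark becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

text = "I really like python, it's pretty awesome."
print(TOKEN_RE.findall(text))
```

With that pattern, CountVectorizer(ngram_range=(4, 4), token_pattern=r"\w+|[^\w\s]") builds 4-grams in which commas, apostrophes, and periods count as tokens.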
6
votes
1 answer

Sentence tokenization for texts that contains quotes

Code: from nltk.tokenize import sent_tokenize pprint(sent_tokenize(unidecode(text))) Output: [After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat…
Abhishek Bhatia
  • 9,404
  • 26
  • 87
  • 142
6
votes
1 answer

How to Traverse an NLTK Tree object?

Given a bracketed parse, I could convert it into a Tree object in NLTK as such: >>> from nltk.tree import Tree >>> s = '(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))' >>>…
alvas
  • 115,346
  • 109
  • 446
  • 738
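
nltk.tree.Tree subclasses list, so traversal is ordinary recursion: iterate a node's children, recursing when a child is itself a Tree and stopping at string leaves (with a real Tree you would test isinstance(child, Tree) and read tree.label()). This sketch mirrors that shape with (label, children) tuples so it runs without NLTK:

```python
def traverse(node, depth=0):
    """Depth-first traversal; returns one indented line per node or leaf."""
    if isinstance(node, str):             # leaf token
        return ["  " * depth + node]
    label, children = node                # interior node: (label, children)
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(traverse(child, depth + 1))
    return lines

tree = ("S", [("NP", ["Europe"]), ("VP", [("VBZ", ["is"]), ("PP", [("IN", ["in"])])])])
print("\n".join(traverse(tree)))
```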
6
votes
2 answers

What can I do to speed up Stanford CoreNLP (dcoref/ner)?

I'm processing a large amount of documents using Stanford's CoreNLP library alongside the Stanford CoreNLP Python Wrapper. I'm using the following annotators: tokenize, ssplit, pos, lemma, ner, entitymentions, parse, dcoref along with the…
Ayrton Massey
  • 471
  • 3
  • 13
6
votes
6 answers

how to create exclamations for a particular sentence

I would like to create exclamations for a particular sentence using a Java API. e.g. It's surprising == Isn't it surprising! e.g. It's cold == Isn't it cold! Are there any vendors or tools which help you generate exclamations, provided you give…
user339108
  • 12,613
  • 33
  • 81
  • 112
6
votes
2 answers

Language detection API/Library

Is there a service/library (free or paid) that takes a piece of text and returns its language? I need to go over a million blog posts and determine their languages.
J. Dorian
  • 109
  • 3
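
Libraries such as langdetect or langid handle this at scale. To make the underlying idea concrete, a crude stdlib baseline scores text against per-language stopword profiles; the tiny word sets below are illustrative stand-ins, not real profiles:

```python
# Minimal per-language stopword profiles (illustrative only; real detectors
# use character n-gram statistics over much larger training data).
PROFILES = {
    "en": {"the", "and", "is", "of", "to", "it"},
    "fr": {"le", "la", "et", "est", "de", "il"},
    "de": {"der", "die", "und", "ist", "von", "es"},
}

def guess_language(text):
    """Pick the language whose stopword set overlaps the text the most."""
    words = set(text.lower().split())
    return max(PROFILES, key=lambda lang: len(words & PROFILES[lang]))

print(guess_language("the cat is on the mat"))
print(guess_language("le chat est sur le tapis"))
```

For a million blog posts, a proper library (or an offline model like fastText's language identifier) will be both faster and far more accurate than any stopword heuristic.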
6
votes
6 answers

Detecting similar paragraphs in two documents

I am trying to find similar paragraphs in 2 documents. Each document has many paragraphs of multiple lines of text. The text in paragraphs has some changes. The words can be inserted or deleted or misspelled. For example Doc1.Para This is one line…
uzair_syed
  • 313
  • 3
  • 16
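
difflib.SequenceMatcher from the standard library tolerates insertions, deletions, and small misspellings, so an all-pairs comparison is a reasonable baseline; the 0.7 threshold below is an assumption to tune on real documents:

```python
from difflib import SequenceMatcher

def similar_paragraphs(doc1_paras, doc2_paras, threshold=0.7):
    """Return (i, j, ratio) for paragraph pairs whose similarity >= threshold."""
    matches = []
    for i, p1 in enumerate(doc1_paras):
        for j, p2 in enumerate(doc2_paras):
            ratio = SequenceMatcher(None, p1, p2).ratio()
            if ratio >= threshold:
                matches.append((i, j, round(ratio, 2)))
    return matches

doc1 = ["This is one line of text.", "A totally different paragraph."]
doc2 = ["This is one lin of text!", "Nothing alike here at all."]
print(similar_paragraphs(doc1, doc2))
```

All-pairs comparison is O(n*m); for large documents, a first pass with cheaper signals (shingling, TF-IDF cosine) can prune candidate pairs before the character-level check.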
6
votes
1 answer

Only ignore stop words for ngram_range=1

I am using CountVectorizer from sklearn...looking to provide a list of stop words and apply the count vectorizer for ngram_range of (1,3). From what I can tell, if a word - say "me" - is in the list of stop words, then it doesn't get seen for…
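
This is expected: CountVectorizer removes stop words from the token stream before n-grams are assembled, so a stopped word vanishes from every n-gram length. A stdlib sketch of that pipeline order (to keep stop-word filtering for unigrams only, one workaround is two vectorizers, with and without stop_words, combined via FeatureUnion):

```python
def ngrams_after_stopwords(text, stop_words, n):
    """Mimic CountVectorizer's order: filter stop words first, then build n-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# With "me" stopped, the bigrams bridge the gap: "let know" instead of
# "let me" / "me know".
print(ngrams_after_stopwords("let me know when", {"me"}, 2))
```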