Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, consider posting on Data Science or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?" (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
6 votes, 2 answers

Extract grocery list out of free text

I am looking for a Python library / algorithm / paper to extract a list of groceries out of free text. For example, "One salad and two beers" should be converted to: {'salad': 1, 'beer': 2}
Uri Goren
6 votes, 2 answers

Defining vocabulary size in text classification

I have a question about defining the vocabulary set needed for feature extraction in text classification. In an experiment, there are two approaches I can think of: 1. Define vocabulary size using both training data and test data, so that no…
antande
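
For context on the question above, here is a minimal sketch of one common setup: the vocabulary is built from the training data only, and any token first seen at test time is mapped to an unknown-word id. This is an assumption about what is being compared, since the excerpt is truncated; `UNK`, `build_vocab`, and `encode` are hypothetical names, and whitespace tokenization is used only to keep the example short.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(train_docs, max_size):
    """Map the max_size most frequent training tokens to integer ids; id 0 is UNK."""
    counts = Counter(tok for doc in train_docs for tok in doc.lower().split())
    vocab = {UNK: 0}
    for tok, _ in counts.most_common(max_size):
        vocab[tok] = len(vocab)
    return vocab


def encode(doc, vocab):
    """Convert a document to ids, sending unseen tokens to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in doc.lower().split()]


vocab = build_vocab(["the cat sat", "the dog"], max_size=10)
print(encode("the bird", vocab))  # "bird" was never seen in training
```

Keeping test data out of vocabulary construction means the test set still measures how the model copes with unseen words.
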
6 votes, 3 answers

Named Entity Recognition with SyntaxNet

I am trying to understand and learn SyntaxNet. I am trying to figure out whether there is any way to use SyntaxNet for Named Entity Recognition on a corpus. Any sample code or helpful links would be appreciated.
Anantha
6 votes, 1 answer

NLTK - Download all NLTK data except corpora from command line without Downloader UI

We can download all NLTK data using: > import nltk > nltk.download('all') Or specific data using: > nltk.download('punkt') > nltk.download('maxent_treebank_pos_tagger') But I want to download all data except 'corpora' files, for example - all…
RAVI
6 votes, 1 answer

How to get constituency-based parse tree from Parsey McParseface

Parsey McParseface returns a dependency-based parse tree by default, but is there a way to get a constituency-based parse tree from it? EDIT: To clarify, by "to get from it" I mean from Parsey itself. Though building a tree from CoNLL output would…
maga
6 votes, 3 answers

How can I split at word boundaries with regexes?

I'm trying to do this: import re sentence = "How are you?" print(re.split(r'\b', sentence)) The result is [u'How are you?'], but I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
oarfish
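
One way to get the output asked for above, as a minimal sketch: instead of splitting on `\b`, match the tokens directly with `re.findall`. The `tokenize` name is hypothetical; the pattern treats each run of word characters as one token and each non-space, non-word character as its own token.

```python
import re


def tokenize(sentence):
    """Return words and individual punctuation marks, dropping whitespace."""
    # \w+       -> a run of letters, digits, or underscores
    # [^\w\s]   -> any single character that is neither a word char nor whitespace
    return re.findall(r"\w+|[^\w\s]", sentence)


print(tokenize("How are you?"))  # ['How', 'are', 'you', '?']
```

Contractions such as "don't" come out as three tokens with this pattern, so it may need adjusting depending on the text.
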
6 votes, 3 answers

Regular expression for counting sentences in a block of text

Possible Duplicate: PHP - How to split a paragraph into sentences. I have a block of text that I would like to separate into sentences. What would be the best way of doing this? I thought of looking for '.', '!', '?' characters, but I realized…
GSto
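
The punctuation-based idea mentioned in the question above can be sketched as follows. The question concerns PHP, but the example is in Python to keep it short; the regular expression itself should carry over to most regex engines that support lookbehind. The `count_sentences` name is hypothetical, and this is a rough heuristic under the assumption that sentences end in '.', '!' or '?' followed by whitespace.

```python
import re


def count_sentences(text):
    """Count sentences by splitting on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Discard empty pieces so an empty input counts as zero sentences.
    return len([p for p in parts if p])


print(count_sentences("Hello there. How are you? Fine!"))  # 3
```

Abbreviations such as "Dr. Smith" are split incorrectly by this approach, which is why a trained sentence tokenizer is often preferred over a bare regex.
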
6 votes, 1 answer

Named Entity Resolution Algorithm

I was trying to build an entity resolution system, where my entities are (i) general named entities, that is, organization, person, location, date, time, money, and percent; (ii) some other entities like product, title of person like president, CEO,…
Coeus2016
6 votes, 1 answer

Multi-Threaded NLP with spaCy pipe

I'm trying to apply the spaCy NLP (Natural Language Processing) pipeline to a big text file like a Wikipedia dump. Here is my code based on spaCy's documentation example: from spacy.en import English input = open("big_file.txt") big_text=…
Sajjad Bay
6 votes, 1 answer

Is it possible to return the analyzed fields in an ElasticSearch >2.0 search?

This question feels very similar to an old question posted here: Retrieve analyzed tokens from ElasticSearch documents, but to see if there are any changes I thought it would make sense to post it again for the latest version of ElasticSearch. We…
luckylwk
6 votes, 1 answer

Using different word2vec training data in spaCy

So I'd like to use some of this training data in spaCy when I use the similarity() method. I'd also like to maybe use the pre-trained vectors on this page. But the spaCy docs seem lacking here. Does anyone know how to do this?
Tom Carrick
6 votes, 1 answer

Intuition behind tf-idf for term extraction

I'm trying to build a dictionary of words using tf-idf. However, intuitively it doesn't make sense. If the inverse document frequency (idf) part of tf-idf calculates the relevance of a term with respect to the entire corpus, then that implies some of…
jCoder
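
To make the tf-idf intuition in the question above concrete, here is a minimal sketch using the plain, unsmoothed definition: tf is the term's relative frequency within a document and idf is log(N / df). Libraries often use smoothed or otherwise adjusted variants, so the exact numbers are an assumption of this particular formula; the `tf_idf` name and the whitespace tokenization are hypothetical simplifications.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({
            tok: (cnt / len(toks)) * math.log(n_docs / df[tok])
            for tok, cnt in tf.items()
        })
    return scores


scores = tf_idf(["the cat sat", "the dog barked"])
print(scores[0])
```

In this toy corpus "the" occurs in every document, so its idf is log(1) = 0 and its score vanishes, while "cat" gets a positive score: a term is weighted up only when it is frequent in one document and rare across the corpus.
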
6 votes, 2 answers

How to correct spelling in a Pandas DataFrame

Using the TextBlob library it is possible to improve the spelling of strings by defining them as TextBlob objects first and then using the correct method. Example: from textblob import TextBlob data = TextBlob('Two raods diverrged in a yullow waod…
RDJ
6 votes, 1 answer

Why is the Stanford parser with NLTK not correctly parsing a sentence?

I am using the Stanford parser with NLTK in Python and got help from "Stanford Parser and NLTK" to set up the Stanford NLP libraries. from nltk.parse.stanford import StanfordParser from nltk.parse.stanford import StanfordDependencyParser parser =…
Nomiluks
6 votes, 1 answer

Result difference in Stanford NER tagger: NLTK (Python) vs Java

I am using both Python and Java to run the Stanford NER tagger, but I am seeing a difference in the results. For example, when I input the sentence "Involved in all aspects of data modeling using ERwin as the primary software for this.", Java…
aerin