Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If your question does not directly concern implementation, consider posting it on Data Science or Artificial Intelligence instead; otherwise it is probably off-topic here. Please choose one site only and do not cross-post to more than one. See "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?" (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
29 votes, 3 answers

What is a good Java library for Parts-Of-Speech tagging?

I'm looking for a good open source POS tagger in Java. Here's what I have come up with so far: LingPipe, Stanford, LBJ, FastTag. Anybody got any recommendations?
Glenn
29 votes, 5 answers

How does Amazon's Statistically Improbable Phrases work?

How does something like Statistically Improbable Phrases work? According to Amazon: Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify…
ʞɔıu
29 votes, 5 answers

How to determine the number of topics for LDA?

I am new to LDA and want to use it in my work, but I have run into some problems. To get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics", I know that I can calculate log P(w|z)…
Chelsea Wang
29 votes, 1 answer

Pointwise mutual information on text

I was wondering how one would calculate the pointwise mutual information for text classification. To be more exact, I want to classify tweets into categories. I have a dataset of annotated tweets, and I have a dictionary per category of…
Olivier_s_j
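
For readers unfamiliar with the quantity this question asks about, here is a minimal sketch of pointwise mutual information between a word and a category, PMI(w, c) = log2(P(w, c) / (P(w) P(c))), estimated from document-level counts. The function name `pmi_scores`, the toy tweets, and the choice to count each word once per document are assumptions made for illustration; they are not taken from the question or its answers.

```python
import math
from collections import Counter


def pmi_scores(labeled_docs):
    """Return {(word, label): PMI} estimated from document-level counts.

    labeled_docs is a list of (tokens, label) pairs (hypothetical input format).
    """
    n_docs = len(labeled_docs)
    word_counts = Counter()   # documents containing the word
    label_counts = Counter()  # documents carrying the label
    joint_counts = Counter()  # documents with both the word and the label

    for tokens, label in labeled_docs:
        label_counts[label] += 1
        for word in set(tokens):  # count each word once per document
            word_counts[word] += 1
            joint_counts[(word, label)] += 1

    scores = {}
    for (word, label), joint in joint_counts.items():
        p_joint = joint / n_docs
        p_word = word_counts[word] / n_docs
        p_label = label_counts[label] / n_docs
        scores[(word, label)] = math.log2(p_joint / (p_word * p_label))
    return scores


# Toy annotated "tweets" (made up for this sketch).
docs = [
    (["great", "match", "today"], "sports"),
    (["great", "goal"], "sports"),
    (["new", "phone", "today"], "tech"),
    (["great", "phone"], "tech"),
]
scores = pmi_scores(docs)
print(scores[("goal", "sports")])   # positive: "goal" appears only in sports
print(scores[("today", "sports")])  # zero: "today" is independent of the label here
```

Pairs that never co-occur are simply absent from the result, since their PMI would be negative infinity. Real data would likely need smoothing or a minimum-count threshold, because PMI is unstable for rare words.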
28 votes, 12 answers

How can I split a text into sentences using the Stanford parser?

How can I split a text or paragraph into sentences using the Stanford parser? Is there a method that can extract sentences, such as the getSentencesFromString() provided for Ruby?
S Gaber
28 votes, 3 answers

Ease of use: Stanford CoreNLP vs. OpenNLP

I am looking to use a suite of NLP tools for a personal project, and I was wondering whether Stanford's CoreNLP or OpenNLP is easier to use. Or is there another free package you would recommend? I haven't really done any NLP before, so I am looking for…
Pratik Thaker
28 votes, 4 answers

How to train the GloVe algorithm on my own corpus

I tried to follow this, but somehow I wasted a lot of time and ended up with nothing useful. I just want to train a GloVe model on my own corpus (~900 MB corpus.txt file). I downloaded the files provided in the link above and compiled them using Cygwin…
Codir
28 votes, 1 answer

Data sets for emotion detection in text

I'm implementing a system that could detect the human emotion in text. Are there any manually annotated data sets available for supervised learning and testing? Here are some interesting datasets: https://dataturks.com/projects/trending
ekka
28 votes, 3 answers

Is it possible to train the Stanford NER system to recognize more named entity types?

I'm using some NLP libraries now (Stanford and NLTK). For Stanford, I saw the demo, but I want to ask if it is possible to use it to identify more entity types. Currently the Stanford NER system (as the demo shows) can recognize entities as…
JudyJiang
28 votes, 6 answers

POS tagging in German

I am using NLTK to extract nouns from a text string, starting with the following command: tagged_text = nltk.pos_tag(nltk.Text(nltk.word_tokenize(some_string))) It works fine in English. Is there an easy way to make it work for German as well? (I…
Johannes Meier
27 votes, 1 answer

Understanding NLTK collocation scoring for bigrams and trigrams

Background: I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use the collocation facilities in NLTK to score word pairs, with the higher scoring pair being the most…
ccgillett
27 votes, 3 answers

How to build semantic search for a given domain

We are trying to solve a problem where we want to do a semantic search on our data set, i.e., we have domain-specific data (for example, sentences talking about automobiles). Our data is just a bunch of sentences, and what we want is to give a…
27 votes, 1 answer

Parsing city of origin / destination city from a string

I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to ultimately have two new columns titled 'origin' and…
Merv Merzoug
27 votes, 3 answers

Combining a Tokenizer into a Grammar and Parser with NLTK

I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar. My goal is to build a grammar for a particular text corpus. (Initial question: Should I even try…
speedplane
27 votes, 4 answers

How to speed up Gensim Word2vec model load time?

I'm building a chatbot so I need to vectorize the user's input using Word2Vec. I'm using a pre-trained model with 3 million words by Google (GoogleNews-vectors-negative300). So I load the model using Gensim: import gensim model =…
Marcus Holm