Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches.
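
As a rough illustration of the rule-based end of that spectrum, here is a minimal sketch in Python. The tiny lexicon and the helper names `tokenize` and `rule_based_sentiment` are hypothetical and chosen only for this example; they do not come from any toolkit listed on this page.

```python
import re

# Hypothetical hand-written lexicon (an assumption for this sketch);
# a real system would rely on a much larger curated resource.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def tokenize(text):
    """Lowercase the text and pull out word tokens with a regular expression."""
    return re.findall(r"[a-z']+", text.lower())


def rule_based_sentiment(text):
    """Count lexicon hits and return 'positive', 'negative', or 'neutral'."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(rule_based_sentiment("I love this great parser"))
```

A machine-learning approach would typically replace the hand-written lexicon with weights learned from labelled examples, while keeping a similar tokenize-then-score shape.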

NOTE: If your question does not directly concern implementation, consider posting on Data Science or Artificial Intelligence instead; here it is probably off-topic. Choose one site only and do not cross-post; see "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?" (tl;dr: no).

NLP tasks

- Text pre-processing
- Coreference resolution
- Dependency parsing [parse-tree]
- Document summarization [summarization]
- Named entity recognition (NER) [named-entity-recognition]
- Information extraction (IE) [information-retrieval] [information-extraction]
- Language modeling
- Part-of-speech (POS) tagging [part-of-speech]
- Morphological analysis and wordform generation
- Phrase-structure (constituency) parsing [parse-tree]
- Machine translation (MT) [machine-translation]
- Question answering (QA) [nlp-question-answering]
- Sentiment analysis [sentiment-analysis]
- Semantic parsing [semantic-analysis]
- Text categorization [text-classification] [document-classification]
- Textual entailment detection
- Topic modeling [topic-modeling]
- Word sense disambiguation (WSD) [word-sense-disambiguation]

Beginner books on Natural Language Processing

Popular software packages

General-purpose toolkits
- Natural Language Toolkit (NLTK) (Python) [nltk]
- OpenNLP (Java) [opennlp]
- SharpNLP (.NET) [sharpnlp]
- ClearNLP (Java) [clearnlp]
- Mate (Java)
- Stanford CoreNLP (Java) [stanford-nlp]
- Treat (Ruby)
- Mallet (Java) [mallet]
- spaCy (Python) [spacy]
- Pattern (Python) [python-pattern]

Phrase-structure parsers
- Stanford Parser (Java) [stanford-nlp]
- Berkeley Parser (Java)
- BLLIP (Charniak-Johnson) Parser (C++, Python) [charniak-parser]

Dependency parsers
- Stanford Dependencies (packaged with Stanford Parser) (Java) [stanford-nlp]
- MaltParser (Java)
- MSTParser (Java)
- UDPipe

Proofreading software
- LanguageTool (Java) [languagetool]

20185 questions

1 answer

Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one. Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc…

python machine-learning nlp gensim doc2vec

asked Nov 27 '18 at 15:34

Borislav Stoilov

3 answers

natural language processing fix for combined words

I have some text that was generated by another system. It combined some words together in what I assume was some sort of word-wrap by-product. So something simple like 'the dog' is combined into 'thedog'. I checked the ASCII and Unicode string to…

regex nlp

asked Mar 15 '11 at 23:41

rich

2 answers

Custom sentence segmentation in Spacy

I want spaCy to use the sentence segmentation boundaries as I provide instead of its own processing. For example: get_sentences("Bob meets Alice. @SentBoundary@ They play together.") # => ["Bob meets Alice.", "They play together."] # two…

python nlp spacy

asked Sep 22 '18 at 16:03

Harsh Trivedi

2 answers

Sklearn Pipeline ValueError: could not convert string to float

I'm playing around with sklearn and NLP for the first time, and thought I understood everything I was doing up until I didn't know how to fix this error. Here is the relevant code (largely adapted from…

python scikit-learn nlp text-classification

asked Aug 31 '18 at 21:59

Mike

1 answer

Find the similarity between two string columns of a DataFrame

I am new to programming. I have a pandas DataFrame in which two string columns are present. The DataFrame is like below: Col-1 Col-2 Update have a account Account account summary AccountDTH Cancel Balance …

python string pandas nlp similarity

asked Aug 20 '18 at 15:57

PANDA

5 answers

Mallet topic modelling

I have been using Mallet to infer topics for a text file containing 100,000 lines (around 34 MB in Mallet format). But now I need to run it on a file containing a million lines (around 180 MB) and I am getting a java.lang.outofmemory…

java nlp machine-learning mallet

asked Mar 02 '11 at 13:48

fayaz

1 answer

WordNetLemmatizer: Different handling of wn.ADJ and wn.ADJ_SAT?

I need to lemmatize text using nltk. In order to do this, I apply nltk.pos_tag to each sentence and then convert the resulting Penn Treebank tags (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to WordNet tags. I need to…

python nlp nltk wordnet lemmatization

asked Aug 01 '18 at 13:17

Simon Hessner

1 answer

Extracting nouns from Noun Phase in NLP

Could anyone please tell me how to extract only the nouns from the following output: I have tokenized and parsed the string "Give me the review of movie" based on a given grammar using following…

python django nlp

asked Feb 28 '11 at 15:14

Amanda

1 answer

Why my tensorflow model outputs become NaN after x epochs?

After 85 epochs the loss (a cosine distance) of my model (an RNN with 3 LSTM layers) becomes NaN. Why does it happen and how can I fix it? The outputs of my model also become NaN. My model: tf.reset_default_graph() seqlen = tf.placeholder(tf.int32,…

python tensorflow nlp deep-learning

asked Jun 26 '18 at 08:26

François MENTEC

1 answer

Which romanization standard should be used to improve ICU4j transliteration for Arabic-Latin?

We have a requirement to transliterate Arabic text to Latin characters (without diacritical marks) and display them to users. We are currently using IBM ICU4j for this. The API doesn't transliterate the Arabic text well into proper readable Latin…

java nlp transliteration transcription icu4j

asked Jun 20 '18 at 07:12

Kamlesh Sharma

2 answers

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

I'm trying to get the text with its punctuation, as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieves only the text. After searching the web I found these pages: Page from gensim github issues section. It…

python nlp gensim doc2vec

asked Jun 05 '18 at 09:48

Ghaliamus

1 answer

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences: The corpora were lemmatized and POS-tagged with the…

nlp stanford-nlp word2vec gensim doc2vec

asked May 29 '18 at 12:03

Simon Hessner

3 answers

Determine if a text extract from spacy is a complete sentence

We are working on sentences extracted from a PDF. The problem is that it includes the title, footers, table of contents, etc. Is there a way to determine if the sentence we get when we pass the document to spaCy is a complete sentence? Is there a way…

python nlp

asked May 21 '18 at 18:43

CrabbyPete

1 answer

I cannot understand the skipgrams() function in keras

I am trying to understand the skipgrams() function in keras by using the following code from keras.preprocessing.text import * from keras.preprocessing.sequence import skipgrams text = "I love money" #My test sentence tokenizer =…

python machine-learning nlp keras text-processing

asked May 15 '18 at 10:06

Raven Cheuk

1 answer

Remove Spacy downloaded model

After downloading and linking a spaCy model (en large) by: python -m spacy download en_core_web_lg which is around 850 MB of data. How can I find and delete the data (downloaded model) on my Mac to free some space? Spacy: 2.0.18 Python: 3.6.9 …

python pip nlp spacy

asked May 15 '18 at 07:09

Carson Yau
