Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, consider posting on Data Science or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one. See "Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?" (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
6
votes
1 answer

Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one. Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc…
Borislav Stoilov
  • 3,247
  • 2
  • 21
  • 46
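A minimal sketch of the comparison step this question asks about, assuming each unknown document has already been turned into a vector (in gensim this is usually done with the model's `infer_vector` method, which is not called here). The helper name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the original question.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two equal-length numeric vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    # Define the similarity as 0.0 when either vector has zero length.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for two inferred document vectors.
doc_vec_1 = [0.2, 0.1, 0.7]
doc_vec_2 = [0.1, 0.3, 0.6]
print(cosine_similarity(doc_vec_1, doc_vec_2))
```

Inferred vectors can vary between calls for the same text, so the resulting score is best treated as approximate.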
6
votes
3 answers

natural language processing fix for combined words

I have some text that was generated by another system. It combined some words together in what I assume was some sort of wordwrap by-product. So something simple like 'the dog' is combined into 'thedog'. I checked the ASCII and Unicode string to…
rich
  • 595
  • 1
  • 7
  • 15
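One common way to approach the 'thedog' case is dictionary-based word segmentation. The sketch below is a hedged illustration, not the accepted answer: the function name `split_joined_words` and the tiny vocabulary are assumptions, and a real word list would be needed in practice.

```python
def split_joined_words(text, vocabulary):
    """Return one split of `text` into vocabulary words, or None if none exists."""
    n = len(text)
    # best[i] holds a list of words covering text[:i], or None if no split exists.
    best = [None] * (n + 1)
    best[0] = []
    max_len = max((len(word) for word in vocabulary), default=0)
    for end in range(1, n + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                best[end] = best[start] + [piece]
                break
    return best[n]


# Illustrative vocabulary; a real application would load a full word list.
vocabulary = {"the", "dog", "a", "cat"}
print(split_joined_words("thedog", vocabulary))  # ['the', 'dog']
```

This returns the first valid split it finds. Choosing among ambiguous splits would need word frequencies or a language model, which this sketch does not attempt.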
6
votes
2 answers

Custom sentence segmentation in spaCy

I want spaCy to use the sentence segmentation boundaries that I provide instead of its own processing. For example: get_sentences("Bob meets Alice. @SentBoundary@ They play together.") # => ["Bob meets Alice.", "They play together."] # two…
Harsh Trivedi
  • 1,594
  • 14
  • 27
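The behaviour described in the example can be sketched in plain Python, assuming the marker is always the literal string `@SentBoundary@`. This is not a spaCy pipeline component; it only illustrates the expected input and output of the `get_sentences` call from the question.

```python
def get_sentences(text, marker="@SentBoundary@"):
    """Split `text` at each explicit boundary marker and trim whitespace."""
    parts = text.split(marker)
    # Drop empty pieces, e.g. when the marker appears at the very start or end.
    return [part.strip() for part in parts if part.strip()]


print(get_sentences("Bob meets Alice. @SentBoundary@ They play together."))
# ['Bob meets Alice.', 'They play together.']
```

Inside spaCy itself, the same idea would have to be expressed as a custom step that marks sentence starts before parsing, and the details of that depend on the spaCy version in use.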
6
votes
2 answers

Sklearn Pipeline ValueError: could not convert string to float

I'm playing around with sklearn and NLP for the first time, and thought I understood everything I was doing up until I didn't know how to fix this error. Here is the relevant code (largely adapted from…
Mike
  • 85
  • 1
  • 9
6
votes
1 answer

Find the similarity between two string columns of a DataFrame

I am new to programming. I have a pandas data frame in which two string columns are present. Data frame is like below: Col-1 Col-2 Update have a account Account account summary AccountDTH Cancel Balance …
PANDA
  • 137
  • 2
  • 9
6
votes
5 answers

Mallet topic modelling

I have been using mallet for inferring topics for a text file containing 100,000 lines (around 34 MB in mallet format). But now I need to run it on a file containing a million lines (around 180 MB) and I am getting a java.lang.outofmemory…
fayaz
  • 61
  • 2
6
votes
1 answer

WordNetLemmatizer: Different handling of wn.ADJ and wn.ADJ_SAT?

I need to lemmatize text using nltk. In order to do this, I apply nltk.pos_tag to each sentence and then convert the resulting Penn Treebank tags (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) to WordNet tags. I need to…
Simon Hessner
  • 1,757
  • 1
  • 22
  • 49
6
votes
1 answer

Extracting nouns from Noun Phrase in NLP

Could anyone please tell me how to extract only the nouns from the following output: I have tokenized and parsed the string "Give me the review of movie" based on a given grammar using the following…
Amanda
  • 79
  • 2
  • 3
6
votes
1 answer

Why do my TensorFlow model outputs become NaN after x epochs?

After 85 epochs the loss (a cosine distance) of my model (an RNN with 3 LSTM layers) becomes NaN. Why does it happen and how can I fix it? Outputs of my model also become NaN. My model: tf.reset_default_graph() seqlen = tf.placeholder(tf.int32,…
François MENTEC
  • 1,150
  • 4
  • 12
  • 25
6
votes
1 answer

Which romanization standard should be used to improve ICU4j transliteration for Arabic-Latin?

We have a requirement to transliterate Arabic text to Latin characters (without diacritical marks) and display them to users. We are currently using IBM ICU4j for this. The API doesn't transliterate the Arabic text well into proper readable Latin…
Kamlesh Sharma
  • 222
  • 1
  • 7
  • 23
6
votes
2 answers

How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?

I'm trying to get the text with its punctuation as it is important to consider the latter in my doc2vec model. However, the wikicorpus retrieves only the text. After searching the web I found these pages: Page from gensim GitHub issues section. It…
Ghaliamus
  • 101
  • 1
  • 4
6
votes
1 answer

NLP: Pre-processing in doc2vec / word2vec

A few papers on the topics of word and document embeddings (word2vec, doc2vec) mention that they used the Stanford CoreNLP framework to tokenize/lemmatize/POS-tag the input words/sentences: The corpora were lemmatized and POS-tagged with the…
Simon Hessner
  • 1,757
  • 1
  • 22
  • 49
6
votes
3 answers

Determine if a text extract from spacy is a complete sentence

We are working on sentences extracted from a PDF. The problem is that it includes the title, footers, table of contents, etc. Is there a way to determine if the sentence we get when we pass the document to spaCy is a complete sentence? Is there a way…
CrabbyPete
  • 505
  • 9
  • 18
6
votes
1 answer

I cannot understand the skipgrams() function in keras

I am trying to understand the skipgrams() function in keras by using the following code from keras.preprocessing.text import * from keras.preprocessing.sequence import skipgrams text = "I love money" #My test sentence tokenizer =…
Raven Cheuk
  • 2,903
  • 4
  • 27
  • 54
6
votes
1 answer

Remove Spacy downloaded model

After downloading and linking a spaCy model (en large) by: python -m spacy download en_core_web_lg which is around 850 MB of data. How can I find and delete the data (downloaded model) on my Mac to free some space? spaCy: 2.0.18 Python: 3.6.9 …
Carson Yau
  • 489
  • 3
  • 9
  • 17