Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

20185 questions
31
votes
1 answer

What meaning does the length of a Word2vec vector have?

I am using Word2vec through gensim with Google's pretrained vectors trained on Google News. I have noticed that the word vectors I can access by doing direct index lookups on the Word2Vec object are not unit vectors: >>> import numpy >>> from…
Mark Amery
  • 143,130
  • 81
  • 406
  • 459
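As a minimal sketch of the observation in the excerpt above (that looked-up vectors are not unit vectors), the following pure-Python helpers measure a vector's length and rescale it to unit length. The vectors are made-up toy values, not real Word2vec embeddings, and the helper names are hypothetical; nothing here uses the gensim API.

```python
import math


def l2_norm(vector):
    """Return the Euclidean (L2) length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def to_unit(vector):
    """Return a copy of the vector rescaled to length 1."""
    norm = l2_norm(vector)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity depends only on direction, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (l2_norm(a) * l2_norm(b))


# Toy vectors (assumed values for illustration only).
short_vec = [3.0, 4.0]
long_vec = [6.0, 8.0]  # same direction, twice the length

print(l2_norm(short_vec))                      # length of the raw vector
print(to_unit(short_vec))                      # same direction, length 1
print(cosine_similarity(short_vec, long_vec))  # unaffected by the length difference
```

Because cosine similarity ignores length, two vectors pointing the same way compare as identical even when their norms differ, which is why the raw length is easy to overlook.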
31
votes
2 answers

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic similarity tasks. I was wondering if anybody had experience in lemmatizing the corpus before training word2vec and if this…
Luca Fiaschi
  • 3,145
  • 7
  • 31
  • 44
31
votes
2 answers

How do I test whether an nltk resource is already installed on the machine running my code?

I just started my first NLTK project and am confused about the proper setup. I need several resources like the Punkt Tokenizer and the maxent pos tagger. I myself downloaded them using the GUI nltk.download(). For my collaborators I of course want…
Zakum
  • 2,157
  • 2
  • 22
  • 30
31
votes
1 answer

How can I do the train and test steps in GIZA++?

In artificial intelligence methods we have two stages: training and testing. In the training stage we give a huge amount of data to a system, and we normally test it with a smaller volume of data. Then we evaluate the…
m-Abrontan
  • 503
  • 4
  • 7
30
votes
6 answers

Stack Overflow Related questions algorithm

The related questions that appear after entering the title, and those that are in the right side bar when viewing a question seem to suggest very apt questions. Stack Overflow only does a SQL search for it and uses no special algorithms, said…
lprsd
  • 84,407
  • 47
  • 135
  • 168
30
votes
3 answers

Selecting the most fluent text from a set of possibilities via grammar checking (Python)

Some background: I am a literature student at New College of Florida, currently working on an overly ambitious creative project. The project is geared towards the algorithmic generation of poetry. It's written in Python. My Python knowledge and…
floer32
  • 2,190
  • 4
  • 29
  • 50
30
votes
4 answers

Generating questions from text (NLP)

What approaches are there to generating questions from a sentence? Let's say I have a sentence "Jim's dog was very hairy and smelled like wet newspaper" - which toolkit is capable of generating a question like "What did Jim's dog smell like?" or…
Daniel Protopopov
  • 6,778
  • 3
  • 23
  • 39
30
votes
8 answers

How to auto-tag content, algorithms and suggestions needed

I am working with some really large databases of newspaper articles. I have them in a MySQL database, and I can query them all. I am now searching for ways to help me tag these articles with somewhat descriptive tags. All these articles is…
Kasper Grubbe
  • 923
  • 2
  • 14
  • 19
30
votes
6 answers

How to cluster similar sentences using BERT

For ELMo, FastText and Word2Vec, I'm averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences. A good example of the implementation can be seen in this short article:…
30
votes
6 answers

How to perform Lemmatization in R?

This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed saying it was too broad and the only answer it has is not efficient (as it accesses an external…
StrikeR
  • 1,598
  • 5
  • 18
  • 35
30
votes
10 answers

Python - RegEx for splitting text into sentences (sentence-tokenizing)

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals or abbreviations or title of a name or if the sentence…
user3590149
  • 1,525
  • 7
  • 22
  • 25
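A rough sketch of the kind of splitter the excerpt above asks for, using only the standard `re` module: split after `.`, `!` or `?` when followed by whitespace and an uppercase letter or quote, but skip periods that end a known abbreviation. The abbreviation list and the function name are assumptions for illustration, and the heuristic will miss many real-world cases.

```python
import re

# Hypothetical, deliberately small abbreviation list; extend it for real text.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "sr", "jr", "st", "vs", "etc"}

# Candidate boundary: ., ! or ? followed by whitespace and then an
# uppercase letter or an opening quote. Decimals such as 3.50 never match
# because no whitespace follows the period.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z\"'])")


def split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # position just after the punctuation mark
        words = text[start:end - 1].split()
        last_word = words[-1].lower() if words else ""
        # A period right after a known abbreviation is not a boundary.
        if match.group(1) == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "Mr. Smith paid $3.50 for coffee. He left at 5 p.m. sharp! Was it good?"
for sentence in split_sentences(sample):
    print(sentence)
```

The lookahead for an uppercase letter is what keeps "5 p.m. sharp" together here; a sentence that genuinely starts with a lowercase word, or an abbreviation missing from the list, would be split incorrectly.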
30
votes
5 answers

How can I use NLP to parse recipe ingredients?

I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as 1 cup flour, the peel of 2 lemons and 1 cup packed brown sugar etc. What would be the best way of doing this? I am interested in…
Greg
  • 7,233
  • 12
  • 42
  • 53
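One naive starting point for the task in the excerpt above is a rule-based parser for lines shaped like "amount, optional measurement, item". The sketch below assumes a small hypothetical unit list and a hypothetical function name; it does not separate descriptions such as "packed" from the item, and it returns `None` for lines like "the peel of 2 lemons" that do not start with an amount.

```python
import re

# Hypothetical unit vocabulary; a real parser would need a far larger list.
UNITS = {
    "cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
    "gram", "grams", "pound", "pounds",
}

# An amount (integer, decimal, or simple fraction such as 1/2) followed by
# the rest of the line.
LINE = re.compile(r"^\s*(?P<amount>\d+(?:\.\d+)?(?:/\d+)?)\s+(?P<rest>.+?)\s*$")


def parse_ingredient(line):
    """Parse 'amount [unit] item' lines; return None when the line does not fit."""
    match = LINE.match(line)
    if match is None:
        return None
    words = match.group("rest").split()
    measurement = None
    if words and words[0].lower() in UNITS:
        measurement = words[0].lower()
        words = words[1:]
    return {
        "amount": match.group("amount"),
        "measurement": measurement,
        "item": " ".join(words),
    }


for text in ["1 cup flour", "1 cup packed brown sugar", "2 lemons", "the peel of 2 lemons"]:
    print(text, "->", parse_ingredient(text))
```

Lines that the rules reject are exactly the ones where a statistical approach, such as a sequence tagger trained on labeled ingredient lines, would be worth considering.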
29
votes
3 answers

How to get all article pages under a Wikipedia Category and its sub-categories?

I want to get all the article names under a category and its sub-categories. Options I'm aware of: Using the Wikipedia API. Does it have such an option? Downloading the dump. Which format would be better for my usage? There is also an option to search…
Noam
  • 3,341
  • 4
  • 35
  • 64
29
votes
4 answers

What is the difference between the Dialogflow and Rasa NLU bot frameworks?

What is the difference between the Dialogflow bot framework and the Rasa NLU bot framework? Are any other open-source frameworks with NLP support available on the market?
balaji
  • 293
  • 1
  • 3
  • 8
29
votes
3 answers

How can a tree be encoded as input to a neural network?

I have a tree, specifically a parse tree with tags at the nodes and strings/words at the leaves. I want to pass this tree as input into a neural network while preserving its structure. Current approach: Assume we have some dictionary of words…