Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If your question is not directly about implementation, consider posting on Data Science or Artificial Intelligence instead; otherwise it is probably off-topic here. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
51
votes
5 answers

Tag generation from text content

I am curious whether an algorithm/method exists to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools. Additionally, I would be grateful if you could point to any Python-based solution/library…
Hellnar
  • 62,315
  • 79
  • 204
  • 279
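A minimal sketch of the occurrence-ratio idea asked about above, using only the standard library; the sample text and the tiny stop-word list are illustrative assumptions, not part of the question:

```python
import re
from collections import Counter

# Hypothetical sample text and a tiny illustrative stop-word list.
text = "Natural language processing extracts useful information from natural language data."
stopwords = {"from", "the", "and", "a", "an", "of", "is", "are"}

# Tokenize, drop stop words, and rank the remaining terms by frequency.
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(t for t in tokens if t not in stopwords)
print(counts.most_common(5))  # crude keyword/tag candidates
```

A tf-idf weighting over a whole corpus (see the TfidfVectorizer sketch further down) usually gives better tag candidates than raw counts.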
50
votes
14 answers

Load pretrained GloVe vectors in Python

I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word-vector binary file using gensim, but I don't know how to do it when the file is in text format.
Same
  • 759
  • 2
  • 9
  • 15
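A rough sketch of the usual plain-Python approach, assuming the file has one word per line followed by its float components; the file name glove.6B.100d.txt is just an example:

```python
import numpy as np

def load_glove(path):
    """Read a GloVe .txt file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

glove = load_glove("glove.6B.100d.txt")  # hypothetical file name
print(glove["king"][:5])
```

Recent gensim releases can also read such a file directly with KeyedVectors.load_word2vec_format(path, binary=False, no_header=True), since GloVe text files lack the word2vec header line.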
50
votes
3 answers

Scikit-learn TfidfVectorizer: How to get the top n terms with the highest tf-idf score

I am working on a keyword extraction problem. Consider the very general case: from sklearn.feature_extraction.text import TfidfVectorizer tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english') t = """Two Travellers, walking in the noonday…
AbtPst
  • 7,778
  • 17
  • 91
  • 172
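A hedged sketch of ranking terms by tf-idf score for one document, assuming a recent scikit-learn (where get_feature_names_out is available); the two-document corpus is made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus; the question's own tokenizer and full text are omitted.
docs = ["two travellers walking in the noonday sun",
        "a traveller rests in the shade of a plane tree"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Top-n terms for the first document, ranked by tf-idf score.
n = 3
row = X[0].toarray().ravel()
terms = tfidf.get_feature_names_out()
top = row.argsort()[::-1][:n]
print([(terms[i], round(row[i], 3)) for i in top])
```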
50
votes
6 answers

NLTK Named Entity Recognition with Custom Data

I'm trying to extract named entities from my text using NLTK. I find that NLTK NER is not very accurate for my purpose and I want to add some more tags of my own as well. I've been trying to find a way to train my own NER, but I don't seem to be…
user1502248
  • 501
  • 1
  • 4
  • 3
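For reference, this is the stock NLTK pipeline the question starts from; adding custom entity types means training a separate tagger/chunker (or switching to a library that supports custom NER training), which this sketch does not cover:

```python
import nltk
# One-time downloads: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words.

sentence = "Mark works at Google in London."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

for subtree in tree:
    if isinstance(subtree, nltk.Tree):  # named-entity chunks
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)  # e.g. PERSON Mark, ORGANIZATION Google, GPE London
```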
48
votes
7 answers

Unsupervised Sentiment Analysis

I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work. My question is: Has anyone attempted just doing a…
Trindaz
  • 17,029
  • 21
  • 82
  • 111
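One way to score sentiment without a labelled training set is a lexicon-based scorer; a minimal sketch using NLTK's VADER (the example sentence is made up):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# One-time download: nltk.download("vader_lexicon")

# Lexicon-based scoring needs no 'positive'/'negative' training texts.
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The battery life is great, but the screen is awful."))
# -> {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```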
46
votes
2 answers

Definition of downstream tasks in NLP

What does the term "downstream tasks" mean in NLP? I have seen it used in several articles, but I can't understand the idea behind it.
KF2
  • 9,887
  • 8
  • 44
  • 77
46
votes
4 answers

Using NLTK and WordNet, how do I convert a simple-tense verb into its present, past or past-participle form?

Using NLTK and WordNet, how do I convert a simple-tense verb into its present, past or past-participle form? For example, I want to write a function that gives me the verb in the expected form, as follows: v = 'go' present = present_tense(v) print…
Software Enthusiastic
  • 25,147
  • 16
  • 58
  • 68
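WordNet itself only maps inflected forms back to their lemma; generating a specific form needs a separate tool. A rough sketch, where the pattern package (and its pattern.en module) is an assumption, not part of NLTK/WordNet:

```python
from nltk.corpus import wordnet as wn
# One-time download: nltk.download("wordnet")

# WordNet can normalise an inflected verb back to its lemma...
print(wn.morphy("went", wn.VERB))    # 'go'

# ...but generating forms needs another library, e.g. pattern (an assumption here).
from pattern.en import conjugate, lexeme, PAST
print(lexeme("go"))                  # ['go', 'goes', 'going', 'went', 'gone']
print(conjugate("go", tense=PAST))   # 'went'
```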
46
votes
7 answers

How to detect the language of user-entered text?

I am dealing with an application that accepts user input in different languages (currently fixed at 3 languages). The requirement is that users can enter text without having to select the language via a provided checkbox in the UI. Is there an…
ManBugra
  • 1,289
  • 2
  • 14
  • 20
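A minimal sketch using the langdetect package (an assumption; the question does not name a library); restricting the result to the application's three known languages would be a small extra check:

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed for stable output

print(detect("Bonjour tout le monde"))  # 'fr'
print(detect("Hello everyone"))         # 'en'
```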
46
votes
4 answers

How to use Gensim doc2vec with pre-trained word vectors?

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec? Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector…
Stergios
  • 3,126
  • 6
  • 33
  • 55
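By default Doc2Vec learns its word vectors from the same training corpus, and whether pretrained vectors can be injected depends on the gensim version; the sketch below therefore only shows the plain training path, with a hypothetical toy corpus:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus; each document gets a string tag.
corpus = [TaggedDocument(words=["natural", "language", "processing"], tags=["doc0"]),
          TaggedDocument(words=["word", "vectors", "and", "documents"], tags=["doc1"])]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
print(model.infer_vector(["language", "vectors"])[:5])
```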
45
votes
4 answers

TFIDF for Large Dataset

I have a corpus of around 8 million news articles and I need to get their TFIDF representation as a sparse matrix. I have been able to do that using scikit-learn for a relatively small number of samples, but I believe it can't be used for…
apurva.nandan
  • 1,061
  • 1
  • 11
  • 19
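One common way to keep memory bounded is a stateless HashingVectorizer followed by a TfidfTransformer; a sketch under the assumption that the articles can be streamed (the three short strings stand in for the real corpus):

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Hypothetical document stream; in practice this would be a generator
# yielding the 8M articles one at a time instead of holding them in RAM.
docs = ["first news article ...", "second news article ...", "third article ..."]

# HashingVectorizer is stateless, so it never builds an in-memory vocabulary.
hasher = HashingVectorizer(n_features=2**20, alternate_sign=False, stop_words="english")
counts = hasher.transform(docs)                  # sparse term counts
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
```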
45
votes
5 answers

Algorithms to detect phrases and keywords from text

I have around 100 megabytes of text, without any markup, divided into approximately 10,000 entries. I would like to automatically generate a 'tag' list. The problem is that there are word groups (i.e. phrases) that only make sense when they are…
Kimvais
  • 38,306
  • 16
  • 108
  • 142
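Collocation detection is one way to find such multi-word phrases; a minimal sketch with gensim's Phrases model, using made-up tokenised entries in place of the real 10,000:

```python
from gensim.models.phrases import Phrases, Phraser

# Hypothetical tokenised entries standing in for the real corpus.
sentences = [["new", "york", "is", "big"],
             ["i", "love", "new", "york"],
             ["machine", "learning", "in", "new", "york"]]

# Learn word pairs that co-occur more often than chance.
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)
print(bigram[["visit", "new", "york"]])  # e.g. ['visit', 'new_york']
```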
44
votes
5 answers

How to fix the error "SystemError: initialization of _internal failed without raising an exception"

I am trying to import the Top2Vec package for NLP topic modelling, but even after upgrading pip and numpy this error still occurs. I tried pip install --upgrade pip and pip install --upgrade numpy. I was expecting to run from top2vec import Top2Vec model =…
Sayonita Ghosh Roy
  • 441
  • 1
  • 3
  • 3
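This particular SystemError is commonly reported when the installed numpy is newer than what numba (a Top2Vec dependency) supports; a quick check, assuming both packages are installed:

```python
# Check whether the installed numpy/numba pair is the likely culprit.
import numpy
import numba

print("numpy:", numpy.__version__, "numba:", numba.__version__)
# If `import numba` itself fails with this SystemError, pinning numpy to an
# older release supported by that numba build (and reinstalling) is the usual fix.
```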
44
votes
4 answers

Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating options to extract person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is then added as…
Karussell
  • 17,085
  • 16
  • 97
  • 197
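One free option (an assumption on my part; the question itself leaves the tooling open) is spaCy's pretrained NER, whose output can be written as extra fields into the Lucene/ElasticSearch documents:

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tim Berners-Lee founded the World Wide Web Consortium in Geneva.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. PERSON, ORG, GPE -> extra index fields
```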
44
votes
9 answers

CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

I got the following error when I ran my PyTorch deep learning model in Google Colab /usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias) 1370 ret = torch.addmm(bias, input, weight.t()) 1371 …
Mr. NLP
  • 891
  • 1
  • 8
  • 20
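A frequent cause behind this CUDA error is an out-of-range index into an embedding or output layer, which only surfaces as a clear IndexError on the CPU; a minimal sketch reproducing that pattern (the sizes and ids are made up):

```python
import torch
import torch.nn as nn

# Minimal sketch of the usual culprit: a token/label id >= num_embeddings.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_ids = torch.tensor([3, 12])  # 12 is out of range

try:
    emb(bad_ids)                 # on CPU this raises a clear IndexError
except IndexError as e:
    print("out-of-range id:", e)

# On the GPU the same mistake often shows up later as
# CUDA error: CUBLAS_STATUS_ALLOC_FAILED (or a device-side assert), so
# re-running on CPU or with CUDA_LAUNCH_BLOCKING=1 is a common way to debug it.
```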
44
votes
1 answer

Doc2Vec Get most similar documents

I am trying to build a document retrieval model that returns documents ordered by their relevance with respect to a query or search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is in the form of a…
Clock Slave
  • 7,627
  • 15
  • 68
  • 109
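A hedged sketch of the usual retrieval pattern with gensim 4.x (model.dv was model.docvecs in older releases); the three tagged documents are a made-up stand-in for the real dataset:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus standing in for the real dataset.
corpus = [TaggedDocument(["document", "retrieval", "with", "doc2vec"], ["doc0"]),
          TaggedDocument(["search", "query", "relevance", "ranking"], ["doc1"]),
          TaggedDocument(["cooking", "pasta", "recipes"], ["doc2"])]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for the query and rank documents by cosine similarity.
query_vec = model.infer_vector(["relevant", "document", "search"])
print(model.dv.most_similar([query_vec], topn=2))
```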