Questions tagged [text-classification]

Simply put, text classification means assigning a piece of text to one or more (mostly predefined) categories. It is a common problem in many real-world applications. For example, an automated call centre may want to sort incoming complaints automatically into the most appropriate bucket of problems.

Text classification is a sub-problem of the more general problem of classification. Here the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

binary (binary classification)
one category out of k possible categories (multi-class)
a set of categories out of k possible categories (multi-label).
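
As a rough illustration of these three output types, the sketch below encodes each one for a made-up complaint-routing task. The category names and the `to_indicator` helper are assumptions introduced here for illustration only, not part of any particular library.

```python
# Hypothetical label set for a complaint-routing task (names are illustrative).
CATEGORIES = ["billing", "network", "hardware"]

# Binary: a single yes/no decision per text.
binary_label = True  # e.g. "is this text a complaint?"

# Multi-class: exactly one category out of k.
multi_class_label = "network"

# Multi-label: any subset of the k categories.
multi_label = {"billing", "network"}


def to_indicator(labels, categories):
    """Encode a collection of labels as a 0/1 vector of length k."""
    return [1 if c in labels else 0 for c in categories]


# A multi-class target has exactly one 1; a multi-label target may have several.
multi_class_vector = to_indicator({multi_class_label}, CATEGORIES)
multi_label_vector = to_indicator(multi_label, CATEGORIES)
```

The indicator-vector form is one common way to feed such targets to a classifier; other encodings (such as integer class ids for the multi-class case) work as well.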

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
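
To make the sparsity point concrete, here is a minimal sketch of a bag-of-words representation using only the Python standard library. The example documents, the whitespace tokenisation, and the `sparse_bow` helper are assumptions chosen for illustration; real pipelines typically use a library vectoriser instead.

```python
from collections import Counter

# Toy corpus (made-up complaint texts).
docs = [
    "my internet connection keeps dropping",
    "the invoice shows the wrong amount",
    "router light is blinking red",
]

# Vocabulary: one feature index per distinct word in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def sparse_bow(text):
    """Return {feature_index: count}, storing only the non-zero entries."""
    counts = Counter(word for word in text.split() if word in index)
    return {index[word]: count for word, count in counts.items()}


features = [sparse_bow(doc) for doc in docs]

# Fraction of non-zero entries in the implied documents-by-vocabulary matrix.
density = sum(len(f) for f in features) / (len(docs) * len(vocab))
```

Each document touches only the few vocabulary entries it actually contains, so the fraction of non-zero entries tends to shrink as the vocabulary grows with more documents.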

1694 questions

votes

3 answers

Text classification beyond the keyword dependency and inferring the actual meaning

I am trying to develop a text classifier that will classify a piece of text as Private or Public. Take medical or health information as an example domain. A typical classifier that I can think of considers keywords as the main distinguisher, right?…

python text-classification nlp

asked Mar 04 '19 at 22:00

Nuhil Mehdy

votes

3 answers

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below. import numpy as np import pandas as pd from…

python scikit-learn sparse-matrix text-classification tf-idf

asked Oct 24 '18 at 15:07

ongenz

votes

2 answers

How to do Text classification using word2vec

I want to perform text classification using word2vec. I got vectors of words. ls = [] sentences = lines.split(".") for i in sentences: ls.append(i.split()) model = Word2Vec(ls, min_count=1, size = 4) words =…

python-3.x word2vec gensim text-classification

asked Apr 04 '18 at 06:10

Shubham Agrawal

votes

0 answers

McNemar's test in Python and comparison of classification machine learning models

Is there a good McNemar's test implemented in Python? I don't see it anywhere in Scipy.stats or Scikit-Learn. I may have overlooked some other good packages. Please recommend. McNemar's Test is almost THE test for comparing two classification…

python machine-learning statistics classification text-classification

asked Jan 08 '17 at 23:43

Yo Hsiao

votes

3 answers

Which algorithms to use for one class classification?

I have over 15000 text docs of a specific topic. I would like to build a language model based on the former so that I can present to this model new random text documents of various topics and the algorithm tells if the new doc is of the same…

scikit-learn text-classification

asked Oct 23 '13 at 20:40

Adam Wayland

votes

2 answers

How to get all documents per topic in bertopic modeling

I have a dataset and am trying to convert it to topics using BERTopic modeling, but the problem is that I can't get all the documents of a topic. BERTopic only returns 3 documents per topic. topic_model = BERTopic(verbose=True,…

nlp text-classification bert-language-model topic-modeling

asked Oct 27 '21 at 14:52

Kaleem

votes

1 answer

why take the first hidden state for sequence classification (DistilBertForSequenceClassification) by HuggingFace

In the last few layers of sequence classification by HuggingFace, they took the first hidden state of the sequence length of the transformer output to be used for classification. hidden_state = distilbert_output[0] # (bs, seq_len, dim) <--…

time-series sequence tensorflow2.0 text-classification huggingface-transformers

asked Feb 06 '20 at 04:10

doe

votes

1 answer

How to use SHAP with a linear SVC model from sklearn using Pipeline?

I am doing text classification using a linear SVC model from sklearn. Now I want to visualize which words/tokens have the highest impact on the classification decision by using SHAP (https://github.com/slundberg/shap). Right now this does not work…

scikit-learn pipeline text-classification svc shap

asked Apr 26 '19 at 12:39

translater

votes

1 answer

How to split data (raw text) into test/train sets with scikit crossvalidation module?

I have a large corpus of opinions (2500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What could be the best approach to solve this task with scikit-learn? Could anybody provide me an example of spliting…

machine-learning scikit-learn classification cross-validation text-classification

asked Sep 11 '14 at 17:44

anon

votes

1 answer

Generating dictionaries to categorize tweets into pre-defined categories using NLTK

I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories (Education, Art, Sports, Business, Politics, Automobiles, Technology) based on their area of interest. I have extracted the last 100 tweets of the…

python machine-learning nlp nltk text-classification

asked Feb 23 '20 at 06:05

Nishant Agarwal

votes

1 answer

How can I get around Keras pad_sequences() rounding float values to zero?

So I have a text classification model built with Keras. I've been trying to pad my varying length sequences but the Keras function pad_sequences() has just returned zeros. I've figured out that if you have a numpy array like the one below, it works…

python numpy keras lstm text-classification

asked Jan 03 '19 at 23:21

th4t gi

votes

1 answer

Reloading Keras Tokenizer during Testing

I followed the tutorial here: (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) However, I modified the code to be able to save the generated model through h5py. Thus, after running the training script, I have a…

tensorflow keras tokenize text-classification word-embedding

asked Jun 26 '17 at 13:31

Vandenn

votes

1 answer

What is the difference between gensim LabeledSentence and TaggedDocument

Please help me in understanding the difference between how TaggedDocument and LabeledSentence of gensim works. My ultimate goal is Text Classification using Doc2Vec model and any classifier. I am following this blog! class…

gensim text-classification word2vec doc2vec

asked Dec 16 '16 at 10:33

Rashmi Singh

votes

1 answer

Vocabulary Processor function

I am researching embedding inputs for a Convolutional Neural Network and I understand Word2vec. However, in CNN text classification, dennybritz used the function learn.preprocessing.VocabularyProcessor. The documentation says it maps documents to…

python tensorflow text-classification

asked Oct 03 '16 at 05:24

ngoduyvu

votes

5 answers

Detecting random keyboard hits considering QWERTY keyboard layout

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout". Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh Is there…

algorithm n-gram qwerty text-classification

asked Sep 27 '10 at 08:41

Nicolas Raoul
