Questions tagged [text-classification]

Put simply, text classification is about assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that arises in many real-world applications. For example, an automated call centre may want to sort incoming complaints automatically into the most appropriate bucket of problems.

Text classification is a special case of the more general classification problem. Here, the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

  • one of two categories (binary)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
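
As a rough illustration of these three output types, the sketch below encodes targets for the call-centre example. The category names and helper functions are hypothetical, made up for this illustration only.

```python
# Hypothetical category set, chosen only for illustration.
CATEGORIES = ["billing", "network", "hardware"]


def encode_multiclass(label):
    """Return the index of the single category that applies."""
    return CATEGORIES.index(label)


def encode_multilabel(labels):
    """Return a 0/1 indicator vector with one slot per category."""
    chosen = set(labels)
    return [1 if category in chosen else 0 for category in CATEGORIES]


# Binary: a single yes/no decision, e.g. "is this text a complaint?"
binary_target = 1

# Multi-class: exactly one of the k categories.
multiclass_target = encode_multiclass("network")

# Multi-label: any subset of the k categories.
multilabel_target = encode_multilabel(["billing", "hardware"])
```

The multi-label encoding keeps one slot per category, so a text that belongs to several buckets at once can still be represented by a single fixed-length target.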

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
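
To illustrate what sparse means here, the following minimal sketch builds bag-of-words count vectors in plain Python. The example complaints and the helper names are assumptions for illustration; a real pipeline would typically use a library vectorizer instead.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lower-cased token to a column index."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse_counts(document, vocab):
    """Return {column index: count}, storing only the non-zero entries."""
    counts = Counter(t for t in document.lower().split() if t in vocab)
    return {vocab[token]: n for token, n in counts.items()}


# Made-up example complaints.
docs = [
    "my internet connection keeps dropping",
    "the invoice charged me twice",
    "my router will not turn on",
]

vocab = build_vocabulary(docs)
vector = to_sparse_counts(docs[0], vocab)

# Fraction of vocabulary columns that are non-zero for this document.
density = len(vector) / len(vocab)
```

Each document touches only a few vocabulary columns, and the fraction shrinks further as the vocabulary grows, which is why such features are usually stored in sparse form.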

1694 questions
10
votes
3 answers

Text classification beyond the keyword dependency and inferring the actual meaning

I am trying to develop a text classifier that will classify a piece of text as Private or Public. Take medical or health information as an example domain. A typical classifier that I can think of considers keywords as the main distinguisher, right?…
Nuhil Mehdy
10
votes
3 answers

Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below. import numpy as np import pandas as pd from…
ongenz
10
votes
2 answers

How to do Text classification using word2vec

I want to perform text classification using word2vec. I got vectors of words. ls = [] sentences = lines.split(".") for i in sentences: ls.append(i.split()) model = Word2Vec(ls, min_count=1, size = 4) words =…
Shubham Agrawal
10
votes
0 answers

McNemar's test in Python and comparison of classification machine learning models

Is there a good McNemar's test implemented in Python? I don't see it anywhere in Scipy.stats or Scikit-Learn. I may have overlooked some other good packages. Please recommend. McNemar's Test is almost THE test for comparing two classification…
10
votes
3 answers

Which algorithms to use for one class classification?

I have over 15000 text docs on a specific topic. I would like to build a language model based on them so that I can present new random text documents on various topics to this model and the algorithm tells me if the new doc is of the same…
Adam Wayland
9
votes
2 answers

How to get all documents per topic in bertopic modeling

I have a dataset and am trying to convert it to topics using BERTopic modeling, but the problem is that I can't get all the documents of a topic. BERTopic only returns 3 documents per topic. topic_model = BERTopic(verbose=True,…
9
votes
1 answer

why take the first hidden state for sequence classification (DistilBertForSequenceClassification) by HuggingFace

In the last few layers of sequence classification by HuggingFace, they took the first hidden state of the sequence length of the transformer output to be used for classification. hidden_state = distilbert_output[0] # (bs, seq_len, dim) <--…
9
votes
1 answer

How to use SHAP with a linear SVC model from sklearn using Pipeline?

I am doing text classification using a linear SVC model from sklearn. Now I want to visualize which words/tokens have the highest impact on the classification decision by using SHAP (https://github.com/slundberg/shap). Right now this does not work…
translater
9
votes
1 answer

How to split data (raw text) into test/train sets with scikit crossvalidation module?

I have a large corpus of opinions (2500) in raw text. I would like to use the scikit-learn library to split them into test/train sets. What would be the best approach to solve this task with scikit-learn? Could anybody provide me an example of splitting…
8
votes
1 answer

Generating dictionaries to categorize tweets into pre-defined categories using NLTK

I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories (Education, Art, Sports, Business, Politics, Automobiles, Technology) based on their area of interest. I have extracted last 100 tweets of the…
8
votes
1 answer

How can I get around Keras pad_sequences() rounding float values to zero?

So I have a text classification model built with Keras. I've been trying to pad my varying length sequences but the Keras function pad_sequences() has just returned zeros. I've figured out that if you have a numpy array like the one below, it works…
th4t gi
8
votes
1 answer

Reloading Keras Tokenizer during Testing

I followed the tutorial here: (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) However, I modified the code to be able to save the generated model through h5py. Thus, after running the training script, I have a…
8
votes
1 answer

What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between how gensim's TaggedDocument and LabeledSentence work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog! class…
Rashmi Singh
8
votes
1 answer

Vocabulary Processor function

I am researching embedding inputs for convolutional neural networks, and I understand Word2vec. However, in his CNN text classification code, dennybritz used the function learn.preprocessing.VocabularyProcessor. The documentation says it Maps documents to…
ngoduyvu
8
votes
5 answers

Detecting random keyboard hits considering QWERTY keyboard layout

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout". Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh Is there…
Nicolas Raoul