Questions tagged [text-classification]

Put simply, text classification is the task of assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that occurs in many real-world applications. For example, an automated call centre might categorise incoming complaints automatically into the most appropriate bucket of problems.

Text classification is a sub-problem of the more general problem of classification. Here, the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

  • binary (binary classification)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
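
The three output types listed above can be illustrated with a small sketch. The category names and the `encode_*` helpers are hypothetical, chosen only for illustration; they do not come from any particular library.

```python
# Hypothetical category set for a call-centre complaint classifier (k = 4).
CATEGORIES = ["billing", "network", "hardware", "other"]


def encode_binary(is_complaint: bool) -> int:
    """Binary: a single 0/1 label."""
    return int(is_complaint)


def encode_multiclass(category: str) -> int:
    """Multi-class: exactly one category index out of k."""
    return CATEGORIES.index(category)


def encode_multilabel(categories: set) -> list:
    """Multi-label: a k-length 0/1 indicator vector with any number of 1s."""
    return [int(c in categories) for c in CATEGORIES]


binary_label = encode_binary(True)
multiclass_label = encode_multiclass("network")
multilabel_label = encode_multilabel({"billing", "hardware"})
```

The practical difference is in the shape of the target: a scalar for binary and multi-class problems, and a vector of independent 0/1 flags for multi-label problems.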

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
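
A minimal sketch of why text features tend to be sparse, assuming a plain bag-of-words representation built with the standard library only. The toy corpus and the helper names (`bag_of_words`, `to_sparse`) are made up for illustration.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "my internet connection keeps dropping",
    "i was charged twice on my bill",
    "the router you sent is broken",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc):
    """Dense count vector over the whole vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def to_sparse(vector):
    """Keep only the non-zero entries as {index: count}."""
    return {i: v for i, v in enumerate(vector) if v != 0}


dense = [bag_of_words(doc) for doc in docs]
sparse = [to_sparse(vec) for vec in dense]

# Fraction of zero entries in the document-term matrix.
total = len(docs) * len(vocab)
zeros = sum(v == 0 for vec in dense for v in vec)
sparsity = zeros / total
```

Each document uses only a handful of the words in the vocabulary, so most entries of its count vector are zero even in this tiny corpus. With a realistic vocabulary the fraction of zeros is much higher, which is why sparse storage is the usual choice.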

1694 questions
4
votes
2 answers

How Information Gain Works in Text Classification

I have to learn information gain for feature selection right now, but I don't have a clear understanding of it. I am a newbie, and I'm confused about it. How do I use IG in feature selection (manual calculation)? I just have clue this .. That have…
4
votes
1 answer

scikit-learn classification using doc2vec representation

I want to classify text documents using doc2vec representation and scikit-learn models. My problem is that I'm lost on how to get started. Can someone explain the general steps usually taken to use doc2vec with scikit-learn?
4
votes
1 answer

R: how to use random forests to predict binary outcome using string variables?

Consider the following dataframe outcome <- c(1,0,0,1,1) string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness') df = df=data.frame(outcome,string) > df outcome string 1 1 I love pasta 2 0 …
4
votes
3 answers

Text classification using e1071 (SVM)

I have a dataframe with two columns. One column contains text; each row of that column contains data from one of three classes (skill, qualification, experience), and the other column holds the respective class labels. Snapshot of the…
user2252882
4
votes
1 answer

Addressing synonyms in Supervised Learning for Text Classification

I am using scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use tf-idf, SVM classifier pipeline for creating the model. The solution works well for…
4
votes
3 answers

Setting up a MLP for binary classification with tensorflow

I am having trouble setting up a multilayer perceptron for binary classification using tensorflow. I have a very large dataset (about 1.5*10^6 examples), each with a binary (0/1) label and 100 features. What I need to do is to set up a simple…
4
votes
1 answer

Using Keras for text classification

I am struggling to approach the bag of words / vocabulary method for representing my input data as one hot vectors for my neural net model in keras. I would like to build a simple 3 layer network but I need help in understanding and developing an…
Moey Zf
4
votes
1 answer

Text2Vec classification with caret problems

Some context: Working with text classification and big sparse matrices in R I have been working on a text multi-class classification problem with the text2vec package and caret. The plan is to use text2vec for building the document-term matrix,…
Ed.
4
votes
1 answer

Issues using scikit for multi-label data

I'm using the following code for multi-label data classification: import numpy as np from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import CountVectorizer from sklearn.svm import LinearSVC from…
4
votes
1 answer

How to change data of a corpus to appropriate format for training with 'caret' package in R?

Q-1. How to change data of a corpus to appropriate format for training with 'caret' package? First of all, I would like to describe the environment for this question and show you where I am stuck. Environments This is corpus that is…
user5152421
4
votes
1 answer

Sklearn other inputs in addition to text for text classification

I am trying to build a text classifier using scikit-learn's bag of words vectorization fed into a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add a number of words in the…
4
votes
1 answer

TextClassification with TextBlob

I'm a complete newbie in Machine Learning, NLP, Data Analysis but I'm very motivated to understand it better. I'm reading a couple of books on NLTK, scikit-learn etc. I discovered a Python module "TextBlob" and found it to be super easy to get started…
4
votes
1 answer

"Combine" TF-IDF scores for single class of documents within corpus

Let's say I've calculated the TF-IDF scores for a corpus of documents, resulting in a matrix of TF-IDF features. If a subset of those documents are of a certain class, can I somehow "combine" the scores of that subset to get a single value for each…
Andrew LaPrise
4
votes
2 answers

Detect (predefined) topics in natural text

Is there a library or database out there that can detect the topics of natural text? I'm not talking about generating topics from extracted keywords, but about analysing the used vocabulary and matching it with predefined topics. Like searching for…
snøreven
4
votes
1 answer

How to correctly override and call super-method in Python

First, the problem at hand. I am writing a wrapper for a scikit-learn class, and am having problems with the right syntax. What I am trying to achieve is an override of the fit_transform function, which alters the input only slightly, and then calls…
Arne