Questions tagged [text-classification]

Put simply, text classification means assigning a piece of text to one or more (mostly predefined) categories. It is an important problem that arises in many real-world applications. For example, an automated call centre might want to sort incoming complaints automatically into the most appropriate category of problem.

Text classification is a special case of the more general classification problem. Here the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

  • one of two categories (binary)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
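
As a rough illustration of the three output types above, the targets might be encoded as follows. This is only a sketch: the category names and the helper functions `encode_multiclass` and `encode_multilabel` are hypothetical and not taken from any particular library.

```python
# Hypothetical category names, for illustration only.
CATEGORIES = ["billing", "delivery", "technical"]


def encode_multiclass(label):
    """Return the index of the single category assigned to a text."""
    return CATEGORIES.index(label)


def encode_multilabel(labels):
    """Return a 0/1 indicator vector with one slot per category."""
    chosen = set(labels)
    unknown = chosen - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in chosen else 0 for category in CATEGORIES]


# Binary: a single yes/no decision, e.g. complaint (1) or not (0).
binary_target = 1

# Multi-class: exactly one of the k categories.
multiclass_target = encode_multiclass("delivery")

# Multi-label: any subset of the k categories.
multilabel_target = encode_multilabel(["billing", "technical"])
```

The only structural difference between the three cases is the shape of the target: a single bit, a single index, or one bit per category.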

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
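
To see why text features tend to be sparse, consider a minimal bag-of-words sketch in plain Python. The example texts and the helpers `build_vocabulary` and `to_sparse_vector` are made up for illustration; real projects would normally use a library vectorizer instead.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct lowercased token to a column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse_vector(text, vocab):
    """Return {column index: count}, storing only the non-zero entries."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    return {vocab[token]: count for token, count in counts.items()}


# Made-up example texts.
texts = [
    "my invoice is wrong",
    "the parcel arrived late",
    "the app crashes on login",
]

vocab = build_vocabulary(texts)
vector = to_sparse_vector(texts[0], vocab)

# Fraction of vocabulary columns that are non-zero for this one text.
density = len(vector) / len(vocab)
```

Each text uses only a handful of the words in the vocabulary, so most columns are zero. With a realistic vocabulary of tens of thousands of words, the fraction of non-zero entries per document becomes very small, which is why sparse representations are the usual choice.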

1694 questions
5
votes
2 answers

Naive Bayes in Quanteda vs caret: wildly different results

I'm trying to use the packages quanteda and caret together to classify text based on a trained sample. As a test run, I wanted to compare the built-in Naive Bayes classifier of quanteda with the ones in caret. However, I can't seem to get caret to…
JBGruber
5
votes
1 answer

How to resample text (imbalanced groups) in a pipeline?

I'm trying to do some text classification using MultinomialNB, but I'm running into problems because my data is unbalanced. (Below is some sample data for simplicity. In actuality, mine is much larger.) I'm trying to resample my data using…
5
votes
4 answers

How can a machine learning model handle unseen data and unseen label?

I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text data doesn't fit any label, it is tagged as 'Other'. In the below example, I built a text…
5
votes
4 answers

Create ML Text Classifier probabilities

I am creating a model with Create ML. I am using a JSON file. let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "poems.json")) let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 0) let classifier = try…
P S
5
votes
2 answers

LSTM Text Classification Bad Accuracy Keras

I'm going crazy with this project. This is multi-label text classification with an LSTM in Keras. My model is this: model = Sequential() model.add(Embedding(max_features, embeddings_dim, input_length=max_sent_len, mask_zero=True,…
5
votes
2 answers

SMOTE, Oversampling on text classification in Python

I am doing text classification and I have very imbalanced data, like this:

Category | Total Records
Cate1 | 950
Cate2 | 40
Cate3 | 10

Now I want to oversample Cate2 and Cate3 so that they each have at least 400-500 records. I prefer to use SMOTE over…
Vineet
5
votes
1 answer

How can I visualize border/decision function of two classes using scikit-learn

I am pretty new to machine learning, so I still don't understand how I can visualize the border between 2 classes in the bag-of-words case. I found the following example to plot data: plot a document tfidf 2D graph from sklearn.datasets import…
5
votes
2 answers

How do I build a model using GloVe word embeddings and predict on test data using text2vec in R

I am building a model that classifies text data into two categories (i.e. classifying each comment into one of 2 categories) using GloVe word embeddings. I have two columns, one with textual data (comments) and the other one is a binary Target…
5
votes
1 answer

How to predict desired class using Naive Bayes in Text Classification

I have been implementing a Multinomial Naive Bayes classifier from scratch for text classification in Python. I calculate the feature counts for each class and the probability distributions for the features. According to my implementation I get the…
5
votes
2 answers

Best machine learning approach to automate text/fuzzy matching

I'm reasonably new to machine learning; I've done a few projects in Python. I'm looking for advice on how to approach the problem below, which I believe could be automated. A user in a data quality team in my organisation has a daily task of taking a…
5
votes
2 answers

RNN for binary classification of sequence

I am wondering if someone can suggest a good library or reference (tutorial or article) for implementing a Recurrent Neural Network (RNN). I tried to use the rnnlib by Alex Graves, but I had some trouble changing the architecture to adapt the network…
5
votes
1 answer

Adding Special Case Idioms to Python Vader Sentiment

I've been using Vader Sentiment to do some text sentiment analysis and I noticed that my data has a lot of "way to go" phrases that were incorrectly being classified as neutral: In[11]: sentiment('way to go John') Out[11]: {'compound': 0.0, 'neg':…
Jason
5
votes
2 answers

Large classification document corpus

Can anyone point me to a large corpus that I can use for classification? By large I don't mean Reuters or 20 newsgroups; I'm talking about a corpus of GB size, not 20MB or something like that. I was able only to find this Reuters and 20…
Kobe-Wan Kenobi
5
votes
1 answer

How to use spark Naive Bayes classifier for text classification with IDF?

I want to convert text documents into feature vectors using tf-idf, and then train a Naive Bayes classifier on them. I can easily load my text files without the labels and use HashingTF() to convert them into vectors, and then use IDF() to…
5
votes
2 answers

SMOTE oversampling and cross-validation

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the…
kverr