Questions tagged [text-classification]

Simply put, text classification is the task of assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that arises in many real-world applications. For example, an automated call centre might want to sort incoming complaints automatically into the most appropriate category of problem.

Text classification is a sub-problem of the more general problem of classification. Here the input is a piece of text (rather than an image, a sound, a video, etc.). The output could be:

  • binary (binary classification)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
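
As a rough illustration of the three output shapes listed above, the sketch below uses plain Python values. The category names and the example text are made up for illustration and are not tied to any particular library.

```python
# Hypothetical categories for a call-centre complaint classifier (illustrative only).
CATEGORIES = ["billing", "network", "device"]

text = "My bill is wrong and my phone keeps dropping calls"

# Binary: a single yes/no decision, e.g. "is this a billing complaint?"
binary_label = True

# Multi-class: exactly one category out of the k possible categories.
multi_class_label = "billing"

# Multi-label: any subset of the k categories,
# often encoded as a 0/1 indicator vector over CATEGORIES.
multi_label = {"billing", "network"}
indicator = [int(c in multi_label) for c in CATEGORIES]
print(indicator)  # [1, 1, 0]
```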

In text classification, the features extracted from the text (for example bag-of-words or TF-IDF vectors) are usually sparse, rather than dense as in image classification.
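
To make the sparsity point concrete, here is a minimal bag-of-words sketch using only the Python standard library. The three-document corpus and the `bag_of_words` helper are hypothetical, chosen only to show that most entries of each document vector are zero.

```python
from collections import Counter

# Tiny made-up corpus (illustrative only).
docs = [
    "my bill is wrong",
    "the network is down again",
    "my phone will not charge",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]

# Count how many entries are zero across all document vectors.
total = len(vectors) * len(vocab)
zeros = sum(v.count(0) for v in vectors)
print(f"vocabulary size: {len(vocab)}, zero entries: {zeros}/{total}")
```

Even with three short documents, well over half of the entries are zero; with a realistic vocabulary of tens of thousands of words the proportion is far higher, which is why such features are normally stored in sparse matrix formats.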

1694 questions
5 votes, 2 answers

Get corresponding classes to predict_proba (GridSearchCV sklearn)

I'm using GridSearchCV and a pipeline to classify some text documents. A code snippet:
clf = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])
parameters = {'vect__ngram_range' : [(1,2)], 'vect__min_df' : [2], 'vect__stop_words' :…
Josefine
5 votes, 2 answers

Using Topic Model, how should we set up a "stop words" list?

There are some standard stop lists, giving words like "a the of not" to be removed from a corpus. However, I'm wondering, should the stop list change case by case? For example, I have 10K articles from a journal; then, because of the structure of an…
Ruby
5 votes, 2 answers

How do I transform text into TF-IDF format using Weka in Java?

Suppose I have the following sample ARFF file with two attributes: (1) sentiment: positive [1] or negative [-1]; (2) tweet: text
@relation sentiment_analysis
@attribute sentiment {1, -1}
@attribute tweet string
@data
-1,'is upset that he can\'t update…
5 votes, 1 answer

How to use pickled classifier with countVectorizer.fit_transform() for labeling data

I trained a classifier on a set of short documents and pickled it after getting reasonable f1 and accuracy scores for a binary classification task. While training, I reduced the number of features using a scikit-learn CountVectorizer cv: cv…
Gaurav Tuli
5 votes, 2 answers

Lexicon dictionary for synonym words

There are a few dictionaries available for natural language processing, such as dictionaries of positive and negative words. Is there any dictionary available which contains a list of synonyms for all dictionary words? For example, synonyms for nice: enjoyable,…
5 votes, 1 answer

Can you recommend a package in R that can be used to compute precision, recall and F1-score for multi-class classification tasks?

Is there any package that you would recommend which can be used to calculate the precision, F1 and recall for a multi-class classification task in R? I tried to use ROCR but it states that: ROCR currently supports only evaluation of binary…
tanay
4 votes, 1 answer

Fine-tuning a pretrained Spanish RoBERTa model for a different task, sentiment analysis

I'm doing sentiment analysis of Spanish tweets. After reviewing some of the recent literature, I've seen that there's been a recent effort to train a RoBERTa model exclusively on Spanish text (roberta-base-bne). It seems to perform better than…
4 votes, 1 answer

Resampling dataset for spam classification

I have a class imbalance problem with the following dataset:

Text                is_it_capital?  is_it_upper?  contains_num?  Label
an example of text  0               0             0              …
4 votes, 1 answer

ALBERT not converging - HuggingFace

I'm trying to apply a pretrained HuggingFace ALBERT transformer model to my own text classification task, but the loss is not decreasing beyond a certain point. Here's my code: There are four labels in my text classification dataset which are: 0, 1,…
4 votes, 2 answers

Spacy TextCat Score in MultiLabel Classification

In spaCy's train_textcat text classification example, two labels are specified, Positive and Negative. Hence the cats score is represented as cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels] I am working with…
4 votes, 1 answer

FastText 0.9.2 - why is recall 'nan'?

I trained a supervised model in FastText using the Python interface and I'm getting weird results for precision and recall. First, I trained a model: model = fasttext.train_supervised("train.txt", wordNgrams=3, epoch=100,…
4 votes, 0 answers

How to handle a text classification model that assigns a few results to the wrong category with high confidence?

I had a dataset of 15k records. I trained the model using the k-train package and a 'bert' model with 5k samples. The train-test split is 70-30% and the test results gave me accuracy and f1 scores of 93-94%. I felt the model was well trained, but on…
4 votes, 1 answer

Difference between blank and pretrained models in spacy

I am currently trying to train a text classifier using spacy and I got stuck on the following question: what is the difference between creating a blank model using spacy.blank('en') and using a pretrained model spacy.load('en_core_web_sm')? Just to…
Oleg Ivanytskyi
4 votes, 1 answer

How to make a prediction as binary output? - Python (Tensorflow)

I'm learning text classification using movie reviews as data with tensorflow, but I got stuck when I got an output prediction that differs from the label (not rounded, not binary). Code: predict = model.predict([test_review]) print("Prediction: " +…
Y4RD13
4 votes, 3 answers

Receiving, "An error was thrown and was not caught: The validation data provided must contain ..." when creating a Text Classifier Model with CreateML

I am using Playground to create a Text Classifier Model using CreateML and keep getting the error: Playground execution terminated: An error was thrown and was not caught: ▿ The validation data provided must contain class. ▿ type : 1 element -…
Jerry Rufe