Questions tagged [text-classification]

Simply put, text classification is the task of assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that arises in many real-world applications. For example, an automated call centre may want to categorise complaints automatically into the most appropriate bucket of problems.

Text classification is a sub-problem of the more general problem of classification. Here, the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

  • binary (binary classification)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
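
As a rough illustration of the three output types listed above, here is a minimal sketch of how the targets might be encoded. The category names and the helper functions are hypothetical and chosen only for this example; real libraries offer their own encoders.

```python
# Hypothetical label set, used only for illustration.
CATEGORIES = ["billing", "network", "hardware"]


def encode_binary(is_positive):
    """Binary classification: a single 0/1 target."""
    return 1 if is_positive else 0


def encode_multiclass(label):
    """Multi-class: exactly one category out of k, stored as its index."""
    return CATEGORIES.index(label)


def encode_multilabel(labels):
    """Multi-label: any subset of the k categories, stored as a 0/1 vector of length k."""
    chosen = set(labels)
    return [1 if category in chosen else 0 for category in CATEGORIES]


binary_target = encode_binary(True)
multiclass_target = encode_multiclass("network")
multilabel_target = encode_multilabel(["hardware", "billing"])
```

The practical difference is that a multi-class target is a single index, while a multi-label target has one independent 0/1 entry per category.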

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
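
To see why such features tend to be sparse, consider a minimal bag-of-words sketch in plain Python. The example sentences and function names are made up for illustration, and whitespace tokenisation is a simplifying assumption. Each document touches only a few entries of the vocabulary, so storing only the non-zero counts is natural.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lower-cased token to a column index."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse_counts(document, vocab):
    """Return {column index: count}, keeping only tokens present in the document."""
    counts = Counter(t for t in document.lower().split() if t in vocab)
    return {vocab[token]: n for token, n in counts.items()}


# Made-up complaints, in the spirit of the call-centre example.
docs = [
    "my internet connection keeps dropping",
    "i was charged twice on my bill",
    "the router will not power on",
]

vocab = build_vocabulary(docs)
sparse = to_sparse_counts(docs[0], vocab)

# Fraction of vocabulary columns that are non-zero for the first document.
density = len(sparse) / len(vocab)
```

With a realistic vocabulary of tens of thousands of tokens the non-zero fraction per document is far smaller, which is why sparse matrix formats are commonly used for text features.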

1694 questions
4
votes
4 answers

How to find outliers in document classification with million documents?

I have a million documents which belong to different classes (100 classes). I want to find outlier documents in each class (documents which don't belong to that class but were wrongly classified) and filter them. I can do document similarity using cosine…
4
votes
2 answers

How to do sequence classification with pytorch nn.Transformer?

I am doing a sequence classification task using nn.TransformerEncoder(), whose pipeline is similar to nn.LSTM(). I have tried several temporal feature fusion methods: selecting the final outputs as the representation of the whole sequence. Using…
4
votes
1 answer

Finetuning BERT on custom data

I want to train a 21-class text classification model using BERT. But I have very little training data, so I downloaded a similar dataset with 5 classes and 2 million samples, and fine-tuned on the downloaded data with the uncased pretrained model provided by…
4
votes
2 answers

How to represent ELMo embeddings as a 1D array?

I am using the language model ELMo - https://allennlp.org/elmo to represent my text data as a numerical vector. This vector will be used as training data for a simple sentiment analysis task. In this case the data is not in English, so I downloaded…
4
votes
1 answer

Cannot freeze TensorFlow models into a frozen (.pb) file

I am following (here) to freeze models into a .pb file. My model is a CNN for text classification; I am using the (GitHub) link to train the CNN and export the models. I have trained the models for 4 epochs and my checkpoints folders…
4
votes
1 answer

Python LightGBM text classification with TF-IDF

I'm trying to introduce LightGBM for text multiclassification. 2 columns in pandas dataframe, where 'category' and 'contents' are set as follows. Dataframe: contents category 1 this is example1... A 2 this is…
SY9
4
votes
1 answer

Accuracy below 50% for binary classification

I am training a Naive Bayes classifier on a balanced dataset with an equal number of positive and negative examples. At test time I am computing the accuracy in turn for the examples in the positive class, negative class, and the subsets which make up…
Crista23
4
votes
0 answers

How does TF-IDF handle missing values?

I am working on a classification problem in which I have to classify product category based on the information of the product like title, description and other attributes. It is working for different categories but getting biased in closed…
Sumit S Chawla
4
votes
1 answer

ValueError: Variable Embedding already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined

Based on this GitHub link https://github.com/brightmart/text_classification/tree/master/a03_TextRNN When I train a03_TextRNN with google_news_wor22vec.bin and a text file with my documents + labels, I get these errors: How can I solve this…
4
votes
1 answer

Text classification of a large dataset in Python

I have 2.2 million data samples to classify into more than 7500 categories. I am using pandas and scikit-learn in Python to do so. Below is a sample of my dataset itemid description category 11802974…
4
votes
2 answers

Make a prediction using mxnet CNN model

Hi, I'm a newbie to data science. I followed this tutorial https://mxnet.incubator.apache.org/tutorials/nlp/cnn.html but I am confused about how to make a single prediction using the trained model generated by that tutorial. Please…
4
votes
1 answer

Difference between TaggedDocument and TaggedLineDocument in gensim, and how to work with files in a directory?

I am new to doc2vec and I wish to classify a set of texts using it. I am confused about TaggedDocument and TaggedLineDocument. 1) What is the difference between the two? Is it that TaggedLineDocument is a collection of TaggedDocuments? 2) If I have a…
dfault
4
votes
1 answer

Specifying the # of hidden units in Facebook fasttext

In the paper on fasttext for supervised classification, the authors specified various quantities of hidden units by altering some parameter (h is the one on pages 3,4 - In table 1 you see "It has 10 hidden units and we evaluate it with and without…
Adam P.
4
votes
1 answer

Can I retrain an old model with new data using TensorFlow?

I am new to TensorFlow and I am just trying to see if my idea is even possible. I have trained a multi-class classifier. Now I can classify an input sentence, but I would like to change the result of the CNN, for example, to improve the…
4
votes
1 answer

MultinomialNB - Theory vs practice

OK so I'm just studying Andrew Ng's Machine Learning course. I'm currently reading this chapter and want to try the Multinomial Naive Bayes (bottom of page 12) for myself using SKLearn and Python. So Andrew proposes a method, in which each email in…
lte__