Questions tagged [text-classification]

Simply put, text classification means assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that arises in many real-world applications. For example, an automated call centre might want to sort incoming complaints automatically into the most appropriate problem category.

Text classification is a special case of the more general classification problem. Here the input is a piece of text (rather than images, sound, video, etc.). The output could be:

  • binary (binary classification)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
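
The three output types above can be sketched as target encodings. This is a minimal illustration only: the category names and the helper functions `one_hot` and `multi_hot` are hypothetical and chosen for the call-centre example, not taken from any particular library.

```python
# Hypothetical categories for a call-centre complaint classifier.
CATEGORIES = ["billing", "network", "device"]


def one_hot(label, categories):
    """Multi-class target: exactly one category out of k is active."""
    return [1 if category == label else 0 for category in categories]


def multi_hot(labels, categories):
    """Multi-label target: any subset of the k categories may be active."""
    return [1 if category in labels else 0 for category in categories]


# Binary: a single yes/no decision, e.g. "is this text a complaint?"
binary_target = 1

# Multi-class: the complaint belongs to exactly one category.
multi_class_target = one_hot("network", CATEGORIES)

# Multi-label: the complaint touches several categories at once.
multi_label_target = multi_hot({"billing", "device"}, CATEGORIES)
```

In the multi-class case exactly one position is set, whereas in the multi-label case any number of positions (including none) may be set.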

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
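
To see why text features tend to be sparse, here is a small sketch of a bag-of-words representation using only the Python standard library. The sample documents and the helper names `build_vocabulary` and `to_sparse` are assumptions made for illustration; real projects would more likely use a library vectorizer.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse(doc, vocab):
    """Return {column index: count}, storing only the non-zero entries."""
    counts = Counter(t for t in doc.lower().split() if t in vocab)
    return {vocab[token]: count for token, count in counts.items()}


# Made-up example documents.
docs = [
    "my bill is wrong",
    "the network is down again",
    "my phone screen is cracked",
]

vocab = build_vocabulary(docs)
vectors = [to_sparse(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    # Each document uses only a fraction of the vocabulary's columns.
    print(f"{doc!r}: {len(vec)} of {len(vocab)} features are non-zero")
```

Each document touches only a few of the vocabulary's columns, and with a realistic vocabulary of tens of thousands of words the fraction of non-zero entries becomes far smaller. This is why sparse storage is the usual choice for text features.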

1694 questions
0
votes
2 answers

Training classifier with large data

I was trying two-class text classification. I usually create pickle files of the trained model and load those pickles in the training phase to avoid retraining. When I had 12000 reviews plus more than 50000 tweets for each class, the training…
user123
0
votes
0 answers

Predicting the "no class" / unrecognised class in Weka Machine Learning

I am using Weka 3.7 to classify text documents based on their content. I have a set of text files in folders and they all belong to a certain category. Category A: 100 txt files, Category B: 100 txt files, ..., Category X: 100 txt files. I want to…
0
votes
1 answer

Save progress between multiple instances of partial_fit in Python SGDClassifier

I've successfully followed this example for my own text classification script. The problem is I'm not looking to process pieces of a huge, but existing data set in a loop of partial_fit calls, like they do in the example. I want to be able to add…
0
votes
1 answer

Load and save Weka Model using Java API?

I have my model on my hard drive at d:\MultiNomial.model. That model runs correctly from Weka. The model was built to classify text using StringToVector as a filter. I am using Java to load that model with the Weka API. This is my source…
Lylia John
0
votes
1 answer

Emotion Classification in Text Using R

I have an enormous data set of texts, from which I have separated the texts that contain particular keywords. Here is the data set with particular keywords. Now my next task is to classify this data set according to 8 emotions and 2 sentiments, in total…
user5462317
0
votes
1 answer

Scikit-learn: precision_recall_fscore_support returns strange results

I am doing some text mining/classification and am attempting to evaluate performance with the precision_recall_fscore_support function from the sklearn.metrics module. I am not sure how I can create a really small example reproducing the problem, but…
0
votes
1 answer

How to combine and feed different features to an algorithm for text classification

I've got some 120k text files and 12 categories into which I want to classify these documents. I'm using a simple bag-of-words model and feeding it to NaiveBayes. But I was told that using a mixture of features would "help" OR rather I should…
user4069366
0
votes
1 answer

R - Automatic categorization of Wikipedia articles

I have been trying to follow this example by Norbert Ryciak, whom I haven't been able to get in touch with. Since the article was written in 2014, some things in R have changed, so I have been able to update some of those things in the code, but I…
tomcontr
0
votes
1 answer

Compare documents by sequence vector

I'm trying to classify documents by sequence vector. Basically, I have a vocabulary (more than 5000 words). Each document is converted to a vector of integers so that each element in the vector corresponds to the position of the word in the…
lenhhoxung
0
votes
1 answer

Letter classificator inaccuracy

I am working on a university project to detect letters from a photo. I can successfully extract words from the photo and cut them into single letters, which are black on a white background. These pictures look quite clear. I have trained the SVC…
Ghostwriter
0
votes
1 answer

Determining the name of a company from a given text

I have a site in the stock market domain. The site has a lot of user-generated content in the form of forum posts, comments, etc. Also, I have a database table that consists of names of all companies (around 5000) listed in the stock…
milan m
0
votes
1 answer

R caret package (rpart)

I get the error below when using the rpart library: dt <- rpart(formula, method="class", data=full.df.allAttr.train); Error in model.frame.default(formula = formula, data = full.df.allAttr.train, : object is not a matrix. When I convert…
user2478236
0
votes
1 answer

Need help applying scikit-learn to this unbalanced text categorization task

I have a multi-class text classification/categorization problem. I have a set of ground truth data with K different mutually exclusive classes. This is an unbalanced problem in two respects. First, some classes are a lot more frequent than others.…
I Z
0
votes
1 answer

How to select best parameters for SVM linear kernel type

I am classifying two labels using libsvm, but I don't get good results with the default parameters for SVM kernel type = linear. Can anyone please tell me a way to find the best parameters for the linear SVM kernel type?
user5232014
0
votes
1 answer

Naive Bayes with Apache Spark MLlib

I'm using Naive Bayes with Apache Spark MLlib for text classification, following this tutorial: http://avulanov.blogspot.com/2014/08/text-classification-with-apache-spark.html /* instantiate Spark context (not needed for running inside Spark shell */ val sc…