Questions tagged [classification]

In machine learning and statistics, classification is the problem of identifying which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership (label) is known.

In machine learning and statistics, classification refers to the problem of predicting category memberships based on a set of pre-labeled examples. It is thus a type of supervised learning.

Some of the most important classification algorithms are support vector machines, logistic regression, naive Bayes, random forests, and artificial neural networks.

When we wish to associate inputs with continuous values in a supervised framework, the problem is instead known as regression. The unsupervised counterpart to classification is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity.
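As a rough illustration of the supervised setup described above (labeled examples in, a predicted category out), here is a minimal sketch of a nearest-centroid classifier in plain Python. The function names, the toy data, and the choice of nearest-centroid are assumptions made for this example; it is not one of the algorithms listed above, only a small stand-in that shows the fit/predict pattern.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """Average the training points of each class into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for point, label in zip(points, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
        sums[label] = [s + x for s, x in zip(sums[label], point)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest (Euclidean distance) to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Hypothetical toy training set: two features per observation, known labels.
train_points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
train_labels = ["small", "small", "large", "large"]

centroids = fit_centroids(train_points, train_labels)
print(predict(centroids, (2.0, 1.0)))
print(predict(centroids, (7.0, 9.0)))
```

Real classifiers such as logistic regression or random forests learn more flexible decision boundaries, but they follow the same pattern: fit on labeled data, then assign a category to a new observation.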

7859 questions
74 votes
9 answers

How to get most informative features for scikit-learn classifiers?

The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features: viagra = None ok : spam = 4.5 : 1.0 hello = True ok :…
tobigue
73 votes
2 answers

What is out of bag error in Random Forests?

What is out of bag error in Random Forests? Is it the optimal parameter for finding the right number of trees in a Random Forest?
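For context on the question above: each tree in a random forest is typically fit on a bootstrap sample (rows drawn with replacement), and the rows a given tree never saw are "out of bag" for that tree. The out-of-bag error is the error measured when each row is predicted only by trees that did not train on it. The sketch below only illustrates the sampling step, not a forest; `bootstrap_split` is a hypothetical helper name, and the sample size and seed are arbitrary.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Rows that were never drawn are out of bag for the tree fit on this sample.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 10_000
in_bag, out_of_bag = bootstrap_split(n_samples, rng)

# The expected out-of-bag share is (1 - 1/n)^n, which approaches 1/e (about 0.368).
oob_fraction = len(out_of_bag) / n_samples
print(round(oob_fraction, 3))
```

Because roughly a third of the rows are left out of each tree's sample, the forest gets a held-out style error estimate without a separate validation split.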
62 votes
4 answers

How to interpret weka classification?

How can we interpret the classification result in weka using naive bayes? How are the mean, std deviation, weight sum and precision calculated? How are the kappa statistic, mean absolute error, root mean squared error etc. calculated? What is the…
user349821
59 votes
7 answers

Loss function for class imbalanced binary classifier in TensorFlow

I am trying to apply deep learning for a binary classification problem with high class imbalance between target classes (500k, 31K). I want to write a custom loss function which should be like:…
58 votes
2 answers

Why does prediction need batch size in Keras?

In Keras, to predict the class of test data, predict_classes() is used. For example: classes = model.predict_classes(X_test, batch_size=32) My question is, I know the usage of batch_size in training, but why does it need a batch_size for…
malioboro
57 votes
2 answers

What is the difference between a sigmoid followed by the cross entropy and sigmoid_cross_entropy_with_logits in TensorFlow?

When trying to get cross-entropy with sigmoid activation function, there is a difference between loss1 = -tf.reduce_sum(p*tf.log(q), 1) loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1) But they are the…
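As background for the comparison in this question, the TensorFlow documentation describes sigmoid_cross_entropy_with_logits as computing the full binary cross-entropy, with both the p*log(q) and (1-p)*log(1-q) terms, in a numerically stable form that works directly on the logit. The sketch below reproduces that arithmetic in plain Python rather than calling TensorFlow; the function names and sample values are assumptions for illustration, and since the excerpt is truncated this may not be the exact point the question turns on.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def naive_bce(label, logit):
    """Binary cross-entropy computed from the probability, with both terms."""
    q = sigmoid(logit)
    return -(label * math.log(q) + (1.0 - label) * math.log(1.0 - q))


def stable_bce_from_logit(label, logit):
    """Same quantity rewritten as max(x, 0) - x*z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Hypothetical (label, logit) pairs chosen only for the comparison.
for label, logit in [(1.0, 2.0), (0.0, -1.5), (1.0, -3.0)]:
    print(naive_bce(label, logit), stable_bce_from_logit(label, logit))
```

Note that an expression containing only p*log(q) drops the (1-p)*log(1-q) term, so it will generally differ from the two-term binary cross-entropy whenever a label is not exactly 1.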
52 votes
19 answers

scikit learn output metrics.classification_report into CSV/tab-delimited format

I'm doing a multiclass text classification in Scikit-Learn. The dataset is being trained using the Multinomial Naive Bayes classifier having hundreds of labels. Here's an extract from the Scikit Learn script for fitting the MNB model from __future__…
Seun AJAO
52 votes
5 answers

XGBoost XGBClassifier Defaults in Python

I am attempting to use XGBoost's classifier to classify some binary data. When I do the simplest thing and just use the defaults (as follows) clf = xgb.XGBClassifier() metLearn=CalibratedClassifierCV(clf, method='isotonic', cv=2) metLearn.fit(train,…
Chris Arthur
49 votes
3 answers

Save Naive Bayes Trained Classifier in NLTK

I'm slightly confused in regard to how I save a trained classifier. As in, re-training a classifier each time I want to use it is obviously really bad and slow, so how do I save it and then load it again when I need it? Code is below, thanks in advance…
user179169
47 votes
2 answers

How is the feature score(/importance) in the XGBoost package calculated?

The command xgb.importance returns a graph of feature importance measured by an f score. What does this f score represent and how is it calculated? Output: Graph of feature importance
ishido
46 votes
6 answers

Options for deploying R models in production

There don't seem to be many options for deploying predictive models in production, which is surprising given the explosion in Big Data. I understand that the open-source PMML can be used to export models as an XML specification. This can then…
Cybernetic
45 votes
1 answer

Different decision tree algorithms with comparison of complexity or performance

I am doing research on data mining and, more precisely, decision trees. I would like to know if there are multiple algorithms to build decision trees (or just one?), and which is better, based on criteria such as: Performance, Complexity, Errors in…
45 votes
4 answers

What is the difference between back-propagation and feed-forward Neural Network?

What is the difference between back-propagation and feed-forward neural networks? By googling and reading, I found that in feed-forward there is only a forward direction, but in back-propagation we first need to do a forward-propagation and then…
USB
44 votes
8 answers

Error in Confusion Matrix : the data and reference factors must have the same number of levels

I've trained a Linear Regression model with R caret. I'm now trying to generate a confusion matrix and keep getting the following error: Error in confusionMatrix.default(pred, testing$Final) : the data and reference factors must have the same…
41 votes
1 answer

What is the difference between cross-entropy and log loss error?

What is the difference between cross-entropy and log loss error? The formulae for both seem to be very similar.
user3303020