Why is the total in the confusion matrix not the same as the number of input samples?

Question

Why doesn't the confusion matrix total match the number of samples in the dataset? The dataset contains 7514 samples, but the total in the confusion matrix does not exceed 2000.

Here is the code:

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Build these once; recreating them for every row (or every word) is slow
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(len(dataset)):
  text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])  # keep letters only
  text = text.lower()
  text = text.split()
  text = [ps.stem(word) for word in text if word not in stop_words]
  text = ' '.join(text)
  corpus.append(text)

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

from sklearn import linear_model
classifier = linear_model.LogisticRegression(C=10)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix:\n",cm)

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,y_pred)
score2 = precision_score(y_test,y_pred)
score3= recall_score(y_test,y_pred)
print("\n")
print("Accuracy is ",round(score1*100,2),"%")
print("Precision is ",round(score2,2))
print("Recall is ",round(score3,2))

Welcome to SO :) Please add the full data preparation code to enable others to have a more comprehensive view of the code. — meti, Dec 16 '21 at 11:48
Please do not "package" unrelated questions into a single post (edited out); open more than one question if necessary — desertnaut, Dec 17 '21 at 01:04

score 1 · Accepted Answer · answered Dec 16 '21 at 13:43

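After you split the data with train_test_split(test_size = 0.3), the test portion holds 2255 samples, which is roughly 7514 × 0.3. The confusion matrix is computed on this test portion only (y_test against y_pred), so its entries sum to the size of the test set, not to the size of the full dataset. The remaining 5259 samples are in the training portion and never reach confusion_matrix.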
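The arithmetic can be checked with a small standard-library sketch. It is not the asker's pipeline: the labels are random stand-ins, the 2x2 matrix is built by hand instead of with confusion_matrix, and it assumes the fractional test size is rounded up, which matches the 2255 reported above.

```python
import math
import random
from collections import Counter

n_samples = 7514      # dataset size stated in the question
test_size = 0.3       # value passed to train_test_split in the question

# Assumption: a fractional test size is rounded up to a whole number of samples.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 5259 2255

# Synthetic stand-ins for y_test and y_pred (random binary labels).
rng = random.Random(42)
y_test = [rng.randint(0, 1) for _ in range(n_test)]
y_pred = [rng.randint(0, 1) for _ in range(n_test)]

# Hand-rolled 2x2 confusion matrix: rows are true labels, columns are predictions.
counts = Counter(zip(y_test, y_pred))
cm = [[counts[(t, p)] for p in (0, 1)] for t in (0, 1)]

# Every test sample lands in exactly one cell, so the cells sum to n_test.
cm_total = sum(sum(row) for row in cm)
print(cm_total == n_test)  # True
```

In the original code, the same check would be comparing cm.sum() with len(y_test); both should equal the test-set size rather than len(dataset).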

answered Dec 16 '21 at 13:43

meti

ouhh now i get it. thank you – mino Dec 16 '21 at 14:47
@mino please do not ask additional irrelevant questions in the comments; open a new question instead. – desertnaut Dec 17 '21 at 00:59
