
In scikit-learn, when I do X, Y = make_moons(500, noise=0.2) and then print X and Y, they look like arrays with a bunch of entries but with no commas. I have data that I want to use instead of the scikit-learn moons dataset, but I don't understand what data type these scikit-learn datasets are and how I can make my data follow the same format.

jbg05

1 Answer


The first element, X, is a 2D array:

array([[-6.72300890e-01,  7.40277997e-01],
        [ 9.60230259e-02,  9.95379113e-01],
        [ 3.20515776e-02,  9.99486216e-01],
        [ 8.71318704e-01,  4.90717552e-01],
        ....
        [ 1.61911895e-01, -4.55349012e-02]])

It contains the x- and y-coordinates of the points.

The second element of the tuple, y, is a 1D array that contains the labels (0 or 1 for binary classification):

array([0, 0, 0, 0, 1, ... ])
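Both of these are ordinary NumPy ndarray objects; the lack of commas is just how NumPy prints arrays. As a minimal sketch (the variable names and values below are placeholders, not from make_moons), you can put your own data into the same layout like this:

import numpy as np

# Your own features as a list of [x, y] pairs (placeholder values)
my_features = [[0.1, 0.9],
               [1.2, -0.3],
               [0.5, 0.4]]

# Your own labels, one integer per sample (placeholder values)
my_labels = [0, 1, 0]

# Convert to NumPy arrays with the same layout as the make_moons output
X = np.asarray(my_features, dtype=float)   # shape (n_samples, n_features)
y = np.asarray(my_labels, dtype=int)       # shape (n_samples,)

print(type(X), X.shape)  # <class 'numpy.ndarray'> (3, 2)
print(type(y), y.shape)  # <class 'numpy.ndarray'> (3,)

Any data in that form (a 2D feature array and a matching 1D label array) can be passed to scikit-learn estimators in place of the moons dataset.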

To use this data in a simple classification task, you could do the following:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Create dataset
X, y = make_moons(500, noise=0.2)

# Split dataset in a train part and a test part
train_X, test_X, train_y, test_y = train_test_split(X, y)

# Create the Logistic Regression classifier
log_reg = LogisticRegression()

# Fit the logistic regression classifier
log_reg.fit(train_X, train_y)

# Use the trained model to predict on the train and test samples
train_y_pred = log_reg.predict(train_X)
test_y_pred = log_reg.predict(test_X)

# Print classification report on the training data
print(classification_report(train_y, train_y_pred))

# Print classification report on the test data
print(classification_report(test_y, test_y_pred))

The results are:

On training data

              precision    recall  f1-score   support

           0       0.88      0.87      0.88       193
           1       0.86      0.88      0.87       182

    accuracy                           0.87       375
   macro avg       0.87      0.87      0.87       375
weighted avg       0.87      0.87      0.87       375

On test data

              precision    recall  f1-score   support

           0       0.81      0.89      0.85        57
           1       0.90      0.82      0.86        68

    accuracy                           0.86       125
   macro avg       0.86      0.86      0.86       125
weighted avg       0.86      0.86      0.86       125

As we can see, the f1-score is not very different between the train and the test sets, so the model is not overfitting.
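If your own data lives in a file rather than being generated by make_moons, only the first step changes; the rest of the pipeline stays the same. A hedged sketch, assuming a CSV file named my_data.csv with hypothetical columns feat1, feat2 and label (adjust the names to your actual file):

import pandas as pd

# Load your own dataset (hypothetical file and column names)
df = pd.read_csv("my_data.csv")

X = df[["feat1", "feat2"]].to_numpy()   # feature matrix, shape (n_samples, 2)
y = df["label"].to_numpy()              # label vector, shape (n_samples,)

# X and y can now be passed to train_test_split and LogisticRegression
# exactly as in the example above.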

Benjamin Breton