
I want to start developing an application using machine learning. I want to classify text as spam or not spam. I have two files, spam.txt and ham.txt, each containing thousands of sentences. I want to use a classifier, let's say LogisticRegression.

For example, as I saw on the Internet, to fit my model I need to do something like this:

`lr = LogisticRegression()
model = lr.fit(X_train, y_train)`

So here comes my question: what actually are X_train and y_train? How can I obtain them from my sentences? I searched on the Internet but did not understand; this is my last resort, as I am pretty new to this topic. Thank you!

desertnaut
  • X_train is all the instances with their attributes; y_train is the label of each instance. Because your problem is a binary classification problem and you are using logistic regression, your y_train is either 0 or 1 (spam or not). – Heaven Jun 03 '18 at 01:36

1 Answer


According to the documentation (see here):

  • X corresponds to your float feature matrix of shape (n_samples, n_features) (aka. the design matrix of your training set)
  • y is the float target vector of shape (n_samples,) (the label vector). In your case, label 0 could correspond to a spam example, and 1 to a ham one
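To make the shapes concrete, here is a minimal sketch with made-up toy numbers (not your spam data): each row of `X_train` is one sample, each column one feature, and `y_train` holds one label per row.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy feature matrix: 4 samples, 2 features each -> shape (4, 2)
X_train = np.array([[0.1, 2.0],
                    [0.3, 1.5],
                    [2.2, 0.1],
                    [1.9, 0.4]])

# one label per sample -> shape (4,)
y_train = np.array([0, 0, 1, 1])

lr = LogisticRegression()
lr.fit(X_train, y_train)

# predict returns a class label (0 or 1) for each new sample
print(lr.predict([[0.2, 1.8]]))
```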

The question is now about how to get a float feature matrix from text data.

A common scheme is to use a tf-idf vectorisation (more on this here), which is available in sklearn.
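As a quick illustration of what the vectoriser does (toy sentences, not your files): it learns a vocabulary from the texts and turns each sentence into a sparse row of tf-idf weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["free money now", "meeting at noon", "free free offer"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_samples, n_features)

print(X.shape)                  # (3, size of the learned vocabulary)
print(vectorizer.vocabulary_)   # word -> column index mapping
```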

The vectorisation can be chained with the logistic regression via the Pipeline API of sklearn.

This is roughly how the code would look:

from itertools import chain

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

import numpy as np

# prepare string data
with open('spam.txt', 'r') as f:
    spam = f.readlines()

with open('ham.txt', 'r') as f:
    ham = f.readlines()

text_train = list(chain(spam, ham))

# prepare labels: 0 for spam, 1 for ham
labels_train = np.concatenate((np.zeros(len(spam)), np.ones(len(ham))))

# build pipeline
vectorizer = TfidfVectorizer()
regressor = LogisticRegression()

pipeline = Pipeline([('vectorizer', vectorizer), ('regressor', regressor)])

# fit pipeline
pipeline.fit(text_train, labels_train)

# test predict
test = ["Is this spam or ham?"]
pipeline.predict(test)  # returns the predicted class label, 0. or 1.
syltruong
  • Let's say I want to have 2 categories of text: test and train. I divide my data in two (80-20%, 70-30%, whatever), and I can obtain `text_test` the same way as `text_train`? I am talking about obtaining `X_test` and `y_test`. –  Jun 03 '18 at 06:00
  • 1
    Yes you can. The pipeline will have learnt the _idf_ values of the vocabulary words present in your train set, as well as weight and bias in the logistic regression. `X_test` can thus be fed to the pipeline's `predict` method, which output can be compared to `y_test`. – syltruong Jun 03 '18 at 07:28
  • syltruong I have some more questions, do you think you can help me via mail? –  Jun 05 '18 at 17:40
  • Holy cow! The explanation for the `X` and `Y` is hidden pretty well :-\ – t3chb0t Apr 17 '20 at 15:42
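The splitting discussed in the comments above can be sketched with sklearn's `train_test_split` (dummy stand-in sentences here instead of the real spam.txt/ham.txt contents):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# dummy stand-ins for the sentences read from spam.txt / ham.txt
texts = ["win free money now", "cheap pills offer",
         "meeting at noon", "see you tomorrow"] * 10
labels = [0, 0, 1, 1] * 10  # 0 = spam, 1 = ham, as in the answer

# hold out 20% of the data for testing; stratify keeps the class balance
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

pipeline = Pipeline([('vectorizer', TfidfVectorizer()),
                     ('regressor', LogisticRegression())])

# fit on the train split only, then evaluate on the held-out split
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # accuracy on X_test vs y_test
```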