
I have a question about a penalized regression model with Lasso and how to interpret the values it returns. I have text content and want to find the most predictive words for each class.

Code and Data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

# Import test data
data = pd.read_csv('https://pastebin.com/raw/rXr4kd8S')

# Make ngrams
vectorizer = CountVectorizer(min_df=0.00, max_df=1.0, max_features=1000, stop_words='english', binary=True, lowercase=True, ngram_range=(1, 1))
grams = vectorizer.fit_transform(data['text'])

# Show features (words)
vectorizer.get_feature_names()

# Show Lasso coefficients
def lassoRegression(para1, para2):
    lasso = Lasso(alpha = 0, fit_intercept=True, normalize=True, max_iter=1000)
    lasso.fit(para1, para2)
    return lasso.coef_

model_lasso = lassoRegression(grams, data['label'])

# Sort coefficients
lasso_coef = pd.DataFrame(np.round_(model_lasso, decimals=2), vectorizer.get_feature_names(), columns = ["penalized_regression_coefficients"])
lasso_coef = lasso_coef[lasso_coef['penalized_regression_coefficients'] != 0]
lasso_coef = lasso_coef.sort_values(by = 'penalized_regression_coefficients', ascending = False)
lasso_coef

# Top/Low 10 values
lasso_coef = pd.concat([lasso_coef.head(10),lasso_coef.tail(10)], axis=0)

# Plot
ax = sns.barplot(x = 'penalized_regression_coefficients', y= lasso_coef.index , data=lasso_coef)
ax.set(xlabel='Penalized Regression Coeff.')
plt.show()

Changing alpha causes the following problems:

Out: For Lasso(alpha = 0, ...)

ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.

          penalized_regression_coefficients
data                                   0.62
awesome                                0.33
content                                0.31
performs                               0.05
enter                                  0.02
great                                 -0.01

Out: For Lasso(alpha = 0.001, ...)

penalized_regression_coefficients
great   -0.93

Out: For Lasso(alpha = 1, ...)

penalized_regression_coefficients
empty

Questions:

  • alpha = 0 raises a ConvergenceWarning (but still returns values), and any other alpha setting returns almost nothing. Considering the input data, even after stopword removal, I would have expected more words with corresponding positive and negative weights. Is something wrong here? Note that the input data intentionally contains repetitive elements, as I hoped to test the reliability of the model that way.
  • How do I interpret the values correctly? What does data=0.62 mean?
  • Do I assume correctly that all negative values are predictors for label "0" and all positive values are predictors for label "1"?
Christopher
  • [similar question](https://stats.stackexchange.com/questions/319861/how-to-interpret-lasso-shrinking-all-coefficients-to-0)? Also, it looks like you have the code down, so you may want to ask on [Cross Validated](https://stats.stackexchange.com/), which is the stats Stack Exchange site – lwileczek May 29 '18 at 14:48
  • Also, general lasso regression is going to be a variation of OLS for predicting a continuous response variable. You import Logistic Regression but don't seem to use it – lwileczek May 29 '18 at 14:54
  • Could you show a sample of the data? Perhaps print the head of the dataframe – lwileczek May 29 '18 at 14:55
  • Thanks for your feedback, I adjusted the code. If you go to https://pastebin.com/raw/rXr4kd8S, you can see the data. The script should be reproducible, given libs are installed. – Christopher May 29 '18 at 14:58
  • Data: https://pastebin.com/raw/rXr4kd8S – Christopher May 29 '18 at 15:35

2 Answers


alpha refers to the penalty strength in the elastic net family of models; depending on the library it is called either lambda or alpha. alpha=0 is equivalent to ordinary least squares. Lasso regression applies an L1 penalty that forces coefficients toward 0. The smaller a coefficient, the less important the variable is, or the less variance it explains. The actual value matters less if the coefficient ends up in a logistic regression, because there it is used inside an exponential. So your last assumption is pretty much correct: if the coefficient is positive, each occurrence of that word indicates a higher probability of label 1.
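To make the effect of alpha concrete, here is a minimal sketch on synthetic data (the random binary matrix, the true weights 0.6 and -0.3, and the alpha values are all assumptions chosen for illustration, not the data from the question). It compares plain OLS with a Lasso fit at a very small alpha and at a larger alpha.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic stand-in for a binary bag-of-words matrix (assumption, not the asker's data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
# Only the first two "words" influence the response
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

ols_coef = LinearRegression().fit(X, y).coef_
# A tiny alpha should give coefficients close to OLS
weak_coef = Lasso(alpha=1e-4, max_iter=10000).fit(X, y).coef_
# A larger alpha shrinks coefficients toward zero and may zero some out
strong_coef = Lasso(alpha=0.05, max_iter=10000).fit(X, y).coef_

print("OLS          :", np.round(ols_coef, 3))
print("Lasso 1e-4   :", np.round(weak_coef, 3))
print("Lasso 0.05   :", np.round(strong_coef, 3))
```

Using a small positive alpha instead of exactly 0 avoids the convergence warning while staying close to the OLS solution.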

As for why your Lasso regression does not converge, you can read here.

I suggest reading up on the methods more before using them. This course talks a lot about statistics and explains why and when to use Lasso regression. If you are familiar with OLS, then you can understand the interpretation of the coefficients: if all your other variables are held constant, for each increase of 1 unit in the variable data you can expect the response variable Y to increase by 0.62 on average. But as I said previously, this will lead to a percentage change when used in the logistic equation.
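As a small illustration of that OLS reading, the sketch below uses a single made-up binary feature (word present or absent) and a 0/1 label; the probabilities 0.8 and 0.2 are assumptions. With one binary predictor and an intercept, the fitted coefficient is exactly the difference in the average label between documents with and without the word.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical indicator: 1 if the word occurs in the document, else 0
has_word = rng.integers(0, 2, size=100)
# Label is 1 with probability 0.8 when the word is present, 0.2 otherwise (assumed)
label = (rng.random(100) < np.where(has_word == 1, 0.8, 0.2)).astype(float)

model = LinearRegression().fit(has_word.reshape(-1, 1), label)
coef = model.coef_[0]

# Difference in mean label between the two groups
mean_diff = label[has_word == 1].mean() - label[has_word == 0].mean()
print(coef, mean_diff)
```

So in a linear model on a 0/1 label, a coefficient is a change in the expected label, which only loosely maps to a probability; with several correlated words or with a penalty, the numbers will no longer match simple group means.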

Please see Cross Validated for more help on statistics.

lwileczek
  • `alpha = 0` is equivalent to ordinary least squares. Alpha is the coefficient of the L1 norm term – Sahil Puri May 29 '18 at 15:09
  • But how do I interpret the coefficients? Are they the same as "beta coefficients" in R glmnet? Does data = 0.62 mean that the presence of the word "data" in a document increases chances by 62% to be class "1"? – Christopher May 29 '18 at 15:28

Okay, so a few things here.

I see that you import logistic regression but do not use it in your script. You might want to think about using linear vs. logistic regression.
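For comparison, here is a minimal sketch of the logistic alternative with an L1 penalty, on a tiny made-up corpus (the texts, the labels, and C=1.0 are assumptions for illustration only). The coefficients are on the log-odds scale, so their sign tells you which class a word points to.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (hypothetical): "awesome" marks label 1, "terrible" marks label 0
texts = ["awesome content", "awesome data", "awesome results",
         "terrible content", "terrible data", "terrible results"] * 10
labels = [1, 1, 1, 0, 0, 0] * 10

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(texts)

# L1-penalized logistic regression; liblinear supports the L1 penalty
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

# Map each word to its coefficient via the fitted vocabulary
coef_by_word = {word: clf.coef_[0][idx]
                for word, idx in vectorizer.vocabulary_.items()}
for word, coef in sorted(coef_by_word.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {coef:+.2f}")
```

Note that in `LogisticRegression` the parameter `C` is the inverse of the regularization strength, so a smaller `C` plays the role of a larger alpha.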

The warning is trying to tell you that close to alpha=0 the Lasso regression results are not reliable. Why is this the case? If you follow the code for the Lasso, you'll eventually reach https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/cd_fast.pyx, line 516, where a float comparison is going on.

What does it mean when your alpha goes slowly towards 0? It means that your regression becomes similar to an OLS regression. If your coefficients disappear quickly as alpha increases, it implies that your features are very weak in explaining the results.
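To put a number on how quickly coefficients disappear, the sketch below computes the smallest alpha at which an unnormalized scikit-learn Lasso fit returns all zeros, using synthetic binary data (an assumption; the `normalize=True` setting from the question rescales the features and is not reproduced here). For 0/1 features and 0/1 labels this threshold cannot exceed 0.25, so an empty result at alpha = 1 is expected in this setting.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = (X[:, 0] > 0).astype(float)  # label equals presence of "word 0" (assumed)

# With an intercept, the fit works on centered data
Xc = X - X.mean(axis=0)
yc = y - y.mean()
# Smallest alpha for which every coefficient is zero
alpha_max = np.abs(Xc.T @ yc).max() / len(y)

above = Lasso(alpha=1.01 * alpha_max).fit(X, y).coef_
below = Lasso(alpha=0.5 * alpha_max).fit(X, y).coef_
print("alpha_max:", alpha_max)
print("just above:", above)
print("half of it:", below)
```

Searching alpha on a grid below this threshold is usually more informative than trying values such as 0, 0.001 and 1.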

Your TODO list:

  1. Try both OLS and logistic regression to see which one is more appropriate.
  2. Look at the t-statistics and see if any result is significant.
  3. If nothing is significant, then maybe look at how you set up the regression; there might be a bug in the code.
  4. If any of the concepts are unclear, go to the course mentioned by @lwileczek.
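For item 2, scikit-learn's `LinearRegression` does not report t-statistics, so one option is to compute them by hand for an unpenalized OLS fit. This is a minimal sketch with NumPy on synthetic data (the data, the true weight 0.8, and the noise level are assumptions); it uses the textbook OLS formulas and does not apply to the penalized Lasso coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 2, size=(n, 3)).astype(float)
y = 0.8 * X[:, 0] + rng.normal(0.0, 0.3, size=n)  # only feature 0 matters

A = np.column_stack([np.ones(n), X])           # add an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS estimate
resid = y - A @ beta
dof = n - A.shape[1]                           # residual degrees of freedom
sigma2 = resid @ resid / dof                   # estimated noise variance
cov = sigma2 * np.linalg.inv(A.T @ A)          # covariance of the estimates
t_stats = beta / np.sqrt(np.diag(cov))

print("coefficients:", np.round(beta, 3))
print("t-statistics:", np.round(t_stats, 2))
```

As a rough rule of thumb, an absolute t-statistic well above 2 suggests the coefficient differs from zero; index 0 here is the intercept and index 1 is the informative feature.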

Sahil Puri
  • Hi Sahil, thanks for your comments. It is very useful. Question: Were you able to have a look at the data/code (I made it reproducible)? Given the data structure, I can't explain the model behavior, still. Given we have in 90% of cases one repeating sentence for "1" and one repeating sentence for "0", I would expect the coefficients to be much stronger? "Look at the t-statistics and see if any result is significant" -> can I do this in the code above? – Christopher May 29 '18 at 15:26
  • I can't open the link for some reason. I'm sure others might have this issue as well. Could you print the head of the DataFrame before the lasso regression ? – Sahil Puri May 29 '18 at 15:28
  • You can do OLS via `from sklearn.linear_model import LinearRegression` – Sahil Puri May 29 '18 at 15:33