I have a question about a Lasso-penalized regression model and how to interpret the values it returns. I have text content and want to find the most predictive words for each class.
Code and Data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso
# Import test data
data = pd.read_csv('https://pastebin.com/raw/rXr4kd8S')
# Make ngrams
vectorizer = CountVectorizer(min_df=0.00, max_df=1.0, max_features=1000, stop_words='english', binary=True, lowercase=True, ngram_range=(1, 1))
grams = vectorizer.fit_transform(data['text'])
# Show features (words)
vectorizer.get_feature_names()
# Show Lasso coefficients
def lassoRegression(para1, para2):
    lasso = Lasso(alpha = 0, fit_intercept=True, normalize=True, max_iter=1000)
    lasso.fit(para1, para2)
    return lasso.coef_
model_lasso = lassoRegression(grams, data['label'])
# Sort coefficients
lasso_coef = pd.DataFrame(np.round_(model_lasso, decimals=2), vectorizer.get_feature_names(), columns = ["penalized_regression_coefficients"])
lasso_coef = lasso_coef[lasso_coef['penalized_regression_coefficients'] != 0]
lasso_coef = lasso_coef.sort_values(by = 'penalized_regression_coefficients', ascending = False)
lasso_coef
# Top/Low 10 values
lasso_coef = pd.concat([lasso_coef.head(10),lasso_coef.tail(10)], axis=0)
# Plot
ax = sns.barplot(x = 'penalized_regression_coefficients', y= lasso_coef.index , data=lasso_coef)
ax.set(xlabel='Penalized Regression Coeff.')
plt.show()
Changing alpha causes the following problems:
Out: For Lasso(alpha = 0, ...)
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
penalized_regression_coefficients
data 0.62
awesome 0.33
content 0.31
performs 0.05
enter 0.02
great -0.01
Out: For Lasso(alpha = 0.001, ...)
penalized_regression_coefficients
great -0.93
Out: For Lasso(alpha = 1, ...)
penalized_regression_coefficients
empty
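One way to see how quickly the penalty empties the model is to count the non-zero coefficients over a grid of alpha values. Below is a minimal sketch on synthetic binary features. The data, the helper `count_nonzero_by_alpha`, and the alpha grid are all made up for illustration, and `normalize` is left out, so the scale of alpha differs from the runs above.

```python
import numpy as np
from sklearn.linear_model import Lasso


def count_nonzero_by_alpha(X, y, alphas):
    """Return {alpha: number of non-zero Lasso coefficients}."""
    counts = {}
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X, y)
        counts[alpha] = int(np.count_nonzero(model.coef_))
    return counts


# Synthetic stand-in for binary word-presence features (not the pastebin data).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 20)).astype(float)
# The label is 1 when "word 0" is present and "word 1" is absent.
y_toy = ((X_toy[:, 0] == 1) & (X_toy[:, 1] == 0)).astype(float)

# alpha = 0 is skipped: the scikit-learn docs advise against it for Lasso
# and point to LinearRegression for the unpenalized case.
counts = count_nonzero_by_alpha(X_toy, y_toy, [0.001, 0.01, 0.1, 1.0])
for alpha, n_nonzero in counts.items():
    print(f"alpha={alpha}: {n_nonzero} non-zero coefficients")
```

In this unnormalized setup, scikit-learn minimizes `(1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1`, and every coefficient is zero once alpha is at least the largest absolute covariance between a feature and the label. For 0/1 features and a 0/1 label that covariance cannot exceed 0.25, so alpha = 1 yields an empty model whatever the data looks like.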
Questions:
- alpha = 0 triggers a ConvergenceWarning (but still returns values), and any other alpha setting returns almost nothing. Considering the input data, even after stopword removal, I would have expected more words with corresponding positive and negative weights. Is something wrong here? Note that the input data intentionally contains repetitive elements, as I hoped to test the reliability of the model that way.
- How do I interpret the values correctly? What does data=0.62 mean?
- Am I correct in assuming that all negative values are predictors for label "0" and all positive values are predictors for label "1"?