Good afternoon.
I am solving a multi-label classification problem using LogisticRegression in PySpark. However, after I fit the model to the data, all elements of the model's coefficientMatrix are zeros.
I noticed that if I decrease the number of samples in the training set below some level, the model sometimes actually learns something and the coefficients are non-zero. It depends on the training subsample: some random seeds produce subsamples with non-zero coefficients, others with all-zero coefficients. I checked the input for NaNs and infs: everything is fine on that side.
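For reference, this is roughly the kind of check I ran for bad values (a minimal sketch, assuming the raw feature columns are plain numeric columns before assembling):
from pyspark.sql import functions as F
# Count NaN / null / inf occurrences per raw feature column
# (sketch only; assumes every feature column is numeric)
raw = sqlContext.table('data')
bad_counts = raw.select([
    F.sum(F.when(F.isnan(F.col(c)) | F.col(c).isNull() |
                 (F.abs(F.col(c)) == float('inf')), 1).otherwise(0)).alias(c)
    for c in raw.columns[:-1]
])
bad_counts.show()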
The data is sparse. I found a small subsample that produces zero coefficients and kept sampling from it to reduce the number of objects even further, so I could look more closely at the objects causing the problem. Eventually I got a small bad subsample of 16 elements. All objects but one had sparse feature vectors. When I threw away the single dense object, the coefficients became realistic again.
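Roughly, this is how I flagged the dense rows (a sketch; the is_dense UDF is my own helper, and 'features' is the VectorAssembler output column from the code below):
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
from pyspark.ml.linalg import DenseVector
# Mark rows whose assembled feature vector ended up dense rather than sparse;
# df here is the assembled DataFrame from the main code below
is_dense = F.udf(lambda v: isinstance(v, DenseVector), BooleanType())
df.withColumn('is_dense', is_dense('features')).groupBy('is_dense').count().show()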
Why is this behavior occurring? What should I do in such a situation?
I have about 90 labels in the target and 356 features. The data is sparse. An sklearn model on the same dataset fits well.
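For the record, the sklearn comparison is roughly this (a sketch; the pandas conversion and solver choice are mine, and the table is small enough to collect):
from sklearn.linear_model import LogisticRegression as SkLR
# Fit sklearn on the same table collected to pandas (sketch of the comparison)
pdf = sqlContext.table('data').toPandas()
X, y = pdf[pdf.columns[:-1]].values, pdf['label'].values
sk_model = SkLR(multi_class='multinomial', solver='lbfgs', max_iter=100).fit(X, y)
print(abs(sk_model.coef_).max())  # non-zero here, unlike the Spark coefficients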
I am using PySpark 2.1.0 and Python 3.5.3. Here is an example of my code:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
df = sqlContext.table('data')
assert df.columns[-1] == 'label'
assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol='features')
df = assembler.transform(df)
# frac is a float between 0 and 1
train, test = df.sample(withReplacement=False,
                        fraction=frac).randomSplit([0.75, 0.25])
lr = LogisticRegression(maxIter=100, standardization=False, family='auto')
model = lr.fit(train)
print(model.coefficientMatrix.toArray().sum(),
      model.coefficientMatrix.toArray().min(),
      model.coefficientMatrix.toArray().max())
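One extra check that might be worth printing alongside this (interceptVector is part of the same LogisticRegressionModel API in 2.1): if the intercepts are non-zero while the coefficient matrix is all zeros, the model has effectively collapsed to predicting the class priors.
# Per-class intercepts; non-zero intercepts together with an all-zero
# coefficient matrix mean the model is just reproducing class frequencies
print(model.interceptVector.toArray().min(),
      model.interceptVector.toArray().max())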