Good afternoon.
I am solving a multi-label classification problem using LogisticRegression in PySpark. However, after I fit the model to the data, all elements of the model's coefficientMatrix are zeros.
I noticed that if I decrease the number of samples in the training set below some level, the model sometimes actually learns something and the coefficients are non-zero. It depends on the training subsample: some random seeds produce subsamples with non-zero coefficients, others with all-zero coefficients. I checked the input for NaNs and infs: everything is fine on that side.
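For reference, this is roughly the kind of check I ran for bad values (a minimal sketch, assuming the raw feature columns are plain numeric columns before assembling):
from pyspark.sql import functions as F
# Count NaN / null / inf occurrences per raw feature column
# (sketch only; assumes every feature column is numeric)
raw = sqlContext.table('data')
bad_counts = raw.select([
    F.sum(F.when(F.isnan(F.col(c)) | F.col(c).isNull() |
                 (F.abs(F.col(c)) == float('inf')), 1).otherwise(0)).alias(c)
    for c in raw.columns[:-1]
])
bad_counts.show()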
The data is sparse. I found a small subsample that produces zero coefficients and kept sampling from it to reduce the number of objects even further, so I could look more closely at the objects causing the problem. Eventually I got a small bad subsample of 16 elements. All objects but one had sparse feature vectors. When I threw away the single dense object, the coefficients became realistic again.
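Roughly, this is how I flagged the dense rows (a sketch; the is_dense UDF is my own helper, and 'features' is the VectorAssembler output column from the code below):
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType
from pyspark.ml.linalg import DenseVector
# Mark rows whose assembled feature vector ended up dense rather than sparse;
# df here is the assembled DataFrame from the main code below
is_dense = F.udf(lambda v: isinstance(v, DenseVector), BooleanType())
df.withColumn('is_dense', is_dense('features')).groupBy('is_dense').count().show()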
Why is this behavior occurring? What should I do in such a situation?
I have about 90 labels in the target and 356 features. The data is sparse. An sklearn model on the same dataset fits well.
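For the record, the sklearn comparison is roughly this (a sketch; the pandas conversion and solver choice are mine, and the table is small enough to collect):
from sklearn.linear_model import LogisticRegression as SkLR
# Fit sklearn on the same table collected to pandas (sketch of the comparison)
pdf = sqlContext.table('data').toPandas()
X, y = pdf[pdf.columns[:-1]].values, pdf['label'].values
sk_model = SkLR(multi_class='multinomial', solver='lbfgs', max_iter=100).fit(X, y)
print(abs(sk_model.coef_).max())  # non-zero here, unlike the Spark coefficients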
I am using PySpark 2.1.0 and Python 3.5.3. Here is an example of my code:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
df = sqlContext.table('data')
assert df.columns[-1] == 'label'
assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol='features')
df = assembler.transform(df)
# frac is a float between 0 and 1
train, test = df.sample(withReplacement=False,
                        fraction=frac).randomSplit([0.75, 0.25])
lr = LogisticRegression(maxIter=100, standardization=False, family='auto')
model = lr.fit(train)
print(model.coefficientMatrix.toArray().sum(),
      model.coefficientMatrix.toArray().min(),
      model.coefficientMatrix.toArray().max())
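One extra check that might be worth printing alongside this (interceptVector is part of the same LogisticRegressionModel API in 2.1): if the intercepts are non-zero while the coefficient matrix is all zeros, the model has effectively collapsed to predicting the class priors.
# Per-class intercepts; non-zero intercepts together with an all-zero
# coefficient matrix mean the model is just reproducing class frequencies
print(model.interceptVector.toArray().min(),
      model.interceptVector.toArray().max())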