
I'm trying to adapt the simple GLM example from the docs to use Tweedie:

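import logging

from py4j.protocol import Py4JJavaError
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression

def create_fake_losses_data(self):
    # Tiny synthetic losses dataset; self._spark is assumed to be an existing SparkSession.
    df = self._spark.createDataFrame(
        [
            ("a", 100.0, 12, 1, Vectors.dense(0.0, 0.0)),
            ("b", 0.0, 24, 1, Vectors.dense(1.0, 2.0)),
            ("c", 0.0, 36, 1, Vectors.dense(0.0, 0.0)),
            ("d", 2000.0, 48, 1, Vectors.dense(1.0, 1.0)),
        ],
        ["user_hashed", "label", "offset", "weight", "features"],
    )
    logging.info(df.collect())
    self.fake_data = df
    try:
        # Tweedie family with variance power 1.5; link and linkPower are left unset.
        glr = GeneralizedLinearRegression(
            family="tweedie", variancePower=1.5, offsetCol='offset')
        glr.setRegParam(0.3)
        model = glr.fit(df)
        logging.info(model)  # logs the string form of the fitted model
    except Py4JJavaError as e:
        print(e)
    return self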

This gives me the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o96.toString.
: java.util.NoSuchElementException: Failed to find a default value for link
        at org.apache.spark.ml.param.Params.$anonfun$getOrDefault$2(params.scala:756)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.ml.param.Params.getOrDefault(params.scala:756)
        at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:753)
        at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:41)
        at org.apache.spark.ml.param.Params.$(params.scala:762)
        at org.apache.spark.ml.param.Params.$$(params.scala:762)
        at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:41)
        at org.apache.spark.ml.regression.GeneralizedLinearRegressionModel.toString(GeneralizedLinearRegression.scala:1117)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

According to the docs, however, link should be left unset when using Tweedie, so I'm very confused here. Has anyone actually done a proper Tweedie regression using PySpark (or any version of Spark, really)? The docs also confuse me about the difference between variancePower and linkPower when using Tweedie. Which am I supposed to use? Which one is the p in a Tweedie distribution?

Evan Zamir
  • I'm running 3.1.2. – Evan Zamir Jan 25 '22 at 21:20
  • I think this is caused by the line `logging.info(model)`, which tries to call `model.toString`. It should work fine if you remove that part. – blackbishop Jan 25 '22 at 21:31
  • For the other questions, I'm not using Spark ML, but reading from the [docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html#generalizedlinearregression) "the `link` value for tweedie is specified using parameter `linkPower` and has default value `1-variancePower`". – blackbishop Jan 25 '22 at 21:35
  • @blackbishop I've never had any issues logging a model like this before. And specifically I have used the same code (with different data) for a GLM model with `poisson` as the family and it works fine. – Evan Zamir Jan 25 '22 at 21:38
  • You'll get the same error if you use the `poisson` family without explicitly specifying the `link` parameter (even though there is a default link for each family). Maybe the method that prints the model to a string uses the values specified via the parameters and fails when one is not specified (just guessing...). – blackbishop Jan 25 '22 at 21:47
  • Thanks. I filed a bug report because this really doesn't seem like it should be the expected behavior. I wonder how many people have run into this before. – Evan Zamir Jan 25 '22 at 21:49
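
On the `variancePower` versus `linkPower` question, here is a minimal plain-Python sketch of how the two relate, based on the doc sentence quoted in the comments above (the link power defaults to `1 - variancePower`). It does not call Spark; the function names `default_link_power`, `tweedie_variance`, and `power_link` are hypothetical helpers made up for illustration. As far as the Spark docs describe it, `variancePower` is the power in the Tweedie variance function (the usual p), while `linkPower` only selects the power link, with 0 meaning the log link.

```python
import math


def default_link_power(variance_power: float) -> float:
    """Default described in the Spark docs: linkPower = 1 - variancePower."""
    return 1.0 - variance_power


def tweedie_variance(mu: float, variance_power: float) -> float:
    """Tweedie variance function V(mu) = mu ** p (dispersion factor omitted)."""
    return mu ** variance_power


def power_link(mu: float, link_power: float) -> float:
    """Power link: log(mu) when the power is 0, otherwise mu ** link_power."""
    if link_power == 0.0:
        return math.log(mu)
    return mu ** link_power


p = 1.5                      # variancePower from the question (the Tweedie p)
q = default_link_power(p)    # link power used when linkPower is not set

print(q)                         # -0.5
print(tweedie_variance(4.0, p))  # 8.0
print(power_link(4.0, q))        # 0.5
print(power_link(4.0, 0.0))      # log link, equal to math.log(4.0)
```

So with `variancePower=1.5` and no `linkPower`, the sketch suggests the fit uses a power link with exponent -0.5. If a log link is wanted, as is common for loss data, setting `linkPower=0.0` explicitly would be the way to ask for it under this reading of the docs.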

0 Answers