First off, I am new to Databricks and PySpark, so I apologize if this has an easy solution I'm not seeing. My cluster is on Databricks Runtime 9.1 LTS (Spark 3.1.2, Scala 2.12).

I am working on an intro NLP problem doing restaurant review sentiment analysis. I have my pipeline built using various annotations followed by a logistic regression model. I am attempting to implement the CrossValidator object to tune my parameters.

When I attempt to use the CrossValidator, I get the following warning:

/databricks/spark/python/pyspark/ml/util.py:92: UserWarning: CrossValidator_0c70efdbf04c 
fit call failed but some spark jobs may still running for unfinished trials. 
To address this issue, you should enable pyspark pinned thread mode.

and the following error traced back to my fit() call:

IllegalArgumentException: requirement failed: Tensorflow model has not been initialized

My code for the CrossValidator is as follows:

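from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

# Combine the preprocessing pipeline with the logistic regression stage
pipe_added = Pipeline().setStages([pipe_sw_cstm, lr])

# 3-fold cross-validation over the logistic regression parameter grid
cv = CrossValidator(estimator=pipe_added,
                    estimatorParamMaps=lr_params,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    seed=31415)

cvModel = cv.fit(train)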

This is all supposed to be running within a loop and iterating over different initial pipelines, which are combined with the desired model into the variable pipe_added. This new composite pipeline is sent into the CrossValidator, along with my list of parameters for the desired model. I've stripped away most of the iterative code here in favor of a static version for debugging.
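For context, the surrounding loop can be sketched roughly as follows. This is only an illustration under assumptions: `fit_candidates`, `candidates`, and `build_and_fit` are hypothetical names, not part of my actual code. Failures are recorded per pipeline instead of aborting the whole run, which makes it easier to see which initial pipeline triggers the error.

```python
def fit_candidates(candidates, build_and_fit):
    """Fit every candidate pipeline and keep going when one of them fails.

    candidates    -- dict mapping a label to an initial (preprocessing) pipeline
    build_and_fit -- callable that combines one initial pipeline with the model,
                     fits it, and returns the fitted result
    """
    fitted, failures = {}, {}
    for name, base_pipeline in candidates.items():
        try:
            fitted[name] = build_and_fit(base_pipeline)
        except Exception as exc:  # record the failure instead of stopping the loop
            failures[name] = f"{type(exc).__name__}: {exc}"
    return fitted, failures
```

Here `build_and_fit` would wrap the Pipeline and CrossValidator code above, taking the place of the hard-coded `pipe_sw_cstm`.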

The logistic regression model fits the data correctly when I do not use the CrossValidator object.

An obvious first step is to enable pinned thread mode, so I tried setting the following in my cluster's environment variables:

PYSPARK_PIN_THREAD=true

but now I get a new error when running my code, still failing on the fit() call:

AttributeError: 'GatewayClient' object has no attribute 'thread_connection'

So it would seem that I should probably leave the pinned thread mode alone.
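To rule out the variable simply not reaching the driver, a check like the one below can be run in a notebook cell. It is a sketch that assumes the flag is interpreted as a case-insensitive "true"; `pinned_thread_requested` is a hypothetical helper name, not a PySpark function.

```python
import os


def pinned_thread_requested(environ=None):
    """Return True if PYSPARK_PIN_THREAD is set to "true" (case-insensitive).

    environ defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    return env.get("PYSPARK_PIN_THREAD", "false").lower() == "true"
```

Calling `pinned_thread_requested()` on the driver shows whether the cluster-level setting is visible to the Python process at all.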

I have tried importing tensorflow, as well as mlflow.tensorflow, without success. Any support would be greatly appreciated; debugging dependencies between libraries is already a weakness of mine, let alone on a new platform using a new main library.

EDIT 1: Using TrainValidationSplit raises the same warning and error.

1 Answer

Hi: I think you should pass the labelCol parameter to the BinaryClassificationEvaluator, either as labelCol=lr.getLabelCol() or as the name of the label column that you use.
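A minimal sketch of that idea, assuming `lr` is the LogisticRegression stage from the question. The helper name `evaluator_columns` is made up for illustration; it only reads the column names from the classifier so the evaluator can be told to use the same ones.

```python
def evaluator_columns(classifier):
    """Collect the column names the evaluator should read from the classifier.

    Works with any object exposing getLabelCol() and getRawPredictionCol(),
    which pyspark.ml.classification.LogisticRegression does.
    """
    return {
        "labelCol": classifier.getLabelCol(),
        "rawPredictionCol": classifier.getRawPredictionCol(),
    }
```

The evaluator in the question would then be built as `BinaryClassificationEvaluator(**evaluator_columns(lr))`. I have not verified that this resolves the Tensorflow error; it only makes sure the evaluator and the model agree on column names.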

Shaun Ramsey