I have a PySpark dataframe df
(type(df)
is pyspark.sql.dataframe.DataFrame
) and it has 4 columns. I'm trying to find the number of rows it has using df.count()
, but I keep getting the error messages below.
WARN PythonRunner: Detected deadlock while completing task 24.0 in stage 4 (TID 28): Attempting to kill Python Worker
...
ERROR Executor: Exception in task 24.0 in stage 4.0 (TID 28)
...
ValueError: Shape of passed values is (4,1), indices imply (4,4)
I read that this ValueError usually means the data being passed has shape (4, 1) while the column labels imply a shape of (4, 4), but the examples I found for resolving it are for pandas DataFrames. I'm not sure how to resolve this for a PySpark DataFrame.
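For context, the same ValueError can be reproduced in plain pandas by passing a single column of values while supplying four column labels (a minimal sketch, not my actual data):

import numpy as np
import pandas as pd

# 4 values in one column, but 4 column labels ->
# ValueError: Shape of passed values is (4, 1), indices imply (4, 4)
pd.DataFrame(np.zeros((4, 1)), columns=['feature1', 'feature2', 'feature3', 'feature4'])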
Also, should I be concerned about the deadlock
warning? Is it related to the ValueError
?
ETA: Added the code that I have before calling df.count()
. Basically, I'm trying to calculate SHAP values for my model based on the code in this article.
from typing import Iterator

import numpy as np
import pandas as pd
import shap
from pyspark.sql.types import FloatType, StructField, StructType

explainer = shap.TreeExplainer(model)
shap_columns = ['feature1', 'feature2', 'feature3', 'feature4']

def calculate_shap(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # each X is one pandas DataFrame batch of rows from spark_X
    for X in iterator:
        yield pd.DataFrame(
            explainer.shap_values(np.array(X), check_additivity=False)[0],
            columns=shap_columns,
        )

# output schema: one FloatType column per feature
return_schema = StructType()
for feature in shap_columns:
    return_schema = return_schema.add(StructField(feature, FloatType()))

df = spark_X.mapInPandas(calculate_shap, schema=return_schema)
df.count()
Both df
and spark_X
are of type pyspark.sql.dataframe.DataFrame
. df.printSchema()
showed the 4 columns correctly.
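For reference, a quick way to see what explainer.shap_values returns for one batch is to run it locally on a small pandas sample (a sketch using random data with the same four feature columns, not my actual data):

import numpy as np
import pandas as pd

# small fake batch, only used to inspect the shape of the returned SHAP values
sample = pd.DataFrame(np.random.rand(5, 4), columns=shap_columns)
values = explainer.shap_values(np.array(sample), check_additivity=False)
print(type(values))      # list of arrays (one per output) or a single ndarray, depending on the model
print(np.shape(values))  # shape of whatever shap_values returned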
ETA2: Thanks to the suggestion by @samkart, I wrapped explainer.shap_values(np.array(X), check_additivity=False)[0] in an outer list, i.e. [explainer.shap_values(np.array(X), check_additivity=False)[0]], and there is no more error. But the number of rows returned is only 18K, while I'm expecting 180M. Why is this so?