Kedro 0.16.3 and kedro[spark.SparkDataSet] pip libraries cannot be installed together on databricks cluster

Question

Until last week, both the kedro and kedro[spark.SparkDataSet] pip libraries could be installed on the cluster. For the last 3-4 days, however, they can no longer be installed together. The cluster reports a duplicate library, and my code also fails because SparkDataSet cannot be found. If I install only kedro, I get the error shown in the screenshot below.

Abhishek K · Answer 1 · 2022-05-25T12:34:27.177

To install Kedro, first follow the installation prerequisites.

Install Kedro

To install Kedro from the Python Package Index (PyPI) simply run:

pip install kedro

Sample code -

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)

from kedro.extras.datasets.spark import SparkDataSet

schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])

data = [('Alex', 31), ('Bob', 12), ('Clarke', 65), ('Dave', 29)]

spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)

data_set = SparkDataSet(filepath="test_data")
data_set.save(spark_df)
reloaded = data_set.load()

reloaded.take(4)

Yes, thanks. After doing this I'm getting this error: DataSetError: No module named 'fsspec.asyn'. Failed to instantiate DataSet. Please note that I have added fsspec to the cluster and pip installed it in the notebook. – Msant May 26 '22 at 02:44

score 0 · Answer 2 · answered May 25 '22 at 13:25

You don't need to install both: `pip install kedro["spark.SparkDataSet"]==0.16.3` is a superset of `pip install kedro==0.16.3`.

datajoely

I get the error as shown in the edited post. This was the reason why I had both in the first place. – Msant May 26 '22 at 02:45
Looking at your response to the other answer here I think you may have a conflicting library installed on the Databricks cluster? Is there anything installed that would conflict with the version of `fsspec` required by this version of kedro? – datajoely May 27 '22 at 06:20
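
One way to check for the kind of conflict suggested above is to print the versions the notebook actually sees and test whether `fsspec.asyn` can be imported. This is only a diagnostic sketch: the helper names `installed_version` and `module_available` are made up for illustration, and the fallback import assumes the `importlib_metadata` backport is present on Python versions older than 3.8.

```python
import importlib
from typing import Optional

try:
    # Standard library on Python 3.8+.
    from importlib import metadata as importlib_metadata
except ImportError:
    # Assumption: the importlib_metadata backport is installed on older Pythons.
    import importlib_metadata


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib_metadata.version(package)
    except importlib_metadata.PackageNotFoundError:
        return None


def module_available(module: str) -> bool:
    """Return True if the module can be imported in the current environment."""
    try:
        importlib.import_module(module)
        return True
    except ImportError:
        return False


# Report what this Python environment resolves for the packages in question.
for name in ("kedro", "fsspec", "pyspark"):
    print(f"{name}: {installed_version(name)}")

# The error in the comment above refers to this submodule.
print("fsspec.asyn importable:", module_available("fsspec.asyn"))
```

If `fsspec` shows a version but `fsspec.asyn` is not importable, the notebook is probably picking up a different `fsspec` than the one kedro expects, for example a cluster-level library shadowing the notebook-level install.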