Until last week, both the kedro and kedro[spark.SparkDataSet] pip libraries were installed on the cluster. For the last 3-4 days, however, they can no longer be installed together. The cluster reports a duplicate library, and my code then fails because SparkDataSet cannot be found. If I install only kedro, I get the error shown in the screenshot below.
Viewed 582 times
2 Answers
0
To install Kedro, first follow the installation prerequisites.
To install Kedro from the Python Package Index (PyPI) simply run:
pip install kedro
Sample code -
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)
from kedro.extras.datasets.spark import SparkDataSet

# Build a small DataFrame with an explicit schema
schema = StructType([StructField("name", StringType(), True),
                     StructField("age", IntegerType(), True)])
data = [('Alex', 31), ('Bob', 12), ('Clarke', 65), ('Dave', 29)]
spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)

# Save it through SparkDataSet, then load it back
data_set = SparkDataSet(filepath="test_data")
data_set.save(spark_df)
reloaded = data_set.load()
reloaded.take(4)

Abhishek K
Yes, thanks. After doing this I'm getting this error: DataSetError: No module named 'fsspec.asyn'. Failed to instantiate DataSet. Please note that I have added fsspec to the cluster and pip-installed it in the notebook. – Msant May 26 '22 at 02:44
0
You don't need to install both: `pip install kedro["spark.SparkDataSet"]==0.16.3` is a superset of `pip install kedro==0.16.3`.

datajoely
I get the error as shown in the edited post. This was the reason why I had both in the first place. – Msant May 26 '22 at 02:45
Looking at your response to the other answer here, I think you may have a conflicting library installed on the Databricks cluster. Is there anything installed that would conflict with the version of `fsspec` required by this version of kedro? – datajoely May 27 '22 at 06:20
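One way to check for such a conflict is to print, from a notebook cell, which versions the Python environment actually resolves and whether `fsspec.asyn` can be located at all. The snippet below is only a sketch using the standard library (Python 3.8+); the helper names `installed_version` and `module_available` are made up for illustration, and the list of packages to inspect is an assumption based on this thread.

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_available(module_name):
    """Return True if module_name can be located.

    For a dotted name such as 'fsspec.asyn', find_spec imports the parent
    package first, so a missing or broken parent is reported as False.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Packages mentioned in this thread; adjust the list as needed.
    for dist in ("kedro", "fsspec", "pyspark"):
        print(dist, installed_version(dist))
    print("fsspec.asyn available:", module_available("fsspec.asyn"))
```

If `fsspec` reports a version but `fsspec.asyn` is not available, that would be consistent with an older `fsspec` copy on the cluster shadowing the one you installed in the notebook.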