
I am currently trying to create a feature table and write the data from a dataframe into it:

from databricks import feature_store
from databricks.feature_store import feature_table
from databricks.feature_store import FeatureStoreClient

pyspark_df = dataframe.to_spark()

fs = FeatureStoreClient()

customer_feature_table = fs.create_table(
  name='FeatureStore.Features',
  primary_keys=['ID1', 'ID2'],
  schema = pyspark_df.schema,
  description='CustomerProfit features'
)

fs.write_table(
  name='FeatureStore.Features',
  df = pyspark_df,
  mode = 'overwrite'
)

If I execute this code, I run into the following error message:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 554.0 failed 4
times, most recent failure: Lost task 0.3 in stage 554.0 (TID 1100) (10.139.64.9 executor 19):
ExecutorLostFailure (executor 19 exited caused by one of the running tasks) 
Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. 
Check driver logs for WARN messages.

I am using the runtime version: 10.3 ML (includes Apache Spark 3.2.1, Scala 2.12)

I tried the same code on a smaller dataframe and it worked. I also tried a more powerful "driver type", but I still run into the issue. Why do I run into this error, and is there a solution or workaround?

SqHu
  • What kind of transformation are you doing? Are you using any Python user-defined functions? – Alex Ott Mar 23 '22 at 19:06
  • @AlexOtt Before I get to the point where I want to save the data, I do some basic data preparation, which also includes a user-defined function. – SqHu Mar 24 '22 at 08:30
  • Try to avoid the use of UDFs (a sketch of this is shown after these comments). Also, maybe try bigger node types for the workers (not for the driver). – Alex Ott Mar 24 '22 at 08:41
  • @AlexOtt Alright, I got rid of the UDF and chose a bigger node. Unfortunately, it is still not working. The dataframe I try to save has quite a few columns (~180) and millions of rows. Maybe it is just too big for the feature store... – SqHu Mar 25 '22 at 08:43
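
For illustration, here is a minimal sketch of what dropping a Python UDF in favour of built-in Spark SQL functions can look like. The Revenue, Cost, and Profit columns are hypothetical and not taken from the question; only pyspark_df comes from the code above.

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Python UDF: every row is shipped to a Python worker and back, which adds
# serialization overhead and memory pressure on the executors.
profit_udf = F.udf(lambda revenue, cost: revenue - cost, DoubleType())
df_udf = pyspark_df.withColumn('Profit', profit_udf('Revenue', 'Cost'))

# Equivalent logic with built-in column expressions: it stays inside the JVM
# and can be optimized by Catalyst as part of the whole query plan.
df_native = pyspark_df.withColumn('Profit', F.col('Revenue') - F.col('Cost'))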

1 Answer


Try using partition_columns; it makes writing and loading the data easier. Visit https://docs.databricks.com/machine-learning/feature-store/feature-tables.html for more information.

fs.create_table(
    name=table_name,
    primary_keys=['ID1', 'ID2'],
    df=df,
    partition_columns='ID1',
    description='enter table description'
)
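
A possible follow-up, assuming the same fs, table_name, and df as in the answer: once the table is created with a partition column, the write itself stays the same, and repartitioning the dataframe by that column first may keep the individual write tasks smaller. This is a sketch, not part of the original answer.

# Repartition by the partition column so each task writes into fewer partitions
# (assumption: 'ID1' has reasonable cardinality).
fs.write_table(
    name=table_name,
    df=df.repartition('ID1'),
    mode='overwrite'  # 'merge' would upsert rows instead of replacing the table
)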
Kiana Hadd
  • upvoted: it is nearly always a good idea to have partition_columns in the mix. One thing I wonder is how that works with the underlying `delta` tables – WestCoastProjects Jan 25 '23 at 23:36
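
Regarding the comment above: feature tables are backed by Delta tables, so one way to see how partition_columns was applied is to inspect the underlying table directly. A sketch, assuming the table name from the question:

# DESCRIBE DETAIL on the backing Delta table reports its partition columns.
spark.sql('DESCRIBE DETAIL FeatureStore.Features') \
    .select('partitionColumns') \
    .show(truncate=False)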