
I am working on a pipeline in Databricks > Workflows > Delta Live Tables and am having an issue with the streaming part.

Expectations:

  • One bronze table reads the JSON files with Auto Loader (cloudFiles) in streaming mode (spark.readStream); a minimal sketch of what I mean follows this list
  • One silver table reads and flattens the bronze table as a stream (dlt.read_stream)
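
For reference, this is roughly the shape I have in mind for that setup (a minimal sketch only; the table names and paths here are placeholders, the actual code is further down):

import dlt

@dlt.table(name="example_bronze")  # placeholder name
def example_bronze():
    # Auto Loader (cloudFiles) streaming read of the raw JSON files
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/example_raw_mount/example_source/")  # placeholder path
    )

@dlt.table(name="example_silver")  # placeholder name
def example_silver():
    # Streaming read of the bronze table within the same pipeline
    return dlt.read_stream("example_bronze")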

Result:

  1. When taking the root location as the source (load /*, several hundred files): the pipeline starts, but the number of rows/files appended is not updated in the graph until the bronze part is complete. Then the silver part starts, the number of rows/files never updates either, and the pipeline terminates with a memory error.
  2. When taking a very small number of files (/specific_folder among hundreds): the pipeline runs well and terminates with no error, but again the number of rows/files appended is not updated in the graph until each part has completed.

This led me to the conclusion that the pipeline does not seem to run in streaming mode. Maybe I am missing something about the configuration or about how to run a DLT pipeline properly, and I would appreciate your help with this.

Here is the configuration of the pipeline:

{
    "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "clusters": [
        {
            "label": "default",
            "aws_attributes": {
                "instance_profile_arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/iam_role_example"
            },
            "autoscale": {
                "min_workers": 1,
                "max_workers": 10,
                "mode": "LEGACY"
            }
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "CURRENT",
    "edition": "PRO",
    "photon": false,
    "libraries": [
        {
            "notebook": {
                "path": "/Repos/user_example@xxxxxx.xx/dms/bronze_job"
            }
        }
    ],
    "name": "01-landing-task-1",
    "storage": "dbfs:/pipelines/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "configuration": {
        "SCHEMA": "example_schema",
        "RAW_MOUNT_NAME": "xxxx",
        "DELTA_MOUNT_NAME": "xxxx",
        "spark.sql.parquet.enableVectorizedReader": "false"
    },
    "target": "landing"
}

Here is the code of the pipeline (the query in the silver table actually contains many more columns extracted with get_json_object, around 30):

import dlt
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window

RAW_MOUNT_NAME = spark.conf.get("RAW_MOUNT_NAME")
SCHEMA = spark.conf.get("SCHEMA")
SOURCE = spark.conf.get("SOURCE")
TABLE_NAME = spark.conf.get("TABLE_NAME")
PRIMARY_KEY_PATH = spark.conf.get("PRIMARY_KEY_PATH")

@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={
        "quality": "bronze"
    }
)
def bronze_job():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"

    # Streaming read of the raw files as plain text; each row's "value" column
    # holds the JSON payload, plus a set of _etl_* audit columns.
    return spark \
            .readStream \
            .format("text") \
            .option("encoding", "UTF-8") \
            .load(load_path) \
            .select("value", "_metadata") \
            .withColumnRenamed("value", "json") \
            .withColumn("id", F.expr(f"get_json_object(json, '$.{PRIMARY_KEY_PATH}')")) \
            .withColumn("_etl_timestamp", F.col("_metadata.file_modification_time")) \
            .withColumn("_metadata", F.col("_metadata").cast(T.StringType())) \
            .withColumn("_etl_operation", F.lit("U")) \
            .withColumn("_etl_to_delete", F.lit(False)) \
            .withColumn("_etl_file_name", F.input_file_name()) \
            .withColumn("_etl_job_processing_timestamp", F.current_timestamp()) \
            .withColumn("_etl_table", F.lit(f"{TABLE_NAME}")) \
            .withColumn("_etl_partition_date", F.to_date(F.col("_etl_timestamp"), "yyyy-MM-dd")) \
            .select("_etl_operation", "_etl_timestamp", "id", "json", "_etl_file_name", "_etl_job_processing_timestamp", "_etl_table", "_etl_partition_date", "_etl_to_delete", "_metadata")

@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_silver",
    table_properties={
        "quality": "silver",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def silver_job():
    # Streaming read of the bronze table, keeping only the 'extraction' rows
    df = dlt.read_stream(f"{SCHEMA}_{TABLE_NAME}_bronze").where("_etl_table == 'extraction'")
    return df.select(
                    df.id.alias('medium_id'),
                    F.get_json_object(df.json, '$.request').alias('request_id'))
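
The other ~30 columns in the real query follow exactly the same get_json_object pattern; purely for illustration (these field names are made up, not the real ones):

    return df.select(
                    df.id.alias('medium_id'),
                    F.get_json_object(df.json, '$.request').alias('request_id'),
                    # ... ~30 more extractions shaped like these (hypothetical paths):
                    F.get_json_object(df.json, '$.example.field_a').alias('example_field_a'),
                    F.get_json_object(df.json, '$.example.nested[0].field_b').alias('example_field_b'))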

Thank you very much for your help!
