
I'm trying to parse incoming variable-length stream records in Databricks using Delta Live Tables. I'm getting the error:

Queries with streaming sources must be executed with writeStream.start();

Notebook code

import dlt
from pyspark.sql import functions as F

@dlt.table(
    comment="xAudit Parsed"
)
def b_table_parsed():
    df = dlt.readStream("dlt_table_raw_view")

    # find the widest record, then add one column per array element
    for i in range(df.select(F.max(F.size("split_col"))).collect()[0][0]):
        df = df.withColumn("col" + str(i), df["split_col"][i])

    df = df.drop("value", "split_col")

    return df

This all works fine against the actual source text files or a Delta table on the interactive cluster, but when I put it in DLT and the source is streaming files from Auto Loader, it doesn't like it. I assume it's stream related.

I saw a different post about possibly using .foreach, but that was using writeStream, and I'm not sure if or how I can use it to return a DLT table, or whether there is another solution.

I'm very new to Python, streaming, and DLT, so I would appreciate it if anyone could walk me through a detailed solution.


  • Show the definition of `dlt_table_raw_view` (better, the full pipeline). Please note that in DLT it's `dlt.read_stream(...)` – Alex Ott Feb 19 '23 at 10:57

1 Answer


The problem is in this piece of code: df.select(F.max(F.size('split_col'))).collect()[0][0]. You're trying to find a max and collect it from a stream, which by definition doesn't have a start or an end. Your code most probably works with a batch DataFrame, or inside a function called from .foreachBatch, which isn't supported by DLT.
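One way around this is to avoid the collect() entirely and use a fixed upper bound on the record width. This is a minimal sketch, assuming you know (or can over-estimate) the maximum number of fields per record; the MAX_COLS constant is a hypothetical value you would set for your data. Indexing past the end of an array in Spark returns null, so over-allocating columns is safe for shorter records:

import dlt
from pyspark.sql import functions as F

MAX_COLS = 20  # assumed upper bound on fields per record; set for your data

@dlt.table(comment="xAudit Parsed")
def b_table_parsed():
    df = dlt.read_stream("dlt_table_raw_view")
    # indexing past the end of the array yields null, so shorter
    # records simply get null in the trailing columns
    for i in range(MAX_COLS):
        df = df.withColumn("col" + str(i), F.col("split_col")[i])
    return df.drop("value", "split_col")

Alternatively, you could compute the max width with a separate batch read (e.g. spark.read.table(...) against the underlying raw table) before building the streaming query, which keeps the collect() on a batch DataFrame where it is allowed.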

– Alex Ott