
So, I'm trying to load Avro files into DLT and create pipelines and so forth. As a simple data frame in Databricks, I can read and unpack the Avro files using spark.read.json / rdd.map / a lambda function. From there I can create a temp view, run a SQL query, and select the fields I want.

--example command 

# read all Avro files under the mount point
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
# cast the Body payload to a string, then parse each row as JSON
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)

data.createOrReplaceTempView("eventhub")

--selecting the data
sql_query1 = spark.sql("""
select distinct
data.field.test1             as col1
,data.field.test2             as col2
,data.field.fieldgrp.city     as city
from
    eventhub
""")

However, I am now trying to replicate that process with Delta Live Tables and pipelines. I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just the Avro in its rawest form.
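
For reference, the bronze step looks roughly like this (a minimal sketch: the table name, path, and Auto Loader options are placeholders for my real ones):

-- bronze: raw Avro via Auto Loader

import dlt

@dlt.table(name="bronze_file_list")
def bronze_file_list():
    return (
        spark.readStream.format("cloudFiles")     # Auto Loader
        .option("cloudFiles.format", "avro")      # keep the Avro as is
        .load("/mnt/file_location/")              # placeholder path
    )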

I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", so that I can then write queries against it.

The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer: the functions just don't seem to be applied, so the data never becomes readable/selectable.

This is the sort of code I have been trying. It does not seem to pick up the schema, as if the functions are not running, so when I try to select a column it is not recognised.

--unpacked data

@dlt.view(name="eventdata_v")
def eventdata_v():
    # read the raw rows from the bronze table
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    # cast the Body payload to a string and parse it as JSON via an RDD
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data

 
--trying to query the data, but it does not recognise the field names, even when I select "data" on its own
@dlt.view(name="eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
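
For what it's worth, here is a rough sketch of what I think the non-RDD equivalent would look like, parsing the Body column with from_json against a hand-written schema (the schema below is just a placeholder; my real payload is nested JSON with fields like data.field.test1 and data.field.fieldgrp.city). I have not confirmed this end to end, but it avoids the rdd.map step entirely:

-- sketch: parse Body with from_json instead of an RDD

import dlt
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# placeholder schema mirroring the nested JSON payload
body_schema = StructType([
    StructField("data", StructType([
        StructField("field", StructType([
            StructField("test1", StringType()),
            StructField("test2", StringType()),
        ])),
    ])),
])

@dlt.view(name="eventdata_json_v")
def eventdata_json_v():
    return (
        dlt.read("bronze_file_list")
        .select(from_json(col("Body").cast("string"), body_schema).alias("json"))
        .select("json.*")
    )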

I have been working on this for weeks, trying different approaches, but still no luck. Any help would be much appreciated. Thank you.

  • Show how you are trying to do that in DLT... (you can update your question) – Alex Ott Oct 03 '22 at 16:13
  • Another question - what does the data look like inside the Avro files? – Alex Ott Oct 20 '22 at 11:27
  • Sorry for the late reply. It's just compressed data. It looks like garbage at first sight, but after I unpack it, it makes sense (it becomes a nested JSON). So I can unpack it in a normal script, but not when I want to apply it to DLT; it just fails. This is the code I use to unpack it in a normal Python window: df1 = df0.select(df0.Body.cast('string')); rdd1 = df1.rdd.map(lambda x: x[0]); data = spark.read.json(rdd1); data.createOrReplaceTempView("my_data") – jo80 Nov 09 '22 at 16:41

0 Answers