
So, I'm trying to load Avro files into DLT and create pipelines and so forth. As a simple data frame in Databricks, I can read and unpack the Avro files using spark.read.json / rdd.map / a lambda function. From there I can create a temp view, run a SQL query, and select the fields I want.

--example command 

# read all Avro files under the mount point
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
# cast the Body payload to a string, then parse each row as JSON
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)

data.createOrReplaceTempView("eventhub")

--selecting the data
sql_query1 = spark.sql("""
select distinct
data.field.test1             as col1
,data.field.test2             as col2
,data.field.fieldgrp.city     as city
from
    eventhub
""")

However, I am now trying to replicate that process with Delta Live Tables and pipelines. I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just the Avro in its rawest form.
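
For reference, the bronze step looks roughly like this (a minimal sketch: the table name, path, and Auto Loader options are placeholders for my real ones):

-- bronze: raw Avro via Auto Loader

import dlt

@dlt.table(name="bronze_file_list")
def bronze_file_list():
    return (
        spark.readStream.format("cloudFiles")     # Auto Loader
        .option("cloudFiles.format", "avro")      # keep the Avro as is
        .load("/mnt/file_location/")              # placeholder path
    )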

I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", so that I can then write queries against it.

The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer: the functions just don't seem to be applied, so the data never becomes readable/selectable.

This is the sort of code I have been trying. It does not seem to pick up the schema, as if the functions are not running, so when I try to select a column it is not recognised.

--unpacked data

@dlt.view(name="eventdata_v")
def eventdata_v():
    # read the raw rows from the bronze table
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    # cast the Body payload to a string and parse it as JSON via an RDD
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data

 
--trying to query the data, but it does not recognise the field names, even when I select "data" on its own
@dlt.view(name="eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
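
For what it's worth, here is a rough sketch of what I think the non-RDD equivalent would look like, parsing the Body column with from_json against a hand-written schema (the schema below is just a placeholder; my real payload is nested JSON with fields like data.field.test1 and data.field.fieldgrp.city). I have not confirmed this end to end, but it avoids the rdd.map step entirely:

-- sketch: parse Body with from_json instead of an RDD

import dlt
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# placeholder schema mirroring the nested JSON payload
body_schema = StructType([
    StructField("data", StructType([
        StructField("field", StructType([
            StructField("test1", StringType()),
            StructField("test2", StringType()),
        ])),
    ])),
])

@dlt.view(name="eventdata_json_v")
def eventdata_json_v():
    return (
        dlt.read("bronze_file_list")
        .select(from_json(col("Body").cast("string"), body_schema).alias("json"))
        .select("json.*")
    )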

I have been working on this for weeks, trying different approaches, but still no luck. Any help would be much appreciated. Thank you.

  • Show how you are trying to do that in DLT... (you can update your question) – Alex Ott Oct 03 '22 at 16:13
  • Another question - what does the data look like inside the Avro files? – Alex Ott Oct 20 '22 at 11:27
  • Sorry for the late reply. It's just compressed data. It looks like garbage at first sight, but after I unpack it, it makes sense (it becomes a nested JSON). So I can unpack it in a normal script, but not when I want to apply it to DLT; it just fails. This is the code I use to unpack it in a normal Python window: df1 = df0.select(df0.Body.cast('string')); rdd1 = df1.rdd.map(lambda x: x[0]); data = spark.read.json(rdd1); data.createOrReplaceTempView("my_data") – jo80 Nov 09 '22 at 16:41

0 Answers