
I have a JSON file that is loaded by two different Autoloaders. The first uses schema evolution and, aside from replacing spaces in the JSON property names, writes the JSON directly to a Delta table; I can see all the values are there properly. In the second one I map to a defined schema and only use a subset of properties, so I use a lot of withColumn calls and then a select to narrow down to my defined column list.

Autoloader definition:

df = (spark
    .readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('multiLine', 'true')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.includeExistingFiles', 'true')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .option('cloudFiles.inferColumnTypes', 'true')
    .option('pathGlobFilter', '*.json')
    .load(upload_path)
    .transform(lambda df: remove_spaces_from_columns(df))
    .withColumn(...
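
For reference, remove_spaces_from_columns is just a small rename helper. A minimal sketch of what it does (assumed implementation, based on the space replacement described above; the real helper may differ in detail):

def remove_spaces_from_columns(df):
    # Replace spaces in top-level column names so Delta accepts them.
    for c in df.columns:
        if ' ' in c:
            df = df.withColumnRenamed(c, c.replace(' ', '_'))
    return df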

Writer:

df.writeStream.format('delta') \
    .queryName(al_stream_name) \
    .outputMode('append') \
    .option('checkpointLocation', checkpoint_path) \
    .option('mergeSchema', 'true') \
    .trigger(once=True) \
    .table(bronze_table)

The issue is that some of the source columns load fine and I get their values, while others are constantly null in the output table.

For example:

   .withColumn('vl_rating', col('risk_severity.value')) # works
   .withColumn('status',    col('status.name')) # always null
...
   .select(
        'rating',
        'status',
...

The JSON is quite simple: these are all string values, and they are always populated. The same code works against another similar JSON file in another Autoloader without issue.

I have run out of ideas for fault-finding this. My imports are minimal, and outside of Autoloader the JSON loads fine, e.g.:

%python
import pyspark.sql.functions as psf

jsontest = spark.read.option('inferSchema','true').json('dbfs:....json')
df = jsontest.withColumn('status', psf.col('status.name')).select('status')
display(df)

This returns the values of the status.name property of the JSON file.
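
The equivalent check on the streaming side is to print the schema the cloudFiles reader reports before any transforms, to see whether status is still a struct with a name field there (a minimal sketch reusing the reader options above):

raw = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .load(upload_path))

# A stale stored schema here, differing from what the batch read infers,
# would explain why some nested values come through as null.
raw.printSchema()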

Any ideas would be greatly appreciated.


1 Answer


I have found what is generally causing this. Interesting cause!

I am scanning a whole directory of JSON files, and the schema evolves over time (as expected). But when I clear out the Autoloader schema and checkpoint directories and only scan the latest JSON file, it all works correctly.

So what I surmise is that something in schema evolution with the older JSON files gets Autoloader into a state where it will not pass certain properties through the stream to the writer.
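
This fits with how schemaEvolutionMode 'rescue' behaves: the stored schema is never evolved, and values that do not match it are captured in the rescued data column rather than in their own columns. A quick way to confirm (a sketch, assuming the default _rescued_data column name):

check = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    .option('cloudFiles.schemaEvolutionMode', 'rescue')
    .option('cloudFiles.schemaLocation', bronze_schema)
    .load(upload_path)
    .select('status', '_rescued_data'))

# In a Databricks notebook, display() can render a streaming DataFrame;
# if this is the cause, the 'missing' values show up inside _rescued_data.
display(check)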

If anyone has a recommendation on how to implement some data quality analysis in an Autoloader pipeline, I would be most appreciative if you would share it.
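
In case it helps anyone else, one generic pattern I am considering is a foreachBatch writer that counts nulls in the watched columns before appending (a rough sketch, not production code; the column list is a placeholder):

import pyspark.sql.functions as psf

def write_with_quality_check(batch_df, batch_id):
    # Placeholder list of columns whose null counts we want to track.
    watched = ['vl_rating', 'status']
    null_counts = batch_df.select(
        [psf.count(psf.when(psf.col(c).isNull(), 1)).alias(c) for c in watched]
    ).first().asDict()
    print(f'batch {batch_id} null counts: {null_counts}')
    batch_df.write.format('delta').mode('append').saveAsTable(bronze_table)

(df.writeStream
    .foreachBatch(write_with_quality_check)
    .option('checkpointLocation', checkpoint_path)
    .trigger(once=True)
    .start())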
