
I'm reading JSON log data from an S3 bucket. LogDNA is notorious for changing the schema of its log output, and it has recently done so again.

This leaves me with the following situation:

Files dated <=2019-07-29 contain the schema (simplified)

{_source.message: "Hello World"}

and files dated >=2019-07-30 have the schema

{message: "Hello World"}

Spark, in turn, infers the following schema (again, simplified):

root
 |-- _source: struct (nullable = true)
 |    |-- message: string (nullable = true)
 |-- message: string (nullable = true)

I would like to map both of these schemas into a single DataFrame column. How should I do this?
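In other words, for the simplified example above I'm after something conceptually like this (just to illustrate the intent, not tested; df stands for the DataFrame read across all the files):

from pyspark.sql import functions as F

# Take the new top-level field if present, otherwise fall back to the old nested one
merged = df.select(
    F.coalesce(F.col("message"), F.col("_source.message")).alias("message")
)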

Since the schema is discovered automatically, a mapping function would be the best approach, but even manual mapping is fine for me at the moment.


1 Answer


I've managed to solve it. It's a bit involved, so I'll share a few pieces of the complete solution in the hope that people will find them useful.

The solution is essentially code generation, the SQL-flavored kind.
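To make the target concrete: for a field such as message, which lives at the top level in the new schema and under _source in the old one, the generated expression ends up having roughly this shape (whitespace tidied up):

CASE WHEN message is not null THEN message
  ELSE CASE WHEN _source.message is not null THEN _source.message
    ELSE null
  END
END as message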

Loading the data

logs_json = spark.read.json("/mnt/seedx-ops-logs-prod/*/*/*.json.gz")
logs_json.registerTempTable("logs_json")

Defining a helper function

# https://stackoverflow.com/questions/39982135/apache-spark-dealing-with-case-statements

def _append_case_when(field, _end_counter):
  # Appends one branch of a nested CASE WHEN expression. Every branch after
  # the first is prefixed with ELSE, so successive branches chain into nested CASEs.
  _sql = ''

  if _end_counter > 0:
    _sql += ' ELSE '

  _sql += f""" CASE WHEN {field} is not null THEN {field} """

  return _sql
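For illustration, this is what the helper returns for the first and for a subsequent branch (output shown as comments, whitespace approximate):

_append_case_when('message', 0)
# ' CASE WHEN message is not null THEN message '

_append_case_when('_source.message', 1)
# ' ELSE  CASE WHEN _source.message is not null THEN _source.message '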

Generating the SQL

root = logs_json.schema.jsonValue()

# Complete Schema
fields_schema = root['fields']

# Old schema is nested under the {"_source": X} subtree.
_source = next(x for x in fields_schema if x['name'] == '_source')
old_fields_schema = _source['type']['fields']
old_schema = {s['name']: s for s in old_fields_schema}

# New schema
new_schema = {s['name']: s for s in fields_schema}
del new_schema['_source']

keys_new_schema_meta = {field['name'] for field in new_schema['_meta']['type']['fields']}
del new_schema['_meta']
keys_new_schema = set(new_schema.keys())


keys_old_schema_meta = {field['name'] for field in old_schema['_meta']['type']['fields']}
del old_schema['_meta']
keys_old_schema = set(old_schema.keys())

schema_keys = keys_new_schema | keys_new_schema_meta | keys_old_schema | keys_old_schema_meta

STRUCT_SQL = []

for field in schema_keys:
  in_new = field in keys_new_schema
  in_new_meta = field in keys_new_schema_meta
  in_old = field in keys_old_schema
  in_old_meta = field in keys_old_schema_meta

  _sql = ''

  _end_counter = 0

  if in_new:
    _sql += _append_case_when(field, _end_counter)
    _end_counter += 1

  if in_new_meta:
    _sql += _append_case_when(f"_meta.{field}", _end_counter)
    _end_counter += 1

  if in_old:
    _sql += _append_case_when(f"_source.{field}", _end_counter)
    _end_counter += 1

  if in_old_meta:
    _sql += _append_case_when(f"_source._meta.{field}", _end_counter)
    _end_counter += 1

  _sql += ' ELSE null '

  for x in range(_end_counter): _sql += ' END '

  _sql += f""" as {field}"""

  STRUCT_SQL.append(_sql)

STRUCT_FORMATTED = ',\n'.join(STRUCT_SQL)
SQL = f"""struct({STRUCT_FORMATTED}) AS FACET"""

logs = spark.sql(f'SELECT FACET.* FROM (select {SQL} from logs_json)')
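All of the per-field CASE WHEN expressions are packed into a single struct(...) AS FACET and then unpacked with SELECT FACET.*, which yields one flat column per field no matter which schema the original record used. Before running the query it can be worth eyeballing the generated statement, e.g.:

print(SQL)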

Query example

logs.createOrReplaceTempView('logs')

%sql
select * from logs
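Outside a Databricks %sql cell, the same sanity check can be done from plain PySpark, for example:

spark.sql("select * from logs").show(10, truncate=False)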