
I'm reading JSON log data from an S3 bucket. LogDNA is notorious for changing the schema of its log output, and it has recently done so again.

This leaves me with the following situation:

Files dated <=2019-07-29 contain the schema (simplified)

{_source.message: "Hello World"}

and files dated >=2019-07-30 have the schema

{message: "Hello World"}

Spark, in turn, infers the following schema (again, simplified):

root
 |-- _source: struct (nullable = true)
 |    |-- message: string (nullable = true)
 |-- message: string (nullable = true)

I would like to map both of these schemas into a single DataFrame column. How should I do this?
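In other words, for the simplified example above I'm after something conceptually like this (just to illustrate the intent, not tested; df stands for the DataFrame read across all the files):

from pyspark.sql import functions as F

# Take the new top-level field if present, otherwise fall back to the old nested one
merged = df.select(
    F.coalesce(F.col("message"), F.col("_source.message")).alias("message")
)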

Since the schema is discovered automatically, a mapping function would be the best approach, but even manual mapping is fine for me at the moment.


1 Answer


I've managed to solve it. It's a bit involved, so I'll share a few pieces of the complete solution in the hope that people will find them useful.

The solution is essentially code generation, the SQL-flavored kind.
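To make the target concrete: for a field such as message, which lives at the top level in the new schema and under _source in the old one, the generated expression ends up having roughly this shape (whitespace tidied up):

CASE WHEN message is not null THEN message
  ELSE CASE WHEN _source.message is not null THEN _source.message
    ELSE null
  END
END as message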

Loading the data

logs_json = spark.read.json("/mnt/seedx-ops-logs-prod/*/*/*.json.gz")
logs_json.registerTempTable("logs_json")

Defining a helper function

# https://stackoverflow.com/questions/39982135/apache-spark-dealing-with-case-statements

def _append_case_when(field, _end_counter):
  # Appends one branch of a nested CASE WHEN expression. Every branch after
  # the first is prefixed with ELSE, so successive branches chain into nested CASEs.
  _sql = ''

  if _end_counter > 0:
    _sql += ' ELSE '

  _sql += f""" CASE WHEN {field} is not null THEN {field} """

  return _sql
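For illustration, this is what the helper returns for the first and for a subsequent branch (output shown as comments, whitespace approximate):

_append_case_when('message', 0)
# ' CASE WHEN message is not null THEN message '

_append_case_when('_source.message', 1)
# ' ELSE  CASE WHEN _source.message is not null THEN _source.message '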

Generating the SQL

root = logs_json.schema.jsonValue()

# Complete Schema
fields_schema = root['fields']

# Old schema is nested under the {"_source": X} subtree.
_source = next(x for x in fields_schema if x['name'] == '_source')
old_fields_schema = _source['type']['fields']
old_schema = {s['name']: s for s in old_fields_schema}

# New schema
new_schema = {s['name']: s for s in fields_schema}
del new_schema['_source']

keys_new_schema_meta = {field['name'] for field in new_schema['_meta']['type']['fields']}
del new_schema['_meta']
keys_new_schema = set(new_schema.keys())


keys_old_schema_meta = {field['name'] for field in old_schema['_meta']['type']['fields']}
del old_schema['_meta']
keys_old_schema = set(old_schema.keys())

schema_keys = keys_new_schema | keys_new_schema_meta | keys_old_schema | keys_old_schema_meta

STRUCT_SQL = []

for field in schema_keys:
  in_new = field in keys_new_schema
  in_new_meta = field in keys_new_schema_meta
  in_old = field in keys_old_schema
  in_old_meta = field in keys_old_schema_meta

  _sql = ''

  _end_counter = 0

  if in_new:
    _sql += _append_case_when(field, _end_counter)
    _end_counter += 1

  if in_new_meta:
    _sql += _append_case_when(f"_meta.{field}", _end_counter)
    _end_counter += 1

  if in_old:
    _sql += _append_case_when(f"_source.{field}", _end_counter)
    _end_counter += 1

  if in_old_meta:
    _sql += _append_case_when(f"_source._meta.{field}", _end_counter)
    _end_counter += 1

  _sql += ' ELSE null '

  for x in range(_end_counter): _sql += ' END '

  _sql += f""" as {field}"""

  STRUCT_SQL.append(_sql)

STRUCT_FORMATTED = ',\n'.join(STRUCT_SQL)
SQL = f"""struct({STRUCT_FORMATTED}) AS FACET"""

logs = spark.sql(f'SELECT FACET.* FROM (select {SQL} from logs_json)')
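All of the per-field CASE WHEN expressions are packed into a single struct(...) AS FACET and then unpacked with SELECT FACET.*, which yields one flat column per field no matter which schema the original record used. Before running the query it can be worth eyeballing the generated statement, e.g.:

print(SQL)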

Query example

logs.createOrReplaceTempView('logs')

%sql
select * from logs
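Outside a Databricks %sql cell, the same sanity check can be done from plain PySpark, for example:

spark.sql("select * from logs").show(10, truncate=False)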