
Example of the log file:

{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}

It will generate 5 files:

  1. timestamp.column
  2. Field1.column
  3. Field_Doc.f1.column
  4. Field_Doc.f2.column
  5. Field_Doc.f3.column

The column file format is as follows:

  • string fields are separated by a newline '\n' character; assume that no string value contains newline characters, so there is no need to escape them
  • double, integer & boolean fields are represented as a single value per line
  • null, undefined & empty strings are represented as an empty line

Example content of timestamp.column:

2022-01-14T00:12:21.000
2022-01-18T00:15:51.000
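The serialization rules above can be sketched as a small helper (`to_column_line` is an illustrative name, not part of the question):

```python
# One line per value: null/undefined and empty strings become an empty
# line; strings, doubles, integers and booleans become their text form.
def to_column_line(value):
    if value is None or value == "":
        return "\n"               # null / undefined / empty string -> empty line
    return str(value) + "\n"      # string, double, integer, boolean -> one line

print(to_column_line("2022-01-14T00:12:21.000"), end="")
print(to_column_line(None), end="")
print(to_column_line(1.7), end="")
```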

Note: the fields in the log are dynamic; do not assume that these are the only properties

Can someone tell me how to do this?

The size of the log file is about 4 GB to 48 GB.

sparsh
    If every JSON is on a single line then you can `open()` the file and use `for line in file` to read it line by line, and next you can convert each line to a dictionary using the module `json`. And later you have to open `timestamp.column` in append mode `"a"` and write `data["timestamp"] + "\n"` to this file. And you have to do the same with the other fields. You could use `for key, value in data.items()` and use `f"{key}.column"` to create the filename and write `value` to this file – furas Feb 20 '22 at 14:27
    You may need `isinstance(value, dict)` to check whether you have a nested dictionary like `{"f1": 0, "f2": 1.7, "f3": 2}` and run a nested `for key, value in value.items()` for this dictionary. And if you expect that it may contain another dictionary then you may need another nested for-loop - and it can be simpler to use recursion. – furas Feb 20 '22 at 14:29

1 Answer


If every JSON object is on a single line then you can open() the file and use for line in file: to read it line by line - and next you can convert each line to a dictionary using the json module and process it.

You can use for key, value in data.items(): to work with every item separately. You can use key to create the filename f"{key}.column", open it in append mode "a" and write str(value) + "\n" to this file.

Because you have nested dictionaries, you need isinstance(value, dict) to check whether a value is itself a dictionary like {"f1": 0, "f2": 1.7, "f3": 2} and repeat the code for this dictionary - and this may need recursion.


Minimal working code.

I use io only to simulate a file in memory, but you should use open(filename).

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

import json

# --- functions ---

def process_dict(data, prefix=""):
    
    for key, value in data.items():
        
        if prefix:
            key = prefix + "." + key

        if isinstance(value, dict):
            process_dict(value, key)
        else:
            with open(key + '.column', "a") as f:
                f.write(str(value) + "\n")

# --- main ---

#file_obj = open("filename")

import io
file_obj = io.StringIO(file_data)  # emulate file in memory

for line in file_obj:
    data = json.loads(line)
    print(data)
    process_dict(data)
    #process_dict(data, "some prefix for all files")
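With a 4-48 GB log, opening and closing every .column file once per record will dominate the runtime. A possible variant (a sketch, not the answer's code) caches one open handle per column; the tempfile directory and the names `handles`/`write_value` are only there to keep the example self-contained:

```python
import io
import json
import os
import tempfile

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

out_dir = tempfile.mkdtemp()  # scratch dir, only to keep the sketch self-contained
handles = {}                  # one cached, open handle per column file

def write_value(key, value):
    f = handles.get(key)
    if f is None:  # first time we see this column -> open it once
        f = handles[key] = open(os.path.join(out_dir, key + ".column"), "a")
    f.write(str(value) + "\n")

def process_dict(data, prefix=""):
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            process_dict(value, key)
        else:
            write_value(key, value)

for line in io.StringIO(file_data):  # emulate the log file, as above
    process_dict(json.loads(line))

for f in handles.values():  # flush and close everything at the end
    f.close()
```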

EDIT:

A more universal version - it gets a function as a parameter, so it can be used with different write functions:

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

import json

# --- functions ---

def process_dict(data, func, prefix=""):
    
    for key, value in data.items():
        
        if prefix:
            key = prefix + "." + key
        
        if isinstance(value, dict):
            process_dict(value, func, key)
        else:
            func(key, value)

def write_func(key, value):
    with open(key + '.column', "a") as f:
        f.write(str(value) + "\n")

# --- main ---

#file_obj = open("filename")

import io
file_obj = io.StringIO(file_data)  # emulate file in memory

for line in file_obj:
    data = json.loads(line)
    print(data)
    process_dict(data, write_func)
    #process_dict(data, write_func, "some prefix for all files")

Another idea to make it more universal is to create a function which flattens the dict and produces

{'timestamp': '2022-01-14T00:12:21.000', 'Field1': 10, 'Field_Doc.f1': 0}
{'timestamp': '2022-01-18T00:15:51.000', 'Field_Doc.f1': 0, 'Field_Doc.f2': 1.7, 'Field_Doc.f3': 2}

and then use a loop to write the elements.


file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

import json

# --- functions ---

def flatten_dict(data, prefix=""):
    
    result = {}

    for key, value in data.items():
        
        if prefix:
            key = prefix + "." + key
        
        if isinstance(value, dict):
            result.update( flatten_dict(value, key) )
        else:
            result[key] = value
            #result.update( {key: value} )
            
    return result

# --- main ---

#file_obj = open("filename")

import io
file_obj = io.StringIO(file_data)  # emulate file in memory

for line in file_obj:
    data = json.loads(line)
    print('before:', data)
    
    data = flatten_dict(data)
    #data = flatten_dict(data, "some prefix for all items")
    print('after :', data)
    
    print('---')
    
    for key, value in data.items():
        with open(key + '.column', "a") as f:
            f.write(str(value) + "\n")
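One caveat with all the variants above: columns lose row alignment when a field is absent from some records - in the example log, Field1.column gets one line while timestamp.column gets two, so line N no longer corresponds to record N in every file. Since the format represents undefined values as empty lines, a sketch that pads missing rows could look like this (tracking how many lines each column already has; `counts` and the scratch directory are illustrative, not from the answer):

```python
import io
import json
import os
import tempfile

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

out_dir = tempfile.mkdtemp()  # scratch dir, only to keep the sketch self-contained
counts = {}                   # lines already written per column file

def flatten_dict(data, prefix=""):
    result = {}
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            result.update(flatten_dict(value, key))
        else:
            result[key] = value
    return result

row = 0
for line in io.StringIO(file_data):  # emulate the log file, as above
    for key, value in flatten_dict(json.loads(line)).items():
        with open(os.path.join(out_dir, key + ".column"), "a") as f:
            # empty lines for earlier rows where this field was missing
            f.write("\n" * (row - counts.get(key, 0)))
            f.write(str(value) + "\n")
        counts[key] = row + 1
    row += 1

# pad columns whose field was missing from the last rows
for key, n in counts.items():
    if n < row:
        with open(os.path.join(out_dir, key + ".column"), "a") as f:
            f.write("\n" * (row - n))
        counts[key] = row
```

With the sample log this writes two lines to every column file: Field1.column ends with an empty line, and Field_Doc.f2.column starts with one.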
furas