How to convert a large jsonl file with unknown json properties into csv using Apache Beam, google dataflow and java
Here is my scenario:
- A large jsonl file is in google storage
- Json properties are unknown, so using Apache Beam's Schema can not be defined in Beam's pipeline.
- Use Apache beam, google dataflow and java to convert jsonl to csv
- Once transformation is done, store csv in google storage (same bucket where jsonl is stored)
- Notify by some means, like transformation_done=true if possible (rest api or event)
Any help or guidance would be helpful, as I am new to Apache beam, though I am reading the doc from Apache Beam.
I have edited the question with an example JSONL data
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}
While json key's are there in an input file but it's not known while transforming. I'll explain that by an example, suppose I have three clients and each got it's own google storage, so each upload their own jsonl file with different json properties.
Client 1: Input Jsonl File
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Client 2: Input Jsonl File
{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}
Client 3: Input Jsonl File
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
Question: How could I write A Generic beam pipeline which transforms all three as shown below
Client 1: Output CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]
Client 2: Output CSV file
["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]
Client 3: Output CSV file
["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]