
How to convert a large JSONL file with unknown JSON properties into CSV using Apache Beam, Google Dataflow, and Java

Here is my scenario:

  1. A large JSONL file is in Google Cloud Storage
  2. The JSON properties are unknown, so an Apache Beam Schema cannot be defined for the pipeline.
  3. Use Apache Beam, Google Dataflow, and Java to convert the JSONL to CSV
  4. Once the transformation is done, store the CSV in Google Cloud Storage (the same bucket where the JSONL is stored)
  5. Notify by some means, like transformation_done=true, if possible (REST API or event)

Any help or guidance would be appreciated, as I am new to Apache Beam, though I am reading the Apache Beam documentation.

I have edited the question with example JSONL data:

{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"} 

The JSON keys are present in the input file, but they are not known at transformation time. I'll explain with an example: suppose I have three clients, each with its own Google Cloud Storage bucket, and each uploads its own JSONL file with different JSON properties.

Client 1: Input JSONL File

{"city":"Mumbai", "pincode":"2012A"} 
{"city":"Delhi", "pincode":"2012N"} 

Client 2: Input JSONL File

{"Relation":"Finance", "Code":"2012A"} 
{"Relation":"Production", "Code":"20XXX"} 

Client 3: Input JSONL File

{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}

Question: How could I write a generic Beam pipeline which transforms all three as shown below?

Client 1: Output CSV file

["city", "pincode"] 
["Mumbai","2012A"] 
["Delhi", "2012N"] 

Client 2: Output CSV file

["Relation", "Code"] 
["Finance", "2012A"] 
["Production","20XXX"] 

Client 3: Output CSV file

["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]
Ash
  • I have some follow-up questions: 1. Is this going to be a batch pipeline or a streaming pipeline? 2. In a single newline-separated JSON file, are the properties the same for all the lines in that file or different for each line? 3. If the answer to the second is no, are there going to be multiple files with different schemas in the same bucket folder? – Saransh Jun 27 '22 at 05:44
  • Hi Saransh, 1. It's a batch pipeline, as it's reading from a text file. 2. All properties are the same for each line in a JSONL file. – Ash Jul 02 '22 at 07:43
  • The way I was thinking to go about this is to get the header of the file as a `PCollectionView` using a `ParDo` and then pass that as a `sideInput` to `TextIO.write` and use it as the header. But it seems that is not yet supported. It's being tracked here: https://github.com/apache/beam/issues/18464. – Saransh Jul 03 '22 at 19:26
  • The other solution I could think of is a messy one, and would be more or less like a normal Java program where the only parallelism would be on a per-file basis. Basically, reading a file inside a `ParDo` using the metadata from `FileIO.match()`, then adding one extra line for the header based on the data present within the file (the JSON keys), and then writing the values thereafter (the JSON values) in a single file. – Saransh Jul 03 '22 at 19:27
  • If the second approach works for you, I can provide a reference example for that. – Saransh Jul 03 '22 at 19:32

1 Answer


Edit: Removed the previous answer, as the question has been modified with examples.

There is no generic, ready-made way to achieve such a result. You have to write the logic yourself, depending on your requirements and how you are handling the pipeline.

Below are some examples, but you need to verify them for your case, as I have only tried them on a small JSONL file.

TextIO


Approach 1
If you can collect the header value for the output CSV, it is much easier. But getting the header beforehand is itself another challenge.

// pipeline
pipeline.apply("ReadJSONLines",
                TextIO.read().from("FILE URL"))
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processLines(@Element String line, OutputReceiver<String> receiver) {
                // convert each JSONL line into a comma-separated values line
                String values = getCsvLine(line, false);
                receiver.output(values);
            }
        }))
        .apply("WriteCSV",
                TextIO.write().to("FileName")
                        .withSuffix(".csv")
                        .withoutSharding()
                        .withDelimiter(new char[] { '\r', '\n' })
                        .withHeader(getHeader()));

private static String getHeader() {
    String header = "";
    // your logic to get the header line
    return header;
}

Probable ways to get the header line (these are only assumptions and may not work in your case); a sketch of the first option follows the list:

  • You can keep a text file in GCS which stores the header of a particular JSON file, and in your logic you can fetch the header by reading that file; check this SO thread about how to read files from GCS.
  • You can try to pass the header as a runtime argument, but that depends on how you are configuring and executing your pipeline.
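
For the first option, one way to peek at just the header without scanning the whole file is a minimal sketch like the one below (my own assumption, not part of the original answer): it uses Beam's FileSystems API and reuses the getCsvLine helper defined at the end of this answer. It assumes the GCS filesystem is registered, e.g. because PipelineOptions were created beforehand with the GCP dependencies on the classpath.

import java.io.BufferedReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

import org.apache.beam.sdk.io.FileSystems;

// Hypothetical helper: reads only the first JSONL record and derives the
// CSV header from its keys, without reading the rest of the file.
private static String peekHeader(String filePath) throws Exception {
    ReadableByteChannel channel = FileSystems.open(
            FileSystems.matchSingleFileSpec(filePath).resourceId());
    try (BufferedReader br = new BufferedReader(Channels.newReader(channel, "UTF-8"))) {
        String firstLine = br.readLine(); // first JSONL record only
        return getCsvLine(firstLine, true); // true => join the JSON keys
    }
}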

Approach 2
This is a workaround I found for small JSON files (~10k lines). The example below may not work for large files: the counter is serialized with the DoFn, so each worker gets its own copy, and element order in a PCollection is not guaranteed, which means the header may be emitted more than once or may not end up first.

// NOTE: this counter is serialized with the DoFn; each worker gets its own copy
final int[] count = { 0 };
pipeline.apply(/* read file */)
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processLines(@Element String line, OutputReceiver<String> receiver) {
                // check if this is the first element processed; if yes, emit the header first
                if (count[0] == 0) {
                    String header = getCsvLine(line, true);
                    receiver.output(header);
                    count[0]++;
                }
                String values = getCsvLine(line, false);
                receiver.output(values);
            }
        }))
        .apply(/* write file */)

FileIO


As mentioned by Saransh in the comments, with FileIO all you have to do is read the JSONL file line by line manually and then convert each line into comma-separated format, e.g.:

pipeline.apply(FileIO.match().filepattern("FILE PATH"))
        .apply(FileIO.readMatches())
        .apply(FlatMapElements
                .into(TypeDescriptors.strings())
                .via((FileIO.ReadableFile f) -> {
                    List<String> output = new ArrayList<>();
                    try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
                        String line = br.readLine();
                        while (line != null) {
                            // emit the header once, derived from the first line's keys
                            if (output.size() == 0) {
                                String header = getCsvLine(line, true);
                                output.add(header);
                            }
                            String result = getCsvLine(line, false);
                            output.add(result);
                            line = br.readLine();
                        }
                    } catch (IOException e) {
                        throw new RuntimeException("Error while reading", e);
                    }
                    return output;
                }))
        .apply(/* write to GCS */)
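
The write step is left as a placeholder above; a minimal sketch (the bucket path is hypothetical) could reuse the same TextIO.write call as in Approach 1. One caveat worth noting: a PCollection is unordered, so with this approach the header line is not strictly guaranteed to come out first.

// Minimal sketch of the write step; the output path is a placeholder.
.apply("WriteCSV",
        TextIO.write().to("gs://YOUR_BUCKET/output")
                .withSuffix(".csv")
                .withoutSharding() // single output file
                .withDelimiter(new char[] { '\r', '\n' }));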

In the above examples I have used a getCsvLine method (created for code reusability) which takes a single line from the file and converts it into comma-separated format. To parse the JSON object I have used Gson.

/**
 * @param line     a single JSONL line
 * @param isHeader true: returns the JSON keys joined as a CSV line;
 *                 false: returns the JSON values joined as a CSV line
 **/
public static String getCsvLine(String line, boolean isHeader) {
    List<String> values = new ArrayList<>();
    // convert the line into a JsonObject
    JsonObject jsonObject = JsonParser.parseString(line).getAsJsonObject();
    // iterate over the JSON object and collect all keys or values
    for (Map.Entry<String, JsonElement> entry : jsonObject.entrySet()) {
        if (isHeader)
            values.add(entry.getKey());
        else
            values.add(entry.getValue().getAsString());
    }
    return String.join(",", values);
}
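
As a quick sanity check (my own example, using the first sample line from the question), getCsvLine behaves like this:

// Sample line taken from the question; Gson's JsonObject preserves key order.
String line = "{\"Name\":\"Gilbert\", \"Session\":\"2013\", \"Score\":\"24\", \"Completed\":\"true\"}";
System.out.println(getCsvLine(line, true));  // Name,Session,Score,Completed
System.out.println(getCsvLine(line, false)); // Gilbert,2013,24,true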
Sayan Bhattacharya
  • Thanks @Sayan for adding the pipeline, though I had to edit the question; I feel it's my mistake. Keys are there in the JSONL file, but I have to write a generic pipeline so it can transform any JSONL to CSV with its JSON properties. – Ash Jul 02 '22 at 07:41
  • @Ash I have updated my answer. – Sayan Bhattacharya Jul 04 '22 at 09:13
  • Thanks @Sayan, Approach 1 seems closer to the real solution for large JSONL files, but Approach 2 does not, because reading a large file into a BufferedReader, or in other words into memory, doesn't seem like it will work, and that's the reason I started looking into Apache Beam. – Ash Jul 09 '22 at 18:43
  • @Ash Yes, you are right; you have to choose whatever suits your case. Consider upvoting and accepting the answer if it addresses your question. – Sayan Bhattacharya Jul 11 '22 at 04:23
  • The answer is not complete until we know how to take the header values using an Apache Beam pipeline without processing the whole JSONL file. – Ash Jul 14 '22 at 07:34
  • The Apache Beam documentation talks about SDF (splittable DoFn), but it never explains how to process just the first line from a JSONL file and then stop. – Ash Jul 14 '22 at 07:39
  • @Ash You can't; that's the whole point. As of now there is no generic way of reading only the first line of a JSON file and using that for the rest of the lines. That's why `Schemas` are being used. `withHeader` works better if you already know the headers. As of now, either you create `Schemas` for every JSON or you create your own logic to get the headers and then pass them into the pipeline. – Sayan Bhattacharya Jul 14 '22 at 07:49
  • Have you tried my answer's `TextIO` Approach 2? I am just curious about how it performs. – Sayan Bhattacharya Jul 14 '22 at 07:56
  • Sayan, I have used Approach 1, where the header values are read using a stream, which means not reading all the lines. For reference, I used `BufferedReader br = new BufferedReader(Channels.newReader(readChannel, "UTF-8")); Optional<String> optional = br.lines().findAny();`. Thanks for your help, I'll accept the answer. Thanks again. – Ash Aug 04 '22 at 10:39