
My Spark application, running on AWS EMR, loads data from a JSON array stored in S3. The DataFrame created from it is then processed by the Spark engine.

My source JSON data is spread across multiple S3 objects. I need to compact them into a single JSON array to reduce the number of S3 objects my Spark application reads. I tried `s3-dist-cp --groupBy`, but the result is concatenated JSON data which is not itself a valid JSON file, so I cannot create a DataFrame from it.

Here is a simplified example to illustrate this further.

Source data:

S3 Object Record1.json : {"Name" : "John", "City" : "London"}

S3 Object Record2.json : {"Name" : "Mary" , "City" : "Paris"}

s3-dist-cp --src s3://source/ --dest s3://dest/ --groupBy='.*Record.*(\w+)'

Aggregated output:

{"Name" : "Mary" , "City" : "Paris"}{"Name" : "John", "City" : "London"}

What I need:

[{"Name" : "John", "City" : "London"},{"Name" : "Mary" , "City" : "Paris"}]

Application code, for reference:

import org.apache.spark.sql.types.{StringType, StructType}

val schema = new StructType()
  .add("Name", StringType, true)
  .add("City", StringType, true)

val df = spark.read.option("multiline", "true").schema(schema).json("test.json")
df.show()

Expected output:

+----+------+
|Name|  City|
+----+------+
|John|London|
|Mary| Paris|
+----+------+

Is s3-dist-cp the right tool for my need? Are there other suggestions for aggregating JSON data so that a Spark application can load it as a DataFrame?

• Could you solve it? I'm having the same issue: lots of small JSON files which I need to read and transform in EMR. I have the data partitioned in S3, but there are lots of small files in each partition. Using `s3-dist-cp` to copy the whole S3 directory to HDFS doesn't finish, and reading directly from Spark crashes. Currently I'm iterating over the parent partition, which kind of works, but it is really inefficient. – Camilo Velasquez Jan 26 '21 at 04:25

1 Answer


Alternatively, you can use regexp_replace to turn the concatenated single-line string into one JSON object per line, then parse each object into a row before converting to a Dataset.

See this sample:

import org.apache.spark.sql.functions.{col, explode, from_json, regexp_replace, split}

// Insert a newline between adjacent objects ("}{" -> "}\n{"), split on it,
// and explode so each JSON object becomes its own row before parsing.
val df = spark.read.text("test.json")
  .withColumn("object", explode(split(regexp_replace(col("value"), "\\}\\s*\\{", "}\n{"), "\n")))
  .withColumn("json", from_json(col("object"), schema))
  .select("json.*")

df.show()
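
With the concatenated sample from the question saved as test.json (and the schema from the question in scope), df.show() should print both records. Row order follows the order of the concatenated objects, so it may differ from the expected output above:

+----+------+
|Name|  City|
+----+------+
|Mary| Paris|
|John|London|
+----+------+

Note that splitting on `}{` assumes that character sequence never appears inside a string value of your JSON; if it can, it is safer to compact the source objects into a proper JSON array when they are written.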

About regexp_replace: Pyspark replace strings in Spark dataframe column
