17

I can use the following code to read a single JSON file, but I need to read multiple JSON files and merge them into one DataFrame. How can I do this?

DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json");

Or is there a way to read multiple JSON files into a JavaRDD and then convert that to a DataFrame?

Abu Sulaiman

5 Answers

21

To read multiple inputs in Spark, use wildcards. That holds whether you're constructing a DataFrame or an RDD.

sqlContext.read().json("/home/spark/articles/*.json")
// or reading JSON from S3
sqlContext.read().json("s3n://bucket/articles/201510*/*.json")
tjriggs
  • Thank you, I will be using S3 eventually but I'm just testing locally right now. I already marked zero323's answer as correct, so I could only upvote you. – Abu Sulaiman Nov 14 '15 at 20:20
14

You can use exactly the same code to read multiple JSON files. Just pass a path to a directory or a path with wildcards instead of the path to a single file.

DataFrameReader also provides a json method with the following signature:

json(jsonRDD: JavaRDD[String])

which can be used to parse JSON already loaded into a JavaRDD.
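
For example, a minimal sketch (assuming an existing JavaSparkContext named sc alongside sqlContext; Spark 1.x API):

// load the raw text of every matching file; each line must be a complete JSON object
JavaRDD<String> lines = sc.textFile("/home/spark/articles/*.json");
// let the reader parse the JSON lines into a DataFrame
DataFrame articles = sqlContext.read().json(lines);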

zero323
  • I think the json method is deprecated; it would be better to use format("json") instead. What do you think? – eliasah Nov 14 '15 at 18:26
  • Oh, well that's good to know. Thank you! I guess I need to learn to read the documentation better before posting a question. – Abu Sulaiman Nov 14 '15 at 20:17
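
The format("json") variant mentioned in the comments would look roughly like this (a sketch against the Spark 1.x API; the path is illustrative):

DataFrame jsondf = sqlContext.read().format("json").load("/home/spark/articles/*.json");
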
9

The function spark.read.json accepts a list of files as a parameter.

df = spark.read.json(["file1.json", "file2.json", "file3.json"])

This will read all the files in the list and return a single DataFrame containing the data from all of them.
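
For example, one way to build such a list from a local folder (a sketch; the folder path is illustrative):

import glob

# glob only resolves local filesystem paths, not HDFS or S3
paths = glob.glob('/home/spark/articles/*.json')
df = spark.read.json(paths)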

Madhur Bhaiya
  • Picking up an old thread here, but can anyone comment on the partitionability/parallelization optimizations that could be made given >1 input file? I have a 500GB file that is made up of 10 or 12 ~50GB files and I'd much rather let the smaller files be if in fact that helps the process. – Buzz Moschetti Mar 26 '21 at 01:45
4

Using PySpark, if all the JSON files are in the same folder, you can use df = spark.read.json('folder_path'). This will load all the JSON files inside the folder.

For better read performance, I recommend providing the schema to the DataFrame reader:

import pyspark.sql.types as T

billing_schema = T.StructType([
  T.StructField('accountId', T.LongType(),True),
  T.StructField('accountName',T.StringType(),True),
  T.StructField('accountOwnerEmail',T.StringType(),True),
  T.StructField('additionalInfo',T.StringType(),True),
  T.StructField('chargesBilledSeparately',T.BooleanType(),True),
  T.StructField('consumedQuantity',T.DoubleType(),True),
  T.StructField('consumedService',T.StringType(),True),
  T.StructField('consumedServiceId',T.LongType(),True),
  T.StructField('cost',T.DoubleType(),True),
  T.StructField('costCenter',T.StringType(),True),
  T.StructField('date',T.StringType(),True),
  T.StructField('departmentId',T.LongType(),True),
  T.StructField('departmentName',T.StringType(),True),
  T.StructField('instanceId',T.StringType(),True),
  T.StructField('location',T.StringType(),True),
  T.StructField('meterCategory',T.StringType(),True),
  T.StructField('meterId',T.StringType(),True),
  T.StructField('meterName',T.StringType(),True),
  T.StructField('meterRegion',T.StringType(),True),
  T.StructField('meterSubCategory',T.StringType(),True),
  T.StructField('offerId',T.StringType(),True),
  T.StructField('partNumber',T.StringType(),True),
  T.StructField('product',T.StringType(),True),
  T.StructField('productId',T.LongType(),True),
  T.StructField('resourceGroup',T.StringType(),True),
  T.StructField('resourceGuid',T.StringType(),True),
  T.StructField('resourceLocation',T.StringType(),True),
  T.StructField('resourceLocationId',T.LongType(),True),
  T.StructField('resourceRate',T.DoubleType(),True),
  T.StructField('serviceAdministratorId',T.StringType(),True),
  T.StructField('serviceInfo1',T.StringType(),True),
  T.StructField('serviceInfo2',T.StringType(),True),
  T.StructField('serviceName',T.StringType(),True),
  T.StructField('serviceTier',T.StringType(),True),
  T.StructField('storeServiceIdentifier',T.StringType(),True),
  T.StructField('subscriptionGuid',T.StringType(),True),
  T.StructField('subscriptionId',T.LongType(),True),
  T.StructField('subscriptionName',T.StringType(),True),
  T.StructField('tags',T.StringType(),True),
  T.StructField('unitOfMeasure',T.StringType(),True)
])

billing_df = spark.read.json('/mnt/billingsources/raw-files/202106/', schema=billing_schema)
Camilo Soto
  • In the documentation (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html) I couldn't find the second schema parameter for loading JSON. Are you sure about it? – Prometheus Oct 14 '21 at 13:02
  • @Prometheus here is the documentation https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameReader.json.html?highlight=json#pyspark.sql.DataFrameReader.json – Camilo Soto Oct 14 '21 at 20:10
0

The function json(String... paths) takes variable arguments. (documentation)

So you can change your code like this:

sqlContext.read().json(file1, file2, ...)
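
For example (the file names are placeholders):

sqlContext.read().json("/home/spark/articles/article1.json", "/home/spark/articles/article2.json");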
dmigo