
We have data for a table residing in Azure Blob Storage, which acts as a data lake. Data is ingested every 30 minutes, forming time partitions as below (in UTC):

<Container>/<TableName>/y=2020/m=02/d=20/h=01/min=00
<Container>/<TableName>/y=2020/m=02/d=20/h=01/min=30
<Container>/<TableName>/y=2020/m=02/d=20/h=02/min=00

and so on.

The file format used to capture the data is ORC, and the partitions of data within a time partition are of equal size.

The use case is to capture data at the day level in IST using Spark (v2.3) for processing. Given that the data resides in UTC partitions and the use case is to process a day in IST (UTC+5:30), a total of 48 time partitions are needed, from /h=18/min=30 (previous day) to /h=18/min=00 (next day). There are two options:

Option 1 Create a data frame for each time partition and union them.

df1 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=20/h=18/min=30)
df2 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=20/h=19/min=00)
..
df48 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=21/h=18/min=00)

df = df1.union(df2)
df = df.union(df3)
..
df = df.union(df48)

Doing so for 48 partitions will yield an entire day's data in df.
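In concrete Spark terms, Option 1 looks roughly like the following sketch. The wasbs:// base path, storage account name and the hard-coded date are placeholders for illustration, not our actual configuration:

    // Sketch of Option 1: build the 48 half-hour partition paths and union the reads.
    // <Container>, <Account> and <TableName> are placeholders.
    import java.time.LocalDateTime

    val base = "wasbs://<Container>@<Account>.blob.core.windows.net/<TableName>"

    // 48 slots from 2020-02-20 18:30 UTC up to 2020-02-21 18:00 UTC
    val start = LocalDateTime.of(2020, 2, 20, 18, 30)
    val paths = (0 until 48).map { i =>
      val t = start.plusMinutes(30L * i)
      f"$base/y=${t.getYear}%04d/m=${t.getMonthValue}%02d/d=${t.getDayOfMonth}%02d/h=${t.getHour}%02d/min=${t.getMinute}%02d"
    }

    // One data frame per time partition, folded together with union (df1..df48 above)
    val df = paths.map(p => spark.read.orc(p)).reduce(_ union _)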

Option 2 Read data at the day level and apply a filter condition on the hour and minute.

df1 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=20/).filter(h>=19 or (h=18 and min=30))
df2 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=21/).filter(h<=17 or (h=18 and min=00))
df = df1.union(df2)
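A corresponding sketch of Option 2, assuming Spark's partition discovery picks up the Hive-style y/m/d/h/min directories as integer columns when basePath is set (the wasbs:// path is again a placeholder):

    // Sketch of Option 2: read at day level, keep y/m/d/h/min as partition columns
    // via basePath, and filter on h/min.
    import org.apache.spark.sql.functions.col

    val base = "wasbs://<Container>@<Account>.blob.core.windows.net/<TableName>"

    val prevDay = spark.read
      .option("basePath", base)                 // keep y/m/d/h/min as columns
      .orc(s"$base/y=2020/m=02/d=20")
      .filter(col("h") >= 19 || (col("h") === 18 && col("min") === 30))

    val nextDay = spark.read
      .option("basePath", base)
      .orc(s"$base/y=2020/m=02/d=21")
      .filter(col("h") <= 17 || (col("h") === 18 && col("min") === 0))

    val df = prevDay.union(nextDay)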

Once the data is loaded into memory, the time taken for processing is the same in both cases, i.e. ~5 minutes. The time taken to load the data is the bottleneck: Option 1 takes 30 minutes and Option 2 takes 2 minutes to load into memory.

In a few blogs we saw that the Analyser scans through all the preceding data frames every time union is called. Thus, for 48 unions, it scans 1 + 2 + 3 + .. + 47 = 1,128 times. Is this the reason for the severe performance degradation? What does the Analyser do, and can it be turned off? To make the read function generic for time-partitioned data on a file store, are there any suggestions or best practices to adopt?

Kedar

2 Answers


A union of data frames results in the Analyser going over all the preceding data frames. This is done primarily to infer the schema from the ORC files and to throw an error if there is a mismatch. What we observed was a heavy number of file operations during each union.

Option 1 Since there are > 200 file partitions in each time partition, the total number of passes made by the Analyser was 1 + 2 + .. + 47 = 1,128. Multiplied by 200, that gives 225,600 file open-analyse schema-close operations. This was the primary cause of Option 1 taking 30 minutes.

Option 2 Here the same operation was performed, but over only two large data frames: one for the previous day (18:30 to 23:30, 11 time partitions) and one for the next day (00:00 to 18:00, 37 time partitions). This resulted in 48 × 200 = 9,600 file open-analyse schema-close operations.

To mitigate this, we specified the schema explicitly rather than relying on Spark's schema inference mechanism. With the schema specified, both Option 1 and Option 2 completed in under 2 minutes.

Learning: Relying on Spark's schema inference mechanism is costly when a significant amount of union/merging of datasets is involved, primarily because of the large number of file operations. This could be an optimisation in Spark: avoid traversing a data frame again if its schema has already been inferred in a previous operation. Specify the schema explicitly to mitigate this.
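For reference, a minimal sketch of what supplying the schema looks like for the ORC reads above; the field names and types are placeholders for the table's actual columns:

    // Supplying the schema up front avoids opening files just to infer it.
    import org.apache.spark.sql.types._

    val tableSchema = StructType(Seq(
      StructField("event_id", StringType, nullable = true),
      StructField("event_time", TimestampType, nullable = true),
      StructField("payload", StringType, nullable = true)
      // ... remaining columns of the table
    ))

    val df = spark.read
      .schema(tableSchema)   // no schema inference pass over the ORC files
      .orc("<Container>/<TableName>/y=2020/m=02/d=20/h=18/min=30")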

Kedar

Wait a minute... don't the files have some kind of naming convention? I mean, the files are named essentially the same, except for the hours and minutes.

Like this: filter(h>=19 or (h=18 and min=30))

Just read the files using a wildcard and merge them all into a single data frame.

import org.apache.spark.sql.functions.input_file_name

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .option("sep", "|")
    .load("mnt/<Container>/<TableName>/y=2020/m=02/d=20/h*.gz")
    .withColumn("file_name", input_file_name())

If the schema is not in the file itself, or if it is incomplete for some reason, you can create it and override what is in the file.

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
    StructField("field1", StringType, true),
    StructField("field2", StringType, true),
    StructField("field3", StringType, true)
    // ... and so on for the remaining fields
))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .option("sep", "|")
    .schema(customSchema)
    .load("mnt/<Container>/<TableName>/y=2020/m=02/d=20/h*.gz")
    .withColumn("file_name", input_file_name())

Try that and see how you get along.

ASH