We have data for a table residing in Azure Blob Storage, which acts as our data lake. Data is ingested every 30 minutes, forming time partitions as below (in UTC):
<Container>/<TableName>/y=2020/m=02/d=20/h=01/min=00
<Container>/<TableName>/y=2020/m=02/d=20/h=01/min=30
<Container>/<TableName>/y=2020/m=02/d=20/h=02/min=00 and so on.
The file format used to capture the data is ORC, and the data files within each time partition are of equal size.
Our use case is to capture one day's data in IST using Spark (v2.3) for processing. Since the data resides in UTC and the use case is to process it in IST (UTC+5:30), a total of 48 time partitions is needed, from /h=18/min=30 of one UTC day to /h=18/min=00 of the next. We have two options.
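For reference, the 48 UTC partition paths covering a given IST day can be generated programmatically. The sketch below is a hypothetical helper (the function name and arguments are our own, not part of any API) that assumes the exact path layout shown above:

```python
from datetime import datetime, timedelta


def ist_day_partitions(container, table, ist_date):
    """Build the 48 half-hour UTC partition paths covering one IST day.

    An IST day [00:00, 24:00) corresponds to [18:30 of the previous
    UTC day, 18:00 of the same UTC day], inclusive on both ends.
    """
    # Midnight IST expressed in UTC: subtract the +5:30 offset.
    start = datetime(ist_date.year, ist_date.month, ist_date.day) - timedelta(hours=5, minutes=30)
    paths = []
    for i in range(48):
        t = start + timedelta(minutes=30 * i)
        paths.append(
            f"{container}/{table}/y={t.year:04d}/m={t.month:02d}/"
            f"d={t.day:02d}/h={t.hour:02d}/min={t.minute:02d}"
        )
    return paths
```

For the IST day 2020-02-21, the first generated path ends in /d=20/h=18/min=30 and the last in /d=21/h=18/min=00, matching the range described above.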
Option 1: Create a data frame for each time partition and union them:
df1 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=20/h=18/min=30)
df2 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=20/h=19/min=00)
..
df48 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=21/h=18/min=00)
df = df1.union(df2)
df = df.union(df3)
..
df = df.union(df48)
Doing so for 48 partitions will yield an entire day's data in df.
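The 48 explicit statements can be collapsed into a fold; note this only shortens the code, since it builds the same left-deep chain of unions and does not change what the analyzer sees. A minimal sketch, demonstrated here with plain Python sets as stand-ins for Spark DataFrames (both expose a `.union` method):

```python
from functools import reduce


def union_all(dfs):
    """Fold a non-empty list of DataFrames into one via pairwise union.

    Works on any objects exposing .union(other), e.g. PySpark
    DataFrames or, as in this sketch, plain Python sets.
    """
    return reduce(lambda a, b: a.union(b), dfs)
```

With Spark this would be called as `union_all([df1, df2, ..., df48])` on data frames of identical schema.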
Option 2: Read data at the day level and apply a filter condition on hour and minute:
df1 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=20/).filter(h >= 19 OR (h = 18 AND min = 30))
df2 = SparkRead(<Container>/<TableName>/y=2020/m=02/d=21/).filter(h <= 17 OR (h = 18 AND min = 00))
df = df1.union(df2)
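To make the boundary conditions explicit, the two filter expressions can be written out as plain predicates (hypothetical helpers shown outside Spark): the first UTC day keeps 18:30 onwards, the second keeps everything up to and including 18:00.

```python
def keep_first_day(h, m):
    """Mirror of the Option 2 filter on the earlier UTC day:
    keep partitions from 18:30 onwards (inclusive)."""
    return h >= 19 or (h == 18 and m == 30)


def keep_second_day(h, m):
    """Mirror of the Option 2 filter on the later UTC day:
    keep partitions up to and including 18:00."""
    return h <= 17 or (h == 18 and m == 0)
```

Together the two predicates select exactly the 48 half-hour partitions of one IST day, with no overlap at the 18:00/18:30 boundary.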
Once the data is loaded into memory, the processing time is the same in both cases (~5 minutes); loading the data is the bottleneck. Option 1 takes 30 minutes to load into memory, while Option 2 takes 2 minutes.
In a few blogs, we saw that the Analyzer scans through the entire previous data frame's plan every time union is called. Thus for 47 unions, it scans 1 + 2 + 3 + ... + 47 = 1,128 plans. Is this the reason for the (seemingly quadratic) performance degradation? What does the Analyzer do, and can it be turned off? Are there any suggestions or best practices for making the read function generic for time-partitioned data on a file store?