
I am writing SQL queries on a Spark cluster with 5 workers (8 cores and 32 GB memory each). No Hive metastore is attached to it. I found that query performance differs greatly between the table method and the view method, and I want to understand their mechanisms.

Following are my experiments:

(a) load data from parquet files (in HDFS) and create tables:

val df = spark.read.parquet(PATH)
df.write.mode("overwrite").saveAsTable("xxx")

and

(b) load data from parquet files (in HDFS) and create temporary views:

val df = spark.read.parquet(PATH)
df.createOrReplaceTempView("xxx")

I tested with TPC-H at scale factor 100. For each case I generate tables/views for all datasets and execute an example query (Q17):

select avg(l_extendedprice) / 7.0 as avg_yearly
from lineitem, part
where p_partkey = l_partkey
  and p_brand = 'Brand#11'
  and p_container = 'SM CAN'
  and l_quantity < (
    select 0.2 * avg(l_quantity)
    from lineitem
    where l_partkey = p_partkey
  );

The performances are:

For (a), table generation takes about 3 minutes, and the query finishes within 1 second.

For (b), I know the views are lazily evaluated; the total processing time is about 50 seconds.

For the physical plans, all are the same except for the data loading (Location) part:

for (a):

(1) Scan parquet default.lineitem
Output [3]: [l_partkey#26852L, l_quantity#26855, l_extendedprice#26856]
Batched: true
Location: InMemoryFileIndex [file:/xxx/spark-warehouse/lineitem]
PushedFilters: [IsNotNull(l_partkey), IsNotNull(l_quantity)]
ReadSchema: struct<l_partkey:bigint,l_quantity:double,l_extendedprice:double>

and for (b):

(1) Scan parquet 
Output [3]: [l_partkey#17L, l_quantity#20, l_extendedprice#21]
Batched: true
Location: InMemoryFileIndex [hdfs://master:9000/tpch100_parquet/lineitem]
PushedFilters: [IsNotNull(l_partkey), IsNotNull(l_quantity)]
ReadSchema: struct<l_partkey:bigint,l_quantity:double,l_extendedprice:double>

It seems that with the table method Spark reads from local files on the master machine, but how can it be so fast?

My questions are:

(1) What is the internal representation of a table in Spark's core? As I understand it, a table is a dataset on disk plus in-memory metadata (when Hive is not enabled), so what magic makes querying so fast after the tables are created? I tried caching the views, but the performance was as slow as in (b).
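For reference, this is roughly how I cached a view (a sketch; the path and view name match my setup above). Note that `df.cache()` alone is lazy and only takes effect on the first action, whereas the `CACHE TABLE` statement is eager by default:

```scala
// Register the view, then materialize it in executor memory.
val df = spark.read.parquet("hdfs://master:9000/tpch100_parquet/lineitem")
df.createOrReplaceTempView("lineitem")

// Eagerly scans the data and stores it in the in-memory columnar cache;
// subsequent queries against the view read from that cache.
spark.sql("CACHE TABLE lineitem")
```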

(2) Without Hive, if I restart the Spark session, must I recreate the tables from the parquet files (paying the generation time again) because the in-memory metadata disappears? Or is there a way to quickly reload the tables?

Update:

For those who are interested or facing a similar problem: I found that .saveAsTable() did not actually work, because running spark.sql("select * from table").show returned an empty table... So this turned out to be a data loading problem rather than an execution performance problem.

The correct way for me is to specify the path explicitly:

df.write.mode("overwrite").option("path", "hdfs://master:9000/PATH/table_name").saveAsTable(x._1)

Without specifying the path, the table is not generated properly. Done this way, querying from tables and querying from views have almost the same performance.

My final question now becomes how to persist the created tables for the next time I start a Spark session. In spark-shell I can directly run spark.sql("select * from former_table") without recreating the table, but with spark-submit it does not work.
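One sketch of a workaround, assuming the parquet files written by saveAsTable are still at their HDFS path (the table name and path below are placeholders): re-register the existing files as an external table at the start of each job. This is a metadata-only statement, so it does not rewrite the data and should be fast:

```scala
// Re-register existing parquet files as an external table.
// Metadata-only: the data at LOCATION is not copied or rewritten.
spark.sql("""
  CREATE TABLE IF NOT EXISTS lineitem
  USING parquet
  LOCATION 'hdfs://master:9000/PATH/lineitem'
""")
```

With the default in-memory catalog this registration is forgotten when the session ends, so it must be re-run per job; to make tables survive restarts without re-registering, a persistent metastore would be needed (e.g. building the session with SparkSession.builder.enableHiveSupport()).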

kqboy
  • How many parquet files do you have at PATH? – partlov Apr 28 '23 at 12:23
  • @partlov The parquet files are actually the "external table" generated by df.write.saveAsTable. Some big datasets can have up to 256+ or 500+ parquet files. – kqboy Apr 28 '23 at 12:50

0 Answers