
I have two external tables created in spark-sql. One has a file format of parquet and the other has a file format of textfile.
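The tables were created roughly like this (illustrative DDL only; the exact schema is inferred from the plans below, the textfile table's location is a hypothetical path, and both tables are partitioned by `Country`, as the partition filters in the plans show):

```sql
CREATE EXTERNAL TABLE test_p (
  Address STRING, Age STRING, CustomerID STRING, CustomerName STRING,
  CustomerSuffix STRING, Location STRING, Mobile STRING,
  Occupation STRING, Salary STRING
)
PARTITIONED BY (Country STRING)
STORED AS PARQUET
LOCATION 'file:/C:/dev/tests2';            -- location taken from the physical plan

CREATE EXTERNAL TABLE test_p3 (
  Address STRING, Age STRING, CustomerID STRING, CustomerName STRING,
  CustomerSuffix STRING, Location STRING, Mobile STRING,
  Occupation STRING, Salary STRING
)
PARTITIONED BY (Country STRING)
STORED AS TEXTFILE                         -- uses LazySimpleSerDe by default
LOCATION 'file:/C:/dev/tests3';            -- hypothetical path
```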

When we extract the query plans for these two tables, Spark treats them differently.

The query plan for the parquet table is:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...

The query plan for the textfile table is:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p3`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p3
      +- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9]

== Optimized Logical Plan ==
Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=1134.0 B, rowCount=3, hints=none)
+- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], Statistics(sizeInBytes=9.6 KB, rowCount=128, hints=none)

== Physical Plan ==
HiveTableScan [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], [isnotnull(country#9), (country#9 = Korea)]

Why the difference?

Spark version: 2.2.1

Rajat Mishra

1 Answer


Logically, Spark does not treat them differently: both queries parse and analyze to the same shape of plan.

But the two formats have different internal representations. Parquet is a columnar format with rich metadata, so Spark can apply optimizations such as partition pruning that are not available through the generic Hive SerDe path used for the textfile table. Concretely, in Spark 2.x Hive tables stored as Parquet are converted to Spark's native data source relation (controlled by `spark.sql.hive.convertMetastoreParquet`, true by default), which is why the parquet plan shows a `FileScan` with a `PrunedInMemoryFileIndex` and `PartitionFilters`, while the textfile table remains a `HiveTableRelation` read via `HiveTableScan`.
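One way to see this (a sketch, assuming Spark 2.2 and the table names from the question) is to toggle the conversion flag and compare the plans:

```sql
-- true (the default) converts Hive Parquet tables to Spark's native
-- data source relation; the plan then shows FileScan with partition pruning:
SET spark.sql.hive.convertMetastoreParquet=true;
EXPLAIN EXTENDED SELECT * FROM test_p WHERE country = 'Korea';

-- With it disabled, the Parquet table is read through HiveTableScan,
-- just like the textfile table:
SET spark.sql.hive.convertMetastoreParquet=false;
EXPLAIN EXTENDED SELECT * FROM test_p WHERE country = 'Korea';
```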

thebluephantom