
I write some data in the Parquet format using Spark SQL where the resulting schema looks like the following:

root
|-- stateLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- countryLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- global: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)

I can also transform the same data into a more flat schema that looks like this:

root
|-- stateLevelCount1: integer (nullable = false)
|-- stateLevelCount2: integer (nullable = false)
|-- stateLevelCount3: integer (nullable = false)
|-- stateLevelCount4: integer (nullable = false)
|-- stateLevelCount5: integer (nullable = false)
|-- countryLevelCount1: integer (nullable = false)
|-- countryLevelCount2: integer (nullable = false)
|-- countryLevelCount3: integer (nullable = false)
|-- countryLevelCount4: integer (nullable = false)
|-- countryLevelCount5: integer (nullable = false)
|-- globalCount1: integer (nullable = false)
|-- globalCount2: integer (nullable = false)
|-- globalCount3: integer (nullable = false)
|-- globalCount4: integer (nullable = false)
|-- globalCount5: integer (nullable = false)

Now when I run a query on the first data set against a column like global.count1, it takes much longer than querying globalCount1 in the second data set. Conversely, writing the first data set to Parquet takes much less time than writing the second. I know that Parquet stores my data in a columnar fashion, and I expected each nested column to be stored together individually. In the first data set, however, it seems that the whole 'global' column is being stored together, rather than the 'global.count1', 'global.count2', etc. values each being stored together. Is this expected behavior?

Emre Colak

1 Answer


Interesting. Regarding "it takes a lot longer than querying..": can you please share how much longer? Thanks.

Looking at the code (https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/RecordReaderImplementation.java#L248), it seems that reading from nested structures might carry some record-assembly overhead. Just from reading the Parquet code, though, it shouldn't be "a lot longer".

I think the bigger problem is whether Spark can push down predicates in such cases; for example, it may not be able to use column statistics or bloom filters on nested fields. Can you please share how you query the data in both cases, along with the timings? Also, which versions of Spark, Parquet, Hadoop, etc. are you using?

Parquet 1.5 had an issue, https://issues.apache.org/jira/browse/PARQUET-61, which in some such cases could cause a 4-5x slowdown.

Tagar