
I have a Hive external table (S3 files stored in Parquet format) created with Spark, about 30 GB in size and with a few hundred partitions. However, I need to query the data on a non-partition column (say SUPPLIER_ID) to see the complete transaction history, not restricted to a specific period or date (the partition columns). How can I make this query pattern efficient on a Hive table when I am not sure which partitions the data belongs to?
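For concreteness, the query pattern looks roughly like this (a minimal sketch in Spark/Scala; the table name, partition column, and supplier value are illustrative, only SUPPLIER_ID comes from my actual schema):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("supplier-history")
      .enableHiveSupport()   // read the Hive external table's metadata
      .getOrCreate()

    // The table is partitioned by a date column (illustrative: txn_date),
    // but the filter is on the non-partition column SUPPLIER_ID, so
    // partition pruning cannot narrow the scan.
    val history = spark.sql(
      """SELECT *
        |FROM transactions
        |WHERE supplier_id = 'S-12345'
        |""".stripMargin)

    history.show()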


1 Answer


Quoting from: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html

When filtering query results, a consumer of the parquet-mr API (for example, Hive or Spark) can fetch all records from the API and then evaluate each record against the predicates of the filtering condition. However, this requires assembling all records in memory, even non-matching ones. With predicate pushdown, these conditions are passed to the parquet-mr library instead, which evaluates the predicates on a lower level and discards non-matching records without assembling them first.

For example, when evaluating the record {title: "The Starry Night", width: 92, height: 74} against the condition height > 80, it is not necessary to assemble the whole record, because it can be discarded solely based on its height attribute. However, while the condition height > width does not match the record either, predicate pushdown cannot be used in this case, because we need multiple fields of the same record to evaluate the predicate.

Additionally, predicate pushdown also allows discarding whole row groups that cannot contain any matches based on their min/max statistics. For example, if the statistics of a row group include {min: 62, max: 78} for the height column and the filtering condition is height > 80, then none of the records in that row group can match, thus the whole row group can be discarded.
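The distinction above is visible in a Spark physical plan. A minimal sketch (the paintings DataFrame and the /tmp path are illustrative): the single-column predicate shows up under PushedFilters, while the column-to-column comparison does not.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("pushdown-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data shaped like the record in the quoted example.
    Seq(("The Starry Night", 92, 74), ("Irises", 71, 93))
      .toDF("title", "width", "height")
      .write.mode("overwrite").parquet("/tmp/paintings")

    val paintings = spark.read.parquet("/tmp/paintings")

    // Single-column predicate: pushed to the Parquet reader; the plan
    // shows PushedFilters: [IsNotNull(height), GreaterThan(height,80)].
    paintings.filter(col("height") > 80).explain()

    // Column-to-column predicate: cannot be pushed down; Spark evaluates
    // it only after the rows have been assembled.
    paintings.filter(col("height") > col("width")).explain()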

Just write the query; Parquet has a few smarts, as listed above. It can skip whole row groups and, thanks to columnar storage, evaluate the predicate by reading only the filtered column, which is far faster than reading and deserializing complete rows.
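You can check that this is actually happening by inspecting the physical plan; a minimal sketch below, using the same illustrative names as in the question. Note that row-group skipping on SUPPLIER_ID only pays off when the per-row-group min/max statistics for that column are tight, which is why rewriting the data sorted by it (hypothetical table name transactions_sorted) can help even without a matching partition filter.

    import org.apache.spark.sql.functions.col

    // Parquet filter pushdown is enabled by default in Spark; set it
    // explicitly here only for clarity.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    val txns = spark.table("transactions")   // illustrative table name

    // All partition directories are still listed (no partition filter),
    // but the predicate travels into the Parquet scan: the plan should show
    // PushedFilters: [IsNotNull(supplier_id), EqualTo(supplier_id,S-12345)].
    txns.filter(col("supplier_id") === "S-12345").explain()

    // Optional: writing the data sorted by supplier_id within each file
    // tightens the per-row-group min/max statistics, so far more row
    // groups can be skipped when filtering on supplier_id.
    txns.sortWithinPartitions("supplier_id")
      .write.mode("overwrite")
      .format("parquet")
      .saveAsTable("transactions_sorted")   // hypothetical table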

  • Thanks thebluephantom. It sounds like having Parquet files is enough for this situation, but I don't see better performance unless I include a partition column in the filter. Are there any benchmarks or references we could use to baseline data volume vs. expected query runtime? – vall Jun 09 '21 at 19:47
  • You would have to show the query, as I find that odd. – thebluephantom Jun 09 '21 at 19:50