Vectorization in Hive is a feature (available since Hive 0.13.0) that, when enabled, reads a block of 1024 rows at a time instead of one row at a time. This reduces CPU usage for operations such as scans, filters, joins, and aggregations.
As far as I know, it is only available when the data is stored in ORC format, so this question is about ORC only, not other formats such as LazySimple ...

In most cases, enabling it with set hive.vectorized.execution.enabled = true; works well. But I have heard that in some particular cases, disabling it is better for both performance and cluster memory. Can anyone list those cases here? Thank you.

Our case

In our case, disabling hive.vectorized.execution.enabled made our Hive query 5 times faster and reduced memory usage to 3-5% of what the previous run used.

SET hive.vectorized.execution.enabled=false;

INSERT OVERWRITE TABLE target_table
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2
The Anh Nguyen
  • Heard where? Do you have references? – OneCricketeer Apr 19 '23 at 12:58
  • @OneCricketeer: After some experiments on our environment, I had a chat with ChatGPT as well. Let me share our hive query. – The Anh Nguyen Apr 19 '23 at 23:30
  • And what format is your table actually in? ORC or Parquet, where vectorization is most optimized? Or plaintext JSON/CSV, or binary Avro, where vectorization doesn't really help? More specifically, if you're using nested map/array fields, vectorization won't help, since predicate pushdowns are applied at the table columns... You could also compare the EXPLAIN plans of the query with and without vectorization – OneCricketeer Apr 20 '23 at 14:01
  • Yes. My tables are in ORCSerde. It met vectoring requirements, I guess. Thank you. Let me check the EXPLAIN. – The Anh Nguyen Apr 20 '23 at 23:46
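
The EXPLAIN comparison suggested in the comments could be sketched roughly as follows. This assumes Hive 2.3.0 or later, where the EXPLAIN VECTORIZATION clause is available; on older versions only plain EXPLAIN would work. The table and column names are the placeholders from the question, and only the SELECT part of the statement is inspected here.

```sql
-- Sketch only: assumes Hive 2.3.0+ (EXPLAIN VECTORIZATION) and the
-- placeholder table/column names used in the question.
SET hive.vectorized.execution.enabled=true;

-- DETAIL asks Hive to report, for each part of the plan, whether it
-- would run vectorized and which vectorized operators and expressions
-- it would use.
EXPLAIN VECTORIZATION DETAIL
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2;

-- For comparison, the plan with vectorization switched off.
SET hive.vectorized.execution.enabled=false;

EXPLAIN
SELECT
    source_table.field1["key1"],
    source_table.field2,
    unix_timestamp()
FROM
    source_table
WHERE
    source_table.field3 = "valueX"
    AND source_table.field4 = "valueY"
    AND source_table.field5 IS NULL
    AND source_table.field1["key1"] IS NOT NULL
GROUP BY
    source_table.field1["key1"],
    source_table.field2;
```

If the first plan shows the map or reduce work as not vectorized, or only partly vectorized, that would at least narrow down whether the map lookup on field1 or the GROUP BY stage is involved. It would not by itself explain the difference in memory usage.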
