-1

I want to optimize a query in vertica database. I have table like this

CREATE TABLE data (a INT, b INT, c INT);

and a lot of rows in it (billions)

I fetch some data using whis query

SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)

but it runs slow. The query plan shows something like this

[Cost: 3M, Rows: 3B (NO STATISTICS)]

The same is shown when I perform explain on

SELECT b, c FROM data WHERE a = 1 AND b = 1

It looks like scan on some part of table. In other databases I can create an index to make such query realy fast, but what can I do in vertica?

ablazer
  • 1
  • 1

2 Answers2

4

Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.

I would recommend reviewing projection concepts in the documentation.

If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.

For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.

Kermit
  • 33,827
  • 13
  • 85
  • 121
  • well, as far as I understand from docs, projections are for data aggregation (GROUP BY generally). But what type of projection can help me to fetch exact values? explain SELECT b, c FROM data WHERE a = 1 AND b = 1 gives me something like [Cost: 308K, Rows: 172K] (PATH ID: 1) with Filter by a and b. I mentioned indexes because they can find value in similar situations by constant time. And in case of vertica my query time will grow while my data grows? – ablazer Jul 17 '15 at 11:15
  • @ablazer Projections are not for data aggregation; although Live Aggregate Projections allow projections to store aggregated data. Again, Vertica has no concept of indexes. You should try to run your query through Database Designer. – Kermit Jul 17 '15 at 18:55
0

Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:

1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.

2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.

3) PARTITIONING: By partitioning your data with a column which is frequently used in the queries, such as transaction_date, etc, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing to only affect one small ROS container, instead of the entire file.

Here is an image which can help illustrate how these concepts work together.

Vertica Projection Design Components

Pang
  • 9,564
  • 146
  • 81
  • 122
Eli Reiman
  • 164
  • 2
  • 11