-1

I have a query like

select count(distinct tab1.id)
from tab1 join tab2 on tab1.email = tab2.email_a

but when I change it to

select count(distinct tab1.id)
from tab1 join tab2 on tab1.email = tab2.email_a or tab1.email = tab2.email_b

then suddenly it's somehow wildly inefficient. I know I can write the query using two join statements, but what exactly is Vertica doing in the second statement to make it so horrible?

Hanmyo
  • 543
  • 4
  • 8
  • 18

3 Answers3

1

I would actually expect an OR predicate to perform worse, no matter what the DBMS, actually:

An optimised JOIN operation would - at least normally - rely on a physical design (indexes in other databases, projection design in Vertica) that could support this join - at least partly.

But this goes out the window as soon as you apply any expression on either of the join functions before comparing - and that includes CASTs, functions, math operations, and, for that matter, logical operations like OR.

I have not found, so far, any situation with operations on join operands before applying the comparison, where the risk of confusing the optimiser into choosing an even worse plan is not far too high.

Hence, I would expect the optimiser to take a less than optimal plan....

@Hanmyo - can you find a way to run an explain on the query you intend - once with, once without the OR in the predicate, so we can get the differences illustrated?

Cheers - Marco

marcothesane
  • 6,192
  • 1
  • 11
  • 21
0

or is a performance killer.

How does this work?

select count(tab1.id)
from tab1 
where exists (select 1 from tab2 where tab1.email = tab2.email_a) or 
      exists (select 1 from tab2 where tab1.email = tab2.email_b);

I am guessing that tab1.id is unique, so you don't need select distinct.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • This looks like a random answer. How can you suggest one thing or another without having seen neither explain plan nor projection definition? – mauro Apr 28 '18 at 03:25
  • @mauro . . . I have a pretty good idea of how databases work, although to be honest, I'm not 100% sure about how Vertica works in this case, because I have more limited experience with it. – Gordon Linoff Apr 28 '18 at 22:23
0

If you add the keyword "explain" ahead of each of your statements in turn then you'll see the query plan for each. The cost numbers shown show the relative complexity of executing each portion of the query. So in the portion where you see that the latter has greater costs then that'll give you an indication what's causing it to be slower.

If you the results say that the predicate is out of range then that means that you've not got any statistics gathered for the tables, that they're out of date since last time you updated them, or that simply the value you're looking for doesn't exist. Execute a statement like this to update the statistics: select analyze_statistics('tab1');

You can also update the statistics for a single column in a table, for example: select analyze_statistics('tab1.email');