
My boss wants me to join three tables (let's call them tableA, tableB and tableC), which have 74M, 3M and 75M rows respectively.

In case it's useful, the query looks like this:

SELECT A.*, 
   C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
    ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
    ON B."field_join_BC" = C."field_join_BC"

When I run the query from Dataiku Data Science Studio against Vertica, it seems to create temporary data in order to produce the output, which fills up the 1 TB of space on the server.

My boss doesn't know much about SQL, so he doesn't understand that in the worst case scenario (every key matching every row), the query can produce a table with 74M * 3M * 75M ≈ 1.7*10^22 rows, which may well be the problem here (and I'm brand new and don't know the data yet, so I can't tell whether the query is likely to produce that many rows or not).

Therefore I would like to know whether there is a way of finding out beforehand how many rows will be produced, for instance by running a COUNT() such as this:

SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
    ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
    ON B."field_join_BC" = C."field_join_BC"

Does the underlying engine produce the whole dataset and then count it? (That would mean I can't count it beforehand, at least not that way.)

Or is it possible that the COUNT() gives me a result quickly, because it works out the number some other way instead of building the whole dataset?

(NB: I am currently testing this, but the count has been running for 35 minutes now.)
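One cheaper way to bound the output size without running the full join is to aggregate per-key multiplicities first: for an inner join, the output row count equals the sum, over matching key values, of the product of each side's per-key counts. A sketch for the first join, using the column names from the query above:

```sql
-- Estimate the row count of tableA JOIN tableB without materializing it:
-- count how many times each key appears on each side, then sum the products.
SELECT SUM(a.cnt * b.cnt) AS estimated_rows_AB
FROM (SELECT "field_join_AB", COUNT(*) AS cnt
      FROM "tableA"
      GROUP BY "field_join_AB") a
INNER JOIN (SELECT "field_join_AB", COUNT(*) AS cnt
            FROM "tableB"
            GROUP BY "field_join_AB") b
    ON a."field_join_AB" = b."field_join_AB";
```

Each GROUP BY result is at most as large as the number of distinct keys, so this avoids ever building the joined dataset; the same trick can then be applied between tableB and tableC on "field_join_BC" to bound the second join.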

François M.
  • "Does the underlying engine produce the whole dataset and then count it?" Maybe: it depends on what plan the engine creates. SQL is a declarative language; you tell it what you want and the engine decides how to produce it. If there are indices on the join column it may not have to scan all of the data. – D Stanley Oct 20 '16 at 16:29
  • It's highly unlikely that there are indices (and I can't really check; I don't have direct access to the Vertica UI...). If we assume there are no indices, does it have to scan the data for sure? – François M. Oct 20 '16 at 16:47
  • Well, yes, because it has to find all of the matching rows in each table. – D Stanley Oct 20 '16 at 16:53
  • That's what I thought. Alright, thank you. – François M. Oct 20 '16 at 17:02

1 Answer


Vertica is a columnar database. Any query you run only needs to read the columns required to resolve the output, joins, predicates, etc.

Vertica is also able to query encoded data in many cases, avoiding full materialization until it is actually needed.

Counts like that can be very fast in Vertica. You don't really need to jump through hoops; Vertica will only read the columns that are actually used. The optimizer won't try to reconstitute the entire row, only the columns it needs.

What's probably happening here is a hash join with rebroadcasting. If your underlying projections don't line up, their sort orders differ, and you are joining multiple large tables together, then the join itself can be expensive: Vertica has to load the data into a hash table and do a lot of network rebroadcasting so that the joins can happen on the initiator node.
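As a first check (a sketch; v_catalog.projections is Vertica's catalog view, and the table names are taken from the question), you can look at which projections exist for the three tables and whether their designs could line up for the join:

```sql
-- List the projections anchored on the three tables involved in the join.
-- Default "super" projections created at load time are rarely sorted or
-- segmented in a way that favors a specific join.
SELECT projection_name, anchor_table_name, is_super_projection
FROM v_catalog.projections
WHERE anchor_table_name IN ('tableA', 'tableB', 'tableC');
```

If all you see are auto-created super projections, that supports the diagnosis above.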

I would consider running Database Designer (DBD) using these queries as input, especially if they are common query patterns. If you haven't run DBD at all yet and are not using custom projections, then your default projections will likely not perform well and will cause the situation I mention above.

You can do an explain to see what's going on.
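For example, with the question's query (look in the plan for HASH JOIN versus MERGE JOIN steps, and for BROADCAST or RESEGMENT network operations, which indicate data being shipped between nodes):

```sql
EXPLAIN
SELECT A.*, C."needed_field"
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";
```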

woot