My boss wants me to do a join on three tables, let's call them tableA
, tableB
, tableC
, which have respectively 74M, 3M and 75M rows.
In case it's useful, the query looks like this :
SELECT A.*,
C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
When trying the query on Dataiku Data Science Studio
+ Vertica
, it seems to create temporary data to produce the output, which fills up the 1T of space on the server, bloating it.
My boss doesn't know much about SQL, so he doesn't understand that in the worst case scenario, it can produce a table with 74M*3M*75M = 1.6*10^19 rows, possibly being the problem here (and I'm brand new and I don't know the data yet, so I don't know if the query is likely to produce that many rows or not).
Therefore I would like to know if I have a way of knowing beforehand how many rows will be produced : if I did a COUNT()
, such as this, for instance :
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
Does the underlying engine produces the whole dataset, and then counts it ? (which would mean I can't count it beforehand, at least not that way).
Or is it possible that a COUNT()
gives me a result ? (because it's not building the dataset but working it out some other way)
(NB : I am currently testing it, but the count has been running for 35mn now)