I have noticed that every destination in Google Dataprep (whether manual or scheduled) spins up its own Compute Engine instance. The quota limit for a normal account is 8 instances.
Take a look at this flow: dataprep flow
Since data wrangling is composed of multiple layers, and you might want to materialize intermediate steps with exports, what is the best approach/architecture for running Dataprep flows?
Option A
Run 2 separate flows and schedule them 15 minutes apart:
- the first flow exports only the final step
- the other flow exports the intermediate steps only
This way you're not hitting the quota limit, but you're still calculating the early stages of the same flow multiple times.
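If the two flows were triggered from a script rather than the built-in scheduler, Option A could look roughly like the sketch below. This is only an assumption-heavy illustration: it presumes the Cloud Dataprep REST API's `/v4/jobGroups` endpoint and a valid access token, and the recipe IDs are placeholders, not real ones from my flow.

```python
# Sketch of Option A as a script: run the "final step" flow, wait ~15 minutes,
# then run the "intermediate steps" flow so both never compete for the
# 8-instance Compute Engine quota at the same time.
# Assumes the Cloud Dataprep REST API (v4 jobGroups) and a valid access token;
# the recipe IDs below are placeholders.
import time
import requests

DATAPREP_API = "https://api.clouddataprep.com/v4/jobGroups"
TOKEN = "<dataprep-access-token>"   # placeholder
FINAL_RECIPE_ID = 1549              # placeholder: flow exporting only the final step
INTERMEDIATE_RECIPE_ID = 5912       # placeholder: flow exporting intermediate steps

def run_flow(recipe_id):
    """Kick off a Dataprep job group for the given recipe/wrangled dataset."""
    resp = requests.post(
        DATAPREP_API,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"wrangledDataset": {"id": recipe_id}},
    )
    resp.raise_for_status()
    return resp.json()

run_flow(FINAL_RECIPE_ID)
time.sleep(15 * 60)   # the 15-minute discrepancy from Option A
run_flow(INTERMEDIATE_RECIPE_ID)
```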
Option B
Leave the flow as it is and request a higher Compute Engine quota: the computational effort is the same, I will just have more instances running in parallel instead of sequentially.
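Before filing the quota request, the current usage and limits can be read programmatically from the Compute Engine API. A minimal sketch with google-api-python-client, assuming application-default credentials; the project ID and region are placeholders:

```python
# Minimal sketch: list current Compute Engine quotas (usage vs. limit)
# to see how close the flow already gets to the 8-instance cap.
# "my-project" and "us-central1" are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

# Region-level quotas (e.g. INSTANCES, CPUS) live on the region resource.
region = compute.regions().get(project="my-project", region="us-central1").execute()
for quota in region.get("quotas", []):
    print(f'{quota["metric"]}: {quota["usage"]}/{quota["limit"]}')

# Project-level quotas live on the project resource.
project = compute.projects().get(project="my-project").execute()
for quota in project.get("quotas", []):
    print(f'{quota["metric"]}: {quota["usage"]}/{quota["limit"]}')
```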
Option C
Give each step its own flow and create a reference dataset: this way each flow runs only a single step.
E.g. when I run the job "1549_first_repo", I no longer recalculate the 3 previous steps but only the last one: the transformations between the referenced "5912_first" table and "1549_first_repo".
This last option seems the most reasonable to me, as each transformation runs at most once. Am I missing something?
Also, is there a way to run each export sequentially instead of in parallel?
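As far as I can tell there is no "run sequentially" switch in the Dataprep UI, but if the jobs are triggered via the API, each flow could be started only after the previous job group reports completion. A hedged sketch, reusing the same assumed `/v4/jobGroups` endpoint and placeholder token/recipe IDs as above; the exact status values should be checked against the current API docs:

```python
# Sketch: chain exports sequentially by polling each jobGroup until it finishes
# before kicking off the next one. Assumes the Dataprep /v4/jobGroups API;
# token, recipe IDs and status strings ("Complete", "Failed") are assumptions.
import time
import requests

DATAPREP_API = "https://api.clouddataprep.com/v4/jobGroups"
TOKEN = "<dataprep-access-token>"  # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def run_and_wait(recipe_id, poll_seconds=60):
    """Start a jobGroup for the given recipe and block until it finishes."""
    resp = requests.post(DATAPREP_API, headers=HEADERS,
                         json={"wrangledDataset": {"id": recipe_id}})
    resp.raise_for_status()
    job_group_id = resp.json()["id"]

    while True:
        status = requests.get(f"{DATAPREP_API}/{job_group_id}",
                              headers=HEADERS).json().get("status")
        if status in ("Complete", "Failed", "Canceled"):
            return status
        time.sleep(poll_seconds)

# Run the exports one after another instead of in parallel.
for recipe_id in (5912, 1549):  # placeholder recipe IDs
    result = run_and_wait(recipe_id)
    print(f"recipe {recipe_id} finished with status {result}")
```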
-- EDIT 30 May --
It turns out option C is not the way to go, as "referencing" is a pure continuation of the previous flow: you can think of the flow before the referenced dataset and the flow after it as a single flow.
I'm still trying to figure out how to achieve modularity without redundantly recalculating the same operations.