I have noticed that every destination in Google Dataprep (whether manual or scheduled) spins up its own Compute Engine instance. The quota limit for a normal account is 8 instances.
Take a look at this flow: dataprep flow
Since data wrangling is composed of multiple layers, and you might want to materialize intermediate steps with exports, what is the best approach/architecture for running Dataprep flows?
Option A
Run 2 separate flows and schedule them 15 minutes apart:
- the first flow exports only the final step
- the other flow exports the intermediate steps only
This way you're not hitting the quota limit, but you're still calculating the early stages of the same flow multiple times.
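If the two flows were triggered from a script rather than the built-in scheduler, Option A could look roughly like the sketch below. This is only an assumption-heavy illustration: it presumes the Cloud Dataprep REST API's `/v4/jobGroups` endpoint and a valid access token, and the recipe IDs are placeholders, not real ones from my flow.

```python
# Sketch of Option A as a script: run the "final step" flow, wait ~15 minutes,
# then run the "intermediate steps" flow so both never compete for the
# 8-instance Compute Engine quota at the same time.
# Assumes the Cloud Dataprep REST API (v4 jobGroups) and a valid access token;
# the recipe IDs below are placeholders.
import time
import requests

DATAPREP_API = "https://api.clouddataprep.com/v4/jobGroups"
TOKEN = "<dataprep-access-token>"   # placeholder
FINAL_RECIPE_ID = 1549              # placeholder: flow exporting only the final step
INTERMEDIATE_RECIPE_ID = 5912       # placeholder: flow exporting intermediate steps

def run_flow(recipe_id):
    """Kick off a Dataprep job group for the given recipe/wrangled dataset."""
    resp = requests.post(
        DATAPREP_API,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"wrangledDataset": {"id": recipe_id}},
    )
    resp.raise_for_status()
    return resp.json()

run_flow(FINAL_RECIPE_ID)
time.sleep(15 * 60)   # the 15-minute discrepancy from Option A
run_flow(INTERMEDIATE_RECIPE_ID)
```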
Option B
Leave the flow as it is and request a higher Compute Engine quota: the computational effort is the same, I will just have more instances running in parallel instead of sequentially.
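Before filing the quota request, the current usage and limits can be read programmatically from the Compute Engine API. A minimal sketch with google-api-python-client, assuming application-default credentials; the project ID and region are placeholders:

```python
# Minimal sketch: list current Compute Engine quotas (usage vs. limit)
# to see how close the flow already gets to the 8-instance cap.
# "my-project" and "us-central1" are placeholders.
from googleapiclient import discovery

compute = discovery.build("compute", "v1")

# Region-level quotas (e.g. INSTANCES, CPUS) live on the region resource.
region = compute.regions().get(project="my-project", region="us-central1").execute()
for quota in region.get("quotas", []):
    print(f'{quota["metric"]}: {quota["usage"]}/{quota["limit"]}')

# Project-level quotas live on the project resource.
project = compute.projects().get(project="my-project").execute()
for quota in project.get("quotas", []):
    print(f'{quota["metric"]}: {quota["usage"]}/{quota["limit"]}')
```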
Option C
Give each step its own flow and create a reference dataset: this way each flow runs only a single step.
E.g. when I run the job "1549_first_repo", I no longer recalculate the 3 previous steps but only the last one: the transformations between the referenced "5912_first" table and "1549_first_repo".
This last option seems the most reasonable to me, as each transformation runs at most once. Am I missing something?
Also, is there a way to run each export sequentially instead of in parallel?
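As far as I can tell there is no "run sequentially" switch in the Dataprep UI, but if the jobs are triggered via the API, each flow could be started only after the previous job group reports completion. A hedged sketch, reusing the same assumed `/v4/jobGroups` endpoint and placeholder token/recipe IDs as above; the exact status values should be checked against the current API docs:

```python
# Sketch: chain exports sequentially by polling each jobGroup until it finishes
# before kicking off the next one. Assumes the Dataprep /v4/jobGroups API;
# token, recipe IDs and status strings ("Complete", "Failed") are assumptions.
import time
import requests

DATAPREP_API = "https://api.clouddataprep.com/v4/jobGroups"
TOKEN = "<dataprep-access-token>"  # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def run_and_wait(recipe_id, poll_seconds=60):
    """Start a jobGroup for the given recipe and block until it finishes."""
    resp = requests.post(DATAPREP_API, headers=HEADERS,
                         json={"wrangledDataset": {"id": recipe_id}})
    resp.raise_for_status()
    job_group_id = resp.json()["id"]

    while True:
        status = requests.get(f"{DATAPREP_API}/{job_group_id}",
                              headers=HEADERS).json().get("status")
        if status in ("Complete", "Failed", "Canceled"):
            return status
        time.sleep(poll_seconds)

# Run the exports one after another instead of in parallel.
for recipe_id in (5912, 1549):  # placeholder recipe IDs
    result = run_and_wait(recipe_id)
    print(f"recipe {recipe_id} finished with status {result}")
```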
-- EDIT 30 May --
It turns out option C is not the way to go, as "referencing" is a pure continuation of the previous flow: you can think of the flow before the referenced dataset and the flow after it as a single flow.
I'm still trying to figure out how to achieve modularity without redundantly recalculating the same operations.