
I have a sample (100 rows) and three steps in my recipe. When I run the job to load the data into a BigQuery table, it takes 6 minutes to create the table. That is too long for a simple process like the one I am testing. I am trying to understand whether there is a way to speed up the job: change some settings, increase the machine size, run the job at a specific time, etc.

BeeKay
  • Can you explain how you are loading the data, and provide some code if possible? Otherwise it is difficult to understand why it is taking so long. Could you also elaborate a little on the three steps mentioned, to better understand the context? – Rubén C. Jun 15 '18 at 08:02
  • Sure. After performing a few transformations on the dataset (in the flow), I run the job. Then I go to "Publishing Actions", where I add a publishing action, which lets me choose the location in BigQuery (the table) where the data will be loaded. Then I wait for the status to say "Complete". It typically takes between 5 minutes and 5 hours, depending on the job (number of transformations and datasets), for the job to complete. I am not sure why it takes so long. Is there a way to improve the completion time? – BeeKay Jun 18 '18 at 21:34
  • Can you [edit the post](https://stackoverflow.com/help/mcve) to make your details more readable, please? Dataprep is in beta, so you shouldn't expect it to be as fast as other products. I will try to reproduce what you do and provide more information. – Rubén C. Jun 19 '18 at 16:00
  • Well, the transformations happen in the flow, so I am not writing SQL; I will try to send a screenshot just to give you an idea. The whole process uses around 200+ transformations, and the longest any single one takes to complete is typically less than 2 minutes. The challenge is understanding why it takes so long for the entire job to succeed and for the final output to be available in a BigQuery table. – BeeKay Jun 20 '18 at 17:41

1 Answer


If you look in Google Cloud Platform -> Dataflow -> your Dataprep job, you will see a workflow diagram containing the computation steps and their computation times. For complex flows, you can use it to identify the operations that take longest and therefore what to improve.

For small jobs there is not much room for improvement, since setting up the environment alone takes about 4 minutes. On the right side you can see the "Elapsed time" (wall-clock time) and a time graph illustrating how long starting and stopping workers takes.
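Besides the console UI, the same timing information can be pulled from the command line. A minimal sketch, assuming the Cloud SDK (`gcloud`) is installed and the job runs in `us-central1` (adjust the region to your own; `JOB_ID` is a placeholder):

```shell
# List recent Dataflow jobs -- Dataprep executions show up here -- to find the job ID.
gcloud dataflow jobs list --region=us-central1 --limit=5

# Inspect one job; the output includes createTime and currentStateTime,
# whose difference approximates the elapsed (wall-clock) time, including
# the several minutes spent starting and stopping workers.
gcloud dataflow jobs describe JOB_ID --region=us-central1
```

Comparing `createTime` against when the first transform actually ran makes the fixed environment-setup overhead visible, which is the part you cannot tune away for small jobs.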

Rubén C.