We are considering running Dataprep on an automatic schedule to wrangle and load a folder of GCS .gz files into BigQuery.

The challenge is: how can the source .gz files be moved to cold storage once they have been processed?

I can't find an event generated by Dataprep that we could hook into in order to perform the archiving task. What would be ideal is if Dataprep could archive the source files by itself.

Any suggestions?

jldupont

1 Answer

I don't believe there is a way to get notified directly by Dataprep when a job is done. What you could do instead is poll the underlying Dataflow jobs. You could schedule a script to run whenever your scheduled Dataprep job runs. Here's a simple example:

#!/bin/bash

# List the active Dataflow jobs, keep only the one with "dataprep" in its name,
# and grab its job id (first column of the first data row after the header).
id=$(gcloud dataflow jobs list --status=active --filter="name:dataprep" | sed -n 2p | cut -f 1 -d " ")

# Poll until the job's state changes to done.
until [ "$(gcloud dataflow jobs describe "$id" | grep currentState | head -1 | awk '{print $2}')" = "JOB_STATE_DONE" ]
do
  # Sleep between polls to reduce API calls.
  sleep 5m
done

# Send the source files to cold storage, e.g. gsutil mv ...
echo "done"

The problem here is that the above assumes you only run one Dataprep job at a time. If you schedule many concurrent Dataprep jobs, the script would need to be a bit more involved; a sketch follows below.
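If you do need to handle several concurrent jobs, one option is to collect all matching job ids and wait for each of them in turn. A minimal sketch, assuming every relevant Dataflow job has "dataprep" in its name and that your gcloud version supports --format="value(id)" for dataflow jobs list:

#!/bin/bash

# Collect the ids of all active Dataflow jobs launched by Dataprep
# (assumes their names all contain the string "dataprep").
ids=$(gcloud dataflow jobs list --status=active --filter="name:dataprep" --format="value(id)")

# Wait for each job in turn; once the outer loop exits, all of them are done.
for id in $ids
do
  until [ "$(gcloud dataflow jobs describe "$id" | grep currentState | head -1 | awk '{print $2}')" = "JOB_STATE_DONE" ]
  do
    sleep 5m
  done
done

# All matching jobs have finished; archive the source files as shown above.
echo "all done"

Note that, like the original script, this only checks for JOB_STATE_DONE; a failed or cancelled job would keep the loop waiting forever, so a production version should also check for terminal failure states.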

Lefteris S