My recommendation is to extract the Oracle table content into files (CSV format, for example), copy the files into Cloud Storage, and then load them into BigQuery.
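For the extract-and-copy steps, here is a minimal sketch, assuming a cx_Oracle connection (the DSN, table and bucket names are placeholders; any export tool that produces CSV works as well):

import csv
import cx_Oracle
from google.cloud import storage

def export_table_to_gcs(table_name, bucket_name):
    # 1. Dump the Oracle table into a local CSV file (header row included).
    conn = cx_Oracle.connect("my_user/my_password@my-oracle-host:1521/my_service")  # placeholder DSN
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM {}".format(table_name))

    local_file = "/tmp/{}.csv".format(table_name)
    with open(local_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row with the column names
        writer.writerows(cursor)  # cx_Oracle cursors are iterable row by row
    conn.close()

    # 2. Copy the file into Cloud Storage; the object name drives the BigQuery table name later.
    storage.Client().bucket(bucket_name).blob("{}.csv".format(table_name)) \
        .upload_from_filename(local_file)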
Dataflow is useless (expensive, less efficient, takes more time) if the transformation that you want to do can be done in SQL.
However, if you need to call an external API (for data transformation, such as an ML API for example) or if you want to sink the data into another database than BigQuery (Firestore, Bigtable, Cloud SQL, ...), Dataflow is the right tool.
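For illustration only, a minimal Beam/Dataflow sketch for that case, where enrich_with_ml_api, the bucket path, the table reference and the schema are all hypothetical names:

import apache_beam as beam

def enrich_with_ml_api(line):
    # This is where you would call the external API (an ML API for example)
    # and return the enriched record.
    return {"raw": line, "label": "TODO"}

with beam.Pipeline() as p:
    (p
     | "Read CSV" >> beam.io.ReadFromText("gs://my-bucket/extract/*.csv", skip_header_lines=1)
     | "Enrich" >> beam.Map(enrich_with_ml_api)
     | "Write" >> beam.io.WriteToBigQuery(
           "my_project:my_dataset.my_table",
           schema="raw:STRING,label:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))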
EDIT
To go deeper into detail, I assume that all the tables land in the same dataset. Then, the code is simple:
from google.cloud import bigquery

def hello_gcs_generic(data, context):
    client = bigquery.Client()
    dataset_id = 'my_dataset'

    # The file that triggered the function: its name (without extension) becomes the table name.
    bucket = data['bucket']
    path = data['name']
    table_name = path[path.rfind('/')+1:path.rfind('.')]

    dataset_ref = client.dataset(dataset_id)
    job_config = bigquery.LoadJobConfig(
        autodetect=True,
        skip_leading_rows=1,
        time_partitioning=bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.DAY,
            field="date"  # Name of the column to use for partitioning.
        ),
        source_format=bigquery.SourceFormat.CSV
    )

    uri = "gs://{}/{}".format(bucket, path)
    load_job = client.load_table_from_uri(
        uri, dataset_ref.table(table_name), job_config=job_config
    )  # API request
    print("Starting job {}".format(load_job.job_id))

    load_job.result()  # Waits for the table load to complete.
    print("Job finished.")
Here, the Cloud Function is called for each file dropped into the bucket. Therefore, if you drop 300 files at the same time, 300 function instances will be triggered and the loads will be performed in parallel.
Several points:
- The table name is equal to the file name (without extension).
- By default, the behavior is a write-append (the data is added to the existing table); see the sketch after this list if you want to replace the content instead.
- I strongly recommend that you partition your table on a field, here a date field.
- The load job is usually quick. If it takes more than 9 minutes (the Cloud Functions timeout), use Cloud Run (limited to 15 minutes) or don't wait for the end of the load (delete the load_job.result() call).
- The CSV starts with 1 header row. It's not mandatory, but with schema auto-detection it helps to name the columns nicely.
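About the write-append point above: if you prefer to replace the table content on each load, here is a sketch of the job config change (WRITE_TRUNCATE is the standard BigQuery write disposition for that):

job_config = bigquery.LoadJobConfig(
    autodetect=True,
    skip_leading_rows=1,
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # replace instead of append
)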
Note: I assume that all the files dropped into Cloud Storage have to be integrated into BigQuery. If not, you can add a filter as described in one of my articles.
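A simple version of such a filter, based only on the object name (the prefix and the extension used here are just examples):

def hello_gcs_generic(data, context):
    # Only load the CSV extracts coming from the expected prefix; skip everything else.
    if not (data['name'].startswith("oracle-extract/") and data['name'].endswith(".csv")):
        print("Ignoring {}".format(data['name']))
        return
    # ... rest of the load job as shown above ...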