Can you change the file metadata on a cloud database using Apache Beam? From what I understand, Beam is used to set up dataflow pipelines for Google Dataflow. But is it possible to use Beam to change the metadata if you have the necessary changes in a CSV file without setting up and running an entire new pipeline? If it is possible, how do you do it?
2 Answers
You could write a Cloud Dataflow job to handle this, but I would not. A simple GCE instance would be easier to develop and run the job. An even better choice might be a UDF (see below).
There are some guidelines for when Cloud Dataflow is appropriate:
- Your data is not tabular and you cannot use SQL to do the analysis.
- Large portions of the job are parallel -- in other words, you can process different subsets of the data on different machines.
- Your logic involves custom functions, iterations, etc...
- The distribution of the work varies across your data subsets.
Since your task involves modifying a database (I am assuming a SQL database), it would be much easier and faster to write a UDF to process and modify the database, or a small one-off script as sketched below.
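As an illustration only, here is a minimal sketch of the kind of one-off job you could run on a single GCE instance instead of Dataflow. It assumes a PostgreSQL-compatible database (e.g. Cloud SQL) reachable with psycopg2, and a made-up CSV layout (`file_id`, `new_owner`, `new_label`) and table name; adapt all of those to your actual schema:

```python
# One-off metadata update driven by a CSV file -- the kind of job that fits
# a single GCE instance rather than a Dataflow pipeline.
# Assumes a PostgreSQL-compatible database (e.g. Cloud SQL) and a CSV with
# columns: file_id, new_owner, new_label. All names here are illustrative.
import csv
import psycopg2

conn = psycopg2.connect(host="10.0.0.5", dbname="files", user="app", password="secret")

with conn, conn.cursor() as cur, open("metadata_changes.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Apply each change as a simple parameterized UPDATE.
        cur.execute(
            "UPDATE file_metadata SET owner = %s, label = %s WHERE file_id = %s",
            (row["new_owner"], row["new_label"], row["file_id"]),
        )

conn.close()
```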

Thanks John. I will look into this further. – not_a_comp_scientist Oct 25 '18 at 14:09
First, Apache Beam does not currently support schema updates. There has been a feature request open for some time, but no news on it.
Another option is to alter your current Apache Beam pipeline so that it migrates your table to another table with the corrected schema, roughly as sketched below. Unfortunately, this does not scale if you have a lot of data, especially if you need to change the table schema frequently (renaming columns, renaming the table, changing data types, etc.).
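For concreteness, that migration approach could look something like the sketch below (Beam Python SDK, assuming the data lives in BigQuery; the table names, the column rename, and the schema string are placeholders, not anything from the question):

```python
# Sketch of the "migrate to a corrected-schema table" approach with Beam.
# Assumes BigQuery; table names, the rename, and the schema are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def fix_row(row):
    # Example correction: rename old_name -> new_name, keep the other field.
    return {"new_name": row["old_name"], "value": row["value"]}

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "Read old table" >> beam.io.ReadFromBigQuery(table="project:dataset.old_table")
     | "Apply schema fix" >> beam.Map(fix_row)
     | "Write new table" >> beam.io.WriteToBigQuery(
            "project:dataset.new_table",
            schema="new_name:STRING,value:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```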
What I propose instead is issuing SQL queries to update your table schema. You can write a bash script, following this guide, that executes the ALTER TABLE statements.
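A minimal sketch of that idea in Python rather than bash, assuming a Cloud SQL for PostgreSQL instance and a CSV that lists one rename per row (`table_name`, `old_column`, `new_column` are made-up column names):

```python
# Apply schema changes listed in a CSV by issuing ALTER TABLE statements.
# Assumes Cloud SQL for PostgreSQL; CSV columns table_name, old_column,
# new_column are illustrative.
import csv
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect(host="10.0.0.5", dbname="files", user="app", password="secret")

with conn, conn.cursor() as cur, open("schema_changes.csv", newline="") as f:
    for row in csv.DictReader(f):
        # sql.Identifier quotes each name safely before it is interpolated.
        cur.execute(
            sql.SQL("ALTER TABLE {} RENAME COLUMN {} TO {}").format(
                sql.Identifier(row["table_name"]),
                sql.Identifier(row["old_column"]),
                sql.Identifier(row["new_column"]),
            )
        )

conn.close()
```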
