Hello everyone, I'm pretty new to GCP (moving from AWS to GCP) and I have a basic question (please excuse me). We are building a traditional EDW on GCP. For scheduling we have Cloud Composer, and all our code sits on Compute Engine (like an EC2 instance in AWS).
How would I set up a workflow to run my jobs on Compute Engine, or what's the best way to implement this?
Some more info on our pipelines:
Pipeline 1: Extract millions of rows from a legacy SQL DB, apply some ETL logic (cleansing, adding a new column, dropping columns, upper-casing column values, and so on), and finally load into Redshift. A sketch of this transform step follows the pipeline list below.
Pipeline 2: Read data from Google Sheets, apply the same ETL logic, and load into different Redshift table(s).
Pipeline 3: Read data from Google APIs, perform cleanup, insert into Redshift, and so on.
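To make the transform step concrete, here is a minimal sketch of the kind of logic each pipeline applies today. The connection strings, table names, and column names are placeholders for illustration, not our real schema:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder source and target connections, just to show the shape of the job.
source_engine = create_engine("postgresql://user:pass@legacy-host/legacy_db")
target_engine = create_engine("postgresql://user:pass@warehouse-host/warehouse_db")

# Extract: pull the raw rows into a dataframe.
df = pd.read_sql("SELECT * FROM legacy_orders", source_engine)

# Cleansing: drop rows with missing keys and obvious duplicates.
df = df.dropna(subset=["order_id"]).drop_duplicates()

# Add a new column, drop columns we don't need, upper-case a text column.
df["load_date"] = pd.Timestamp.utcnow()
df = df.drop(columns=["legacy_flag", "unused_col"], errors="ignore")
df["country"] = df["country"].str.upper()

# Load into the cleaned warehouse table.
df.to_sql("orders_clean", target_engine, if_exists="append", index=False)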
How can I best write ETL workflows like these with Cloud Composer?
Any help is greatly appreciated!
----------PROJECT STRUCTURE & REQUIREMENTS------------

On my Compute Engine instance the projects look like this:

/home/ubuntu/projects/project1
    /venv
    /src/job1.py  (reads Google Sheets and loads into Cloud SQL)
    /src/job2.py  (reads the Google AdWords API, does some cleaning, modifies attributes, and loads into Cloud SQL)
/home/ubuntu/projects/project2
    /venv
    /src/job1.py  (reads a file from GCS, performs cleaning, adds/removes columns, and loads into Cloud SQL)
    /src/job2.py  (reads data from Cloud SQL table A, performs some modifications, and loads into Cloud SQL table B)

Now, in Composer, how do I orchestrate the complete workflow? The Python jobs sit on Compute Engine and I need to execute them from there. The reason we use Compute Engine is to perform in-memory operations: reading data into a dataframe, doing group-bys, creating new columns, creating temporary files, and so on. A sketch of the orchestration I'm picturing follows this block.

Or what would you suggest? For example, moving the whole sandbox into Composer's /data directory, like:

/data/projects/project1
    /venv
    /src/job1.py  (reads Google Sheets and loads into Cloud SQL)
    /src/job2.py  (reads the Google AdWords API, does some cleaning, modifies attributes, and loads into Cloud SQL)
/data/projects/project2
    /venv
    /src/job1.py  (reads a file from GCS, performs cleaning, adds/removes columns, and loads into Cloud SQL)
    /src/job2.py  (reads data from Cloud SQL table A, performs some modifications, and loads into Cloud SQL table B)

In that case:
1. Will I be able to download temporary files onto the Composer server and perform operations on them?
2. Would I no longer need a venv if I place my code in Composer directly, since I can install packages via PyPI from the console?
----------------------------------------------------------
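To make the first option concrete, here is a minimal sketch of the DAG I'm picturing, where the jobs stay on the VM and Composer triggers them over SSH. The connection id "gce_worker_ssh", the schedule, and the dag_id are placeholders, not working config (and on older Composer/Airflow versions the SSHOperator import path may be airflow.contrib.operators.ssh_operator instead):

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

# "gce_worker_ssh" is a hypothetical Airflow SSH connection pointing at the Compute Engine VM.
with DAG(
    dag_id="project1_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    job1 = SSHOperator(
        task_id="job1_sheets_to_cloudsql",
        ssh_conn_id="gce_worker_ssh",
        command=(
            "cd /home/ubuntu/projects/project1 && "
            "venv/bin/python src/job1.py"
        ),
    )

    job2 = SSHOperator(
        task_id="job2_adwords_to_cloudsql",
        ssh_conn_id="gce_worker_ssh",
        command=(
            "cd /home/ubuntu/projects/project1 && "
            "venv/bin/python src/job2.py"
        ),
    )

    job1 >> job2

If instead the code moved into Composer's /data directory, I assume the same tasks could become plain BashOperator or PythonOperator calls, with dependencies installed as PyPI packages on the environment rather than per-project venvs. Is that the recommended direction?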
Could you please help me out with your valuable knowledge? Thanks a lot in advance!