
Hello everyone, I'm pretty new to GCP (moving from AWS to GCP) and I have a lame question (please excuse me). We are building a traditional EDW on GCP. For scheduling we have Cloud Composer, and all our code sits on Compute Engine (like an EC2 instance in AWS).

How would I set up a workflow to run my jobs from Compute Engine? Or what's the best solution to implement this?

Some more info on our pipelines. Pipeline 1: extracts millions of rows from a legacy SQL DB, applies some ETL logic (cleansing, adding a new column, dropping columns, upper-casing column values, and so on), and finally loads into Redshift.

Pipeline 2: reads data from Google Sheets, performs the above ETL logic, and loads into different Redshift table(s).

Pipeline 3: reads data from Google APIs, performs cleanup, inserts into Redshift, and so on.
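The ETL steps shared by the three pipelines (cleansing, adding a column, dropping columns, up-casing values) can be sketched in plain Python; all names here are illustrative, not from the actual project:

```python
# Minimal sketch of the shared ETL steps the pipelines describe.
# Column and key names are placeholders.

def transform(rows, drop_cols=("legacy_id",), upper_cols=("country",)):
    """Apply cleansing, column drop, up-casing, and a derived column
    to a list of dict rows, returning the cleaned rows."""
    out = []
    for row in rows:
        # cleansing: skip rows with no primary key
        if not row.get("id"):
            continue
        # drop unwanted columns
        clean = {k: v for k, v in row.items() if k not in drop_cols}
        # up-case selected column values
        for col in upper_cols:
            if col in clean and isinstance(clean[col], str):
                clean[col] = clean[col].upper()
        # add a derived column
        clean["loaded_by"] = "composer"
        out.append(clean)
    return out
```

The same function body could run unchanged whether it is called from a Composer task or from a script on the VM, which keeps the orchestration question separate from the transformation logic.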

How best can I write my ETL workflows with Cloud Composer?
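For orientation, one pipeline in Composer is just an Airflow DAG; a minimal skeleton might look like the following (DAG id, schedule, and the callables are placeholders, using the Airflow 1.10-style imports Composer shipped at the time):

```python
# Sketch of one pipeline as a Composer (Airflow 1.10) DAG.
# Task names, schedule, and callables are illustrative only.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract():
    ...  # pull rows from the legacy SQL DB

def transform():
    ...  # cleansing, add/drop columns, up-case values

def load():
    ...  # write the cleaned rows to the warehouse

with DAG(
    dag_id="pipeline_1_legacy_sql",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```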

Any help is highly appreciated!

----------PROJECT STRUCTURE & REQUIREMENTS------------
On my Compute Engine instance I have projects like:

    /home/ubuntu/projects/project1
        /venv
        /src/job1.py ( reads Google Sheets and loads into Cloud SQL )
        /src/job2.py ( reads the Google AdWords API, does some cleaning, modifies attributes, and loads into Cloud SQL )


    /home/ubuntu/projects/project2
        /venv
        /src/job1.py ( reads a file from GCS, performs cleaning, adds/removes columns, and loads into Cloud SQL )
        /src/job2.py ( reads data from Cloud SQL table A, performs some modifications, and loads into Cloud SQL table B )
    
    
    
    
    Now, in Composer, how do I orchestrate the complete workflow? The Python jobs sit on Compute Engine, and I need to execute them.
    
    The reason why we use Compute Engine is to perform some in-memory operations: reading data into a dataframe, doing group-bys, creating new columns, creating temporary files, and so on.
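One direct way to run the existing scripts in place on the VM from Composer is a `BashOperator` that SSHes in via `gcloud`; the instance name, zone, and schedule below are placeholders, and the Composer environment's service account would need SSH access to the VM for this to work:

```python
# Sketch: trigger a job that lives on a Compute Engine VM from a
# Composer (Airflow 1.10) DAG via `gcloud compute ssh`.
# Instance name "etl-vm" and zone are assumed placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="project1_jobs",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job1 = BashOperator(
        task_id="run_job1",
        bash_command=(
            "gcloud compute ssh etl-vm --zone us-central1-a "
            "--command 'cd /home/ubuntu/projects/project1 && "
            "venv/bin/python src/job1.py'"
        ),
    )
```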
    
    Or what would be your suggestions?
    For example, moving the whole sandbox into Composer's /data directory, like:
    /data/projects/project1
        /venv
        /src/job1.py ( reads Google Sheets and loads into Cloud SQL )
        /src/job2.py ( reads the Google AdWords API, does some cleaning, modifies attributes, and loads into Cloud SQL )


    /data/projects/project2
        /venv
        /src/job1.py ( reads a file from GCS, performs cleaning, adds/removes columns, and loads into Cloud SQL )
        /src/job2.py ( reads data from Cloud SQL table A, performs some modifications, and loads into Cloud SQL table B )
    
    
    In this case,
        1. Will I be able to download temporary files onto the Composer environment and perform operations on them?
        2. Would I no longer need to create a venv if I place my code in Composer directly, since I can install packages via PyPI in the console?

----------------------------------------------------------

Could you please help me out with your valuable knowledge? Thanks a lot in advance!


kylasam
  • Do you want to run your scripts inside Compute Engine? Can you explain what your scripts look like? I mean, which language are you using, which libraries, etc.? – rmesteves Jul 29 '20 at 08:40
  • Where is your problem? Using Composer with your Compute Engine instance? If so, how do you trigger the code on it? – guillaume blaquiere Jul 29 '20 at 13:33
  • Thank you so much for your valuable replies. Let me give more insight. – kylasam Jul 30 '20 at 09:11
  • I have just added a couple more details on our project structure and the way we execute it. I hope this gives a good picture of my requirement! Looking forward to your help :) Thanks in advance. – kylasam Jul 30 '20 at 09:47

1 Answer


Here is a design pattern that you can adapt to suit your needs: task scheduling on Compute Engine with Cloud Scheduler.

Assuming you can set up Pub/Sub topics and subscriptions, you can:

  • Have a DAG in Composer that runs some code and publishes a message to a Pub/Sub topic.
  • Have a process running on Compute Engine that subscribes to the topic. Upon receiving a message, it triggers the script you need to run.
  • Upon completion, the script notifies another Pub/Sub topic.
  • Have a separate DAG in Composer that triggers upon receiving that message (note: there are multiple ways to do this; see here).
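The two halves of this pattern can be sketched roughly as follows; the project, topic, subscription, and path names are placeholders, and running it requires the `google-cloud-pubsub` client library plus GCP credentials:

```python
# Sketch of the Pub/Sub trigger pattern above.
# All resource names are illustrative placeholders.
import subprocess
from google.cloud import pubsub_v1

PROJECT = "my-project"

# --- publisher side, called from a Composer DAG task ---
def publish_trigger():
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT, "etl-triggers")
    # message body names the job to run, e.g. "project1/job1"
    publisher.publish(topic_path, data=b"project1/job1").result()

# --- subscriber side, a long-running process on the Compute VM ---
def callback(message):
    proj, script = message.data.decode().split("/")
    # run the requested script with its project's own venv
    subprocess.run(
        [f"/home/ubuntu/projects/{proj}/venv/bin/python",
         f"/home/ubuntu/projects/{proj}/src/{script}.py"],
        check=True,
    )
    message.ack()

def listen():
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT, "etl-triggers-sub")
    future = subscriber.subscribe(sub_path, callback=callback)
    future.result()  # block forever, processing messages as they arrive
```

The completion notification would be a second `publish` call at the end of `callback`, to a topic that the downstream DAG listens on.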
jayque