I need to get data from a third-party API and ingest it into Google BigQuery. I also need to automate this process through Google services so that it runs periodically.

I am trying to use Cloud Functions, but it needs a trigger. I have also read about App Engine, but I believe it is not suitable when all I need is a single function that pulls data from the API.

Another question: do I need to stage the data in Cloud Storage first, or can I load it straight into BigQuery? Should I use Dataflow, and does it require any special configuration?

from google.cloud import storage
import requests


def upload_blob(bucket_name, request_url, destination_blob_name):
    """
    Fetches the API response and uploads it to the bucket.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    # Pull the data from the third-party API and write it to the blob
    response = requests.get(request_url)
    blob.upload_from_string(response.text, content_type='application/json')

    print('File {} uploaded to {}.'.format(
        destination_blob_name,
        bucket_name))


def func_data(request):
    """HTTP entry point for the Cloud Function."""
    BUCKET_NAME = 'dataprep-staging'
    BLOB_NAME = 'any_name'

    # The request body is expected to carry the API URL, e.g. {"url": "..."}
    request_url = request.get_json()['url']

    upload_blob(BUCKET_NAME, request_url, BLOB_NAME)
    return 'Success!'

I am looking for advice about the architecture (Google services) I should use to build this pipeline. For example: use a Cloud Function to get the data from the API, then schedule a job with service 'X' to put the data into storage, and finally load the data from storage into BigQuery.

Eduardo Humberto

1 Answer

You can use Cloud Functions. Create an HTTP-triggered function and call it periodically with Cloud Scheduler.
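A minimal sketch of creating such a schedule with the google-cloud-scheduler Python client follows; the project, region, job name, function URL, and API URL are placeholders, and the hourly cron expression is just an example:

from google.cloud import scheduler_v1

# Create an hourly job that POSTs the API URL to the function's HTTP endpoint
client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path('my-project', 'us-central1')

job = scheduler_v1.Job(
    name=f'{parent}/jobs/pull-api-data',
    schedule='0 * * * *',  # cron syntax: top of every hour
    time_zone='Etc/UTC',
    http_target=scheduler_v1.HttpTarget(
        uri='https://us-central1-my-project.cloudfunctions.net/func_data',
        http_method=scheduler_v1.HttpMethod.POST,
        body=b'{"url": "https://api.example.com/data"}',
    ),
)

client.create_job(parent=parent, job=job)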

By the way, you can also have Cloud Scheduler call an HTTP endpoint on App Engine or Cloud Run in the same way.

About storage, the answer is no, you don't need it. If the API result is not too large for the function's allowed memory, you can write it to the /tmp directory and load the data into BigQuery from that file. You can size your function up to 2 GB of memory if needed.
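A minimal sketch of that pattern, assuming the API returns newline-delimited JSON and using a placeholder destination table (my_dataset.api_table):

from google.cloud import bigquery
import requests


def func_data(request):
    # Pull the API result and write it to the function's in-memory /tmp filesystem
    url = request.get_json()['url']
    tmp_path = '/tmp/api_result.json'
    with open(tmp_path, 'wb') as f:
        f.write(requests.get(url).content)

    # Load the file straight into BigQuery, with no Cloud Storage staging
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the schema
    )
    with open(tmp_path, 'rb') as f:
        load_job = client.load_table_from_file(
            f, 'my_dataset.api_table', job_config=job_config)
    load_job.result()  # block until the load job completes

    return 'Success!'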

guillaume blaquiere
  • Thanks @guillaume! That is the answer I was looking for. Still, isn't Cloud Storage meant to be the staging area? Shouldn't the "raw" data be concentrated there? – Eduardo Humberto Aug 26 '19 at 13:57
  • Storage is cheap. One best practice (if you have the cash for it) is to keep everything, in case of future needs. Data you may never use again can be stored in Coldline, which is very affordable; think about compressing it too. So, to answer you: you can (must?) keep your data at every stage: raw, intermediate, and final. Cloud Storage is the perfect place for unstructured data; for structured data, think about BigQuery (with partitioning!! – see the sketch after these comments). Its storage price is the same, switches to coldline pricing automatically after 90 days, and it's easy to query. – guillaume blaquiere Aug 26 '19 at 21:12
  • Perfect! Thank you. – Eduardo Humberto Aug 27 '19 at 19:42
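
A minimal sketch of the partitioning suggestion above, using the google-cloud-bigquery client; the project, dataset, table, and schema are placeholders:

from google.cloud import bigquery

# Create a day-partitioned table so queries can prune partitions by date
client = bigquery.Client()

table = bigquery.Table(
    'my-project.my_dataset.api_table',
    schema=[
        bigquery.SchemaField('payload', 'STRING'),
        bigquery.SchemaField('ingested_at', 'TIMESTAMP'),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field='ingested_at',  # partition on the ingestion timestamp column
)
client.create_table(table)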