
I have a small pipeline I'm trying to execute:

  1. A file is placed into a GCS bucket.
  2. A Cloud Function triggers a Dataflow job when the file lands in the bucket (not working).
  3. The Dataflow job writes to a BigQuery table (this part is working).

I've created a Dataflow job through Dataprep, as it has a nice UI for doing all my transformations before writing to a BigQuery table (writing to BigQuery works fine), and the Cloud Function fires when a file is uploaded to the GCS bucket. However, the Cloud Function doesn't trigger the Dataflow job (which I built in Dataprep).

Please have a look at the sample code of my Cloud Function below; any pointers as to why the Dataflow job is not triggering would be appreciated.

/**
 * Triggered from a message on a Cloud Storage bucket.
 *
 * @param {!Object} event The Cloud Functions event.
 * @param {!Function} callback The callback function.
 */
exports.processFile = (event, callback) => {
  console.log('Processing file: ' + event.data.name);
  callback();

  const google = require('googleapis');

 exports.CF_GCStoDataFlow_v2 = function(event, callback) {
  const file = event.data;
  if (file.resourceState === 'exists' && file.name) {
    google.auth.getApplicationDefault(function (err, authClient, projectId) {
      if (err) {
        throw err;
      }

      if (authClient.createScopedRequired && authClient.createScopedRequired()) {
        authClient = authClient.createScoped([
          'https://www.googleapis.com/auth/cloud-platform',
          'https://www.googleapis.com/auth/userinfo.email'
        ]);
      }

      const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });

      dataflow.projects.templates.create({
        projectId: projectId,
        resource: {
          parameters: {
            inputFile: `gs://${file.bucket}/${file.name}`,
            outputFile: `gs://${file.bucket}/${file.name}`
          },
          jobName: 'cloud-dataprep-csvtobq-v2-281345',
          gcsPath: 'gs://mygcstest-pipeline-staging/temp/'
        }
      }, function(err, response) {
        if (err) {
          console.error("problem running dataflow template, error was: ", err);
        }
        console.log("Dataflow template response: ", response);
        callback();
      });

    });
  }
 };
};

[Screenshot: Dataproc job submission UI]

  • You have attached a Dataproc job submission UI screenshot. Is this a mistake, or do you use Dataproc in your workflow somehow? – Igor Dvorzhak May 08 '18 at 00:28
  • This was for a previous commenter, who suggested activating Dataproc jobs (see below). – BenAhm May 08 '18 at 13:47
  • for this line, `console.log('Processing file: ' + event.data.name);` I got the error "Cannot read property 'name' of undefined" – hamedazhar Aug 31 '18 at 11:08
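
Regarding the last comment above: one likely cause of "Cannot read property 'name' of undefined" is that newer Cloud Functions runtimes pass the Storage object directly as the first argument instead of wrapping it in event.data. This is an assumption about the runtime, not something confirmed in this thread; a minimal guard that handles both shapes could look like this:

exports.processFile = (event, context, callback) => {
  // Older runtimes wrap the Storage object in event.data; newer ones pass it
  // directly as the first argument. Fall back to the event itself.
  const file = event.data || event;
  console.log('Processing file: ' + file.name);

  // On the older signature the second argument is actually the callback.
  const done = typeof context === 'function' ? context : callback;
  if (typeof done === 'function') {
    done();
  }
};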

3 Answers


This snippet may help. It uses a different method of the Dataflow API (launch), and it worked for me. Be aware that you need to specify the template's URL, and also check the metadata file (you can find it in the same directory as the template when it is executed through the Dataprep interface) to make sure you are including the right parameters:

dataflow.projects.templates.launch({
  projectId: projectId,
  location: location,
  gcsPath: jobTemplateUrl,
  resource: {
    parameters: {
      inputLocations: `{"location1": "gs://${file.bucket}/${file.name}"}`,
      outputLocations: `{"location1": "gs://${destination.bucket}/${destination.name}"}`
    },
    environment: {
      tempLocation: `gs://${destination.bucket}/${destination.tempFolder}`,
      zone: "us-central1-f"
    },
    jobName: 'my-job-name'
  }
}, function (err, response) {
  if (err) {
    console.error("problem launching dataflow template, error was: ", err);
  }
  console.log("Dataflow launch response: ", response);
  callback();
});
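
For context, here is a rough sketch of how that launch call might be wired into the Cloud Function from the question, reusing its auth pattern. The region, template path, temp location, and output locations below are placeholders/assumptions rather than values from this thread, and the parameter keys (e.g. "location1") must match the ones listed in the template's metadata file:

exports.CF_GCStoDataFlow_v2 = (event, callback) => {
  const google = require('googleapis');
  const file = event.data;

  if (file.resourceState === 'exists' && file.name) {
    google.auth.getApplicationDefault(function (err, authClient, projectId) {
      if (err) {
        return callback(err);
      }
      // (The createScopedRequired/createScoped handling from the question
      // applies here as well if the default credentials need explicit scopes.)

      const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });

      dataflow.projects.templates.launch({
        projectId: projectId,
        location: 'us-central1',                           // region of the template (assumption)
        gcsPath: 'gs://<TEMPLATE_BUCKET>/<TEMPLATE_PATH>', // template exported by Dataprep (placeholder)
        resource: {
          jobName: 'dataprep-launch-' + Date.now(),
          parameters: {
            // Keys such as "location1" come from the template's metadata file.
            inputLocations: `{"location1": "gs://${file.bucket}/${file.name}"}`,
            outputLocations: `{"location1": "gs://<OUTPUT_BUCKET>/<OUTPUT_PATH>"}`
          },
          environment: {
            tempLocation: 'gs://<TEMP_BUCKET>/temp',
            zone: 'us-central1-f'
          }
        }
      }, function (err, response) {
        if (err) {
          console.error('problem launching dataflow template, error was: ', err);
        }
        console.log('Dataflow launch response: ', response);
        callback();
      });
    });
  } else {
    callback();
  }
};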

Have you submitted your Dataproc job? Has it started running? The documentation below can give you some idea of how to get started:

https://cloud.google.com/dataproc/docs/concepts/jobs/life-of-a-job

  • However, I'm using Dataprep to create the Dataflow jobs: https://cloud.google.com/dataprep/ (not Dataproc) – BenAhm May 07 '18 at 18:55
  • It doesn't matter how you create your Dataflow job. In order for the trigger to happen, the Dataproc job must be running, so that it can kick-start your Dataflow job based on the trigger condition. – Sammy May 07 '18 at 18:59
  • Hi, if I have understood you correctly, creating a Dataproc job is required to trigger a Cloud Function? If so, I've gone through the options in Dataproc (submit job), but it's not obviously clear to me how this integrates with Cloud Functions, or what kind of job needs to be created just to get the Cloud Function to trigger (I've added a screenshot above). Please let me know what I've missed. – BenAhm May 07 '18 at 20:00
  • There are two things going on here. 1) You are trying to create a simple Dataflow job. 2) You are trying to create a trigger that will trigger this job each time a triggering condition is met. Your screenshot will work for 1); for 2) you will have to create a trigger (job) and run it. I believe you're missing out on the running part of the trigger job. In essence it is two jobs. – Sammy May 07 '18 at 20:16
  • Follow this thread; this user talks about how he's running his trigger job: https://stackoverflow.com/questions/49348220/google-cloud-functions-cannot-read-property-getapplicationdefault – Sammy May 07 '18 at 20:21

It looks like you are putting CF_GCStoDataFlow_v2 inside processFile, so the Dataflow part of the code never executes.

Your function should look like this:

/**
 * Triggered from a message on a Cloud Storage bucket.
 *
 * @param {!Object} event The Cloud Functions event.
 * @param {!Function} callback The callback function.
 */
exports.CF_GCStoDataFlow_v2 = (event, callback) => {

  const google = require('googleapis');

  // The Cloud Storage object describing the uploaded file.
  const file = event.data;

  if (file.resourceState === 'exists' && file.name) {
    google.auth.getApplicationDefault(function (err, authClient, projectId) {
      if (err) {
        throw err;
      }

      if (authClient.createScopedRequired && authClient.createScopedRequired()) {
        authClient = authClient.createScoped([
          'https://www.googleapis.com/auth/cloud-platform',
          'https://www.googleapis.com/auth/userinfo.email'
        ]);
      }

      const dataflow = google.dataflow({ version: 'v1b3', auth: authClient });

      dataflow.projects.templates.create({
        projectId: projectId,
        resource: {
          parameters: {
            inputFile: `gs://${file.bucket}/${file.name}`,
            outputFile: `gs://${file.bucket}/${file.name}`
          },
          jobName: '<JOB_NAME>',
          gcsPath: '<BUCKET_NAME>'
        }
      }, function(err, response) {
        if (err) {
          console.error("problem running dataflow template, error was: ", err);
        }
        console.log("Dataflow template response: ", response);
        callback();
      });

    });
  } else {
    // Nothing to launch for this event; just acknowledge it.
    callback();
  }
};

Make sure you change the value under “Function to execute” to CF_GCStoDataFlow_v2.

  • Made the changes and changed the 'Function to execute' to CF_GCStoDataFlow_v5. However, I just got an error in the logs: 'Error: Cannot find module 'googleapis' at Function.Module._resolveFilename'. Am I missing anything? Do I have to make any changes to my package.json file? The log snippet is below. Thanks – BenAhm May 08 '18 at 13:46
  • 'Error: Cannot find module 'googleapis' at Function.Module._resolveFilename (module.js:469) at Function.Module._load (module.js:417) at Module.require (module.js:497) at require (internal/module.js:20) at exports.CF_GCStoDataFlow_v5 (index.js:9) at (/var/tmp/worker/worker.js:705) at (/var/tmp/worker/worker.js:670) at _combinedTickCallback (internal/process/next_tick.js:73) at process._tickDomainCallback (next_tick.js:128)' – BenAhm May 08 '18 at 13:48
  • You need to add the googleapis dependency in package.json. Example (formatted below): { "name": "sample-cloud-storage", "version": "0.0.1", "dependencies": { "googleapis": "^21.3.0" } } – Federico Panunzio May 08 '18 at 14:42
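
For readability, the package.json from the comment above as a complete file (the version shown is the one given in the comment; newer googleapis releases may have a different API surface):

{
  "name": "sample-cloud-storage",
  "version": "0.0.1",
  "dependencies": {
    "googleapis": "^21.3.0"
  }
}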