
How do you access the CDAP REST API of a Cloud Data Fusion instance? I would like to use Cloud Composer to orchestrate my pipelines.

I have an Enterprise Edition instance with private IP enabled, but I can't find any documentation on how to access the REST API.

The instance details page only shows a /22 IP address range, not a specific IP address. Do I access it using the IAP-protected URL for the UI?

3 Answers


You can get the CDAP API endpoint for your Data Fusion instances using the projects.locations.instances.list method. You can test it with the API Explorer or with curl:

PROJECT=$(gcloud config get-value project)
TOKEN=$(gcloud auth print-access-token)
LOCATION=europe-west4

curl -H "Authorization: Bearer $TOKEN" \
        https://datafusion.googleapis.com/v1beta1/projects/$PROJECT/locations/$LOCATION/instances

{
  "instances": [
    {
      "name": "projects/PROJECT/locations/europe-west4/instances/data-fusion-1",
      "type": "BASIC",
      "networkConfig": {},
      "createTime": "2019-11-10T12:02:55.776479620Z",
      "updateTime": "2019-11-10T12:16:41.560477044Z",
      "state": "RUNNING",
      "serviceEndpoint": "https://data-fusion-1-PROJECT-dot-euw4.datafusion.googleusercontent.com",
      "version": "6.1.0.2",
      "serviceAccount": "cloud-datafusion-management-sa@REDACTED-tp.iam.gserviceaccount.com",
      "displayName": "data-fusion-1",
      "apiEndpoint": "https://data-fusion-1-PROJECT-dot-euw4.datafusion.googleusercontent.com/api"
    }
  ]
}
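As a rough Python sketch of the same step, the list response can be reduced to a mapping from instance name to CDAP endpoint. The helper name `api_endpoints` is hypothetical, and the sample dictionary is just a trimmed copy of the response shown above; fetching the response itself is left out here.

```python
def api_endpoints(list_response):
    """Map each instance's resource name to its CDAP apiEndpoint.

    `list_response` is the parsed JSON body returned by
    projects.locations.instances.list (hypothetical helper, not part of any SDK).
    """
    return {
        inst["name"]: inst["apiEndpoint"]
        for inst in list_response.get("instances", [])
        if "apiEndpoint" in inst  # skip instances that expose no endpoint
    }


# Trimmed copy of the sample response shown above.
sample = {
    "instances": [
        {
            "name": "projects/PROJECT/locations/europe-west4/instances/data-fusion-1",
            "type": "BASIC",
            "state": "RUNNING",
            "apiEndpoint": "https://data-fusion-1-PROJECT-dot-euw4.datafusion.googleusercontent.com/api",
        }
    ]
}

endpoints = api_endpoints(sample)
for name, url in endpoints.items():
    print(name, url)
```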

Note that apiEndpoint is in the form:

https://<INSTANCE_DISPLAY_NAME>-<PROJECT_ID>-dot-<REGION_ACRONYM>.datafusion.googleusercontent.com/api

Now, we can follow the CDAP reference guide to see, for example, the run history for one pipeline:

GET hostname/api/v3/namespaces/namespace-id/apps/pipeline-name/workflows/DataPipelineWorkflow/runs

where hostname is the previously obtained serviceEndpoint (so hostname/api is the apiEndpoint), namespace-id is default for a BASIC instance (an Enterprise instance can have several namespaces), and pipeline-name is BQ-to-GCS in my case:

curl -H "Authorization: Bearer $TOKEN" \
        https://data-fusion-1-$PROJECT-dot-euw4.datafusion.googleusercontent.com/api/v3/namespaces/default/apps/BQ-to-GCS/workflows/DataPipelineWorkflow/runs

[{"runid":"REDACTED","starting":1573395214,"start":1573395401,"end":1573395492,"status":"COMPLETED",
"properties":{"runtimeArgs":"{\"logical.start.time\":\"1573395214003\",\"system.profile.name\":\"SYSTEM:dataproc\"}",
"phase-1":"b8f5c7d1-03c4-11ea-a553-42010aa40019"},"cluster":{"status":"DEPROVISIONED","end":1573395539,"numNodes":3},
"profile":{"profileName":"dataproc","namespace":"system","entity":"PROFILE"}}]
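The same call can be sketched in Python using only the standard library. The function names (`runs_url`, `fetch_runs`, `summarize`) are hypothetical, the token is assumed to come from `gcloud auth print-access-token`, and the `start`/`end` fields are assumed to be epoch seconds, as they appear in the response above. Only `summarize` is exercised here; `fetch_runs` needs a reachable instance.

```python
import json
import urllib.parse
import urllib.request


def runs_url(api_endpoint, pipeline, namespace="default"):
    """Build the run-history URL for a batch pipeline (DataPipelineWorkflow)."""
    ns = urllib.parse.quote(namespace, safe="")
    app = urllib.parse.quote(pipeline, safe="")
    return (
        f"{api_endpoint}/v3/namespaces/{ns}/apps/{app}"
        "/workflows/DataPipelineWorkflow/runs"
    )


def fetch_runs(api_endpoint, pipeline, token, namespace="default"):
    """GET the run history; `token` is an OAuth access token (assumed valid)."""
    req = urllib.request.Request(
        runs_url(api_endpoint, pipeline, namespace),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize(runs):
    """Reduce each run record to its id, status and duration in seconds."""
    return [
        {
            "runid": r["runid"],
            "status": r["status"],
            # Runs that are still in progress have no "end" yet.
            "seconds": r["end"] - r["start"] if "start" in r and "end" in r else None,
        }
        for r in runs
    ]


# Trimmed copy of the run record shown above.
sample_runs = [
    {"runid": "REDACTED", "starting": 1573395214, "start": 1573395401,
     "end": 1573395492, "status": "COMPLETED"}
]
summary = summarize(sample_runs)
print(summary)  # the sample run lasted 91 seconds
```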
Guillem Xercavins
Thanks Guillem. Is there a Pythonic way (say, using the requests library) to extract (1) the pipelines within a given namespace, (2) the run IDs of all pipelines, and (3) the metrics of each run, such as number of records and time taken? – Ananth, Nov 25 '21 at 11:07

There are now also Airflow operators in Cloud Composer for making API calls to Data Fusion, which makes this much simpler. See the operators documentation.

Example of starting a Data Fusion pipeline from a Cloud Composer DAG:

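from airflow.providers.google.cloud.operators.datafusion import CloudDataFusionStartPipelineOperator

# LOCATION, PIPELINE_NAME and INSTANCE_NAME are assumed to be defined elsewhere in the DAG file.
start_pipeline = CloudDataFusionStartPipelineOperator(
    location=LOCATION,
    pipeline_name=PIPELINE_NAME,
    instance_name=INSTANCE_NAME,
    task_id="start_pipeline",
)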
Heikura

It's simple:

export AUTH_TOKEN=$(gcloud auth print-access-token)
export INSTANCE_ID=***
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe \
  --project=*** \
  --location=europe-west1 \
  --format="value(apiEndpoint)" \
  "${INSTANCE_ID}")

curl -X GET -H "Authorization: Bearer ${AUTH_TOKEN}" \
"${CDAP_ENDPOINT}/v3/namespaces/default/apps"