Questions tagged [google-cloud-data-fusion]

Google Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. Data Fusion has a visual point-and-click interface, transformation blueprints, and connectors to make ETL pipeline development fast and easy. Cloud Data Fusion is based on the open-source CDAP project.

This tag can be added to any questions related to using/troubleshooting Google Cloud Data Fusion.

445 questions
2 votes • 1 answer

How to schedule Dataproc PySpark jobs on GCP using Data Fusion/Cloud Composer

Hello fellow developers, I have recently started learning about GCP and I am working on a POC that requires me to create a pipeline that can schedule Dataproc jobs written in PySpark. Currently, I have created a Jupyter notebook on my…
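
A common pattern for this (a minimal sketch; the DAG ID, project, cluster, and gs:// path below are placeholders, not from the question) is a Cloud Composer DAG that submits the PySpark job through the Airflow Google provider's Dataproc operator:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocSubmitJobOperator,
    )

    # Job spec targeting an existing Dataproc cluster; project, cluster,
    # and the gs:// path are placeholders.
    PYSPARK_JOB = {
        "reference": {"project_id": "my-project"},
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
    }

    # Daily DAG that submits the PySpark job to Dataproc.
    with DAG(
        dag_id="daily_pyspark_etl",
        schedule_interval="@daily",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        DataprocSubmitJobOperator(
            task_id="submit_pyspark",
            job=PYSPARK_JOB,
            region="us-central1",
            project_id="my-project",
        )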
2 votes • 0 answers

How to troubleshoot a "Communications link failure" error with Cloud Data Fusion

I have two GCP projects: a testing environment built from scratch, and a production environment with an existing Cloud SQL for MySQL instance. My goal is to set up a replication pipeline with Data Fusion to replicate some MySQL tables. On the…
2 votes • 1 answer

Read and transform Parquet files in Cloud Data Fusion

I'm trying to ingest and transform a Parquet file in Cloud Data Fusion. I can ingest the Parquet file using the GCS plugin, but when I want to transform it using the Wrangler plugin I don't see any capability to do that. Does the Wrangler…
Mason • 165
2 votes • 1 answer

Cloud Data Fusion triggered pipeline - reuse the already provisioned Dataproc clusters

Is there a way to avoid the provisioning step for subsequently triggered outbound pipelines? It looks like when a pipeline triggers an outbound pipeline, it does the provisioning all over again. Can we simply execute the triggered pipeline on the…
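
One avenue worth testing (a hedged sketch, not a confirmed fix): create a compute profile backed by an existing Dataproc cluster and pin the triggered pipeline to it through the CDAP REST preferences endpoint via the system.profile.name preference, so its runs reuse that cluster instead of provisioning an ephemeral one. The pipeline and profile names below are placeholders:

    import os

    import requests

    CDAP_ENDPOINT = os.environ["CDAP_ENDPOINT"]  # Data Fusion instance API endpoint
    AUTH_TOKEN = os.environ["AUTH_TOKEN"]        # OAuth access token

    # Pin the triggered pipeline to a pre-created compute profile.
    # "my-existing-dataproc-profile" is hypothetical; the scope prefix
    # (SYSTEM:/USER:) depends on where the profile was created.
    resp = requests.put(
        f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/my-triggered-pipeline/preferences",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json={"system.profile.name": "USER:my-existing-dataproc-profile"},
    )
    resp.raise_for_status()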
2 votes • 1 answer

Cloud Data Fusion pricing - development vs execution

We are looking to get some clarity on Cloud Data Fusion pricing. It looks like if we create a Cloud Data Fusion instance, we incur hourly rates for as long as the instance is alive. This can be quite high: $1100 per month for development and $3000 per…
sacoder • 159
2 votes • 1 answer

How to run a preview on a private instance?

Our pipeline fetches data from the internet. Preview mode doesn't work on my private Cloud Data Fusion instance; I get a timeout each time. The same jobs work when deployed. Note that I am required to use a private instance. How can I get a preview that…
Arthur • 21
2 votes • 1 answer

Google Cloud Data Fusion - Dynamic arguments based on functions

Good morning all, I'm looking for a way in Google Data Fusion to make the name of a source file stored on GCS dynamic. The files to be processed are named according to their value date, for example: 2020-12-10_data.csv. My need would be to set the…
Savannah • 103
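
Data Fusion plugin properties accept CDAP macros, and CDAP's macro functions include logicalStartTime, so a GCS path such as gs://my-bucket/${logicalStartTime(yyyy-MM-dd)}_data.csv (bucket name hypothetical) may already cover this. Another option is to compute the name outside the pipeline and pass it as a runtime argument to the start call; a minimal sketch, where the pipeline name and the input.path argument are placeholders and the GCS source's path property would be set to ${input.path}:

    import os
    from datetime import date

    import requests

    CDAP_ENDPOINT = os.environ["CDAP_ENDPOINT"]  # Data Fusion instance API endpoint
    AUTH_TOKEN = os.environ["AUTH_TOKEN"]        # OAuth access token

    # Build today's file name (e.g. 2020-12-10_data.csv) and pass it as the
    # hypothetical runtime argument "input.path".
    runtime_args = {"input.path": f"gs://my-bucket/{date.today():%Y-%m-%d}_data.csv"}

    resp = requests.post(
        f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/my-pipeline"
        "/workflows/DataPipelineWorkflow/start",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json=runtime_args,
    )
    resp.raise_for_status()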
2 votes • 2 answers

Fetching Cloud Data Fusion runtime info

I want to pass the run ID of a Data Fusion pipeline to some function upon pipeline completion, but I am not able to find any runtime variable which holds this value. Please help!
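
If no runtime variable fits, the run ID can be recovered after completion from the CDAP runs endpoint; a minimal sketch, assuming a deployed batch pipeline (pipeline name is a placeholder):

    import os

    import requests

    CDAP_ENDPOINT = os.environ["CDAP_ENDPOINT"]  # Data Fusion instance API endpoint
    AUTH_TOKEN = os.environ["AUTH_TOKEN"]        # OAuth access token

    # Fetch the most recent run record of the pipeline; each record carries
    # "runid" and "status" fields.
    resp = requests.get(
        f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/my-pipeline"
        "/workflows/DataPipelineWorkflow/runs",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        params={"limit": 1},
    )
    resp.raise_for_status()
    latest = resp.json()[0]  # newest run first
    print(latest["runid"], latest["status"])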
2 votes • 1 answer

How can I speed up a GCP Data Fusion data pipeline?

About 300 TB of data is being transferred to BigQuery using Google Cloud Data Fusion (option: dev). It currently takes 34 minutes to process approximately 16 GB, so it would take about 10 days to process 6 TB of data. What settings can be modified in…
Quack • 680
2 votes • 3 answers

Pipeline Dependencies in Data Fusion

I have three pipelines in Data Fusion, say A, B, and C. I want pipeline C to be triggered after both pipeline A and pipeline B complete. Pipeline triggers only let you set the dependency on one pipeline. Can this be implemented in Data…
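
Since a native trigger keys off a single upstream pipeline, one hedged workaround is external orchestration: poll the latest run of A and B through the CDAP REST API and start C once both report COMPLETED. A minimal sketch (pipeline names A, B, and C mirror the question; everything else is a placeholder):

    import os
    import time

    import requests

    CDAP_ENDPOINT = os.environ["CDAP_ENDPOINT"]  # Data Fusion instance API endpoint
    HEADERS = {"Authorization": f"Bearer {os.environ['AUTH_TOKEN']}"}


    def latest_status(pipeline: str) -> str:
        """Return the status of the pipeline's most recent run, or NONE."""
        resp = requests.get(
            f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{pipeline}"
            "/workflows/DataPipelineWorkflow/runs",
            headers=HEADERS,
            params={"limit": 1},
        )
        resp.raise_for_status()
        runs = resp.json()
        return runs[0]["status"] if runs else "NONE"


    # Wait until the latest run of both A and B has completed, then start C.
    while not all(latest_status(p) == "COMPLETED" for p in ("A", "B")):
        time.sleep(60)

    requests.post(
        f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/C"
        "/workflows/DataPipelineWorkflow/start",
        headers=HEADERS,
    ).raise_for_status()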
2 votes • 2 answers

Data Fusion could not parse response from JSON

I am using the CDAP reference to start a Data Fusion batch pipeline (GCS to GCS): curl -w "\n" -X POST -H "Authorization: Bearer ${AUTH_TOKEN}" "${CDAP_ENDPOINT}/v3/namespaces/default/apps/${PIPELINE_NAME}/workflows/DataPipelineWorkflow/start" -d…
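
Errors like this often trace back to a malformed -d body. A hedged sketch of the same call in Python, where requests serializes the runtime arguments so the request body is always well-formed JSON (the environment variables mirror the ones in the curl command):

    import os

    import requests

    CDAP_ENDPOINT = os.environ["CDAP_ENDPOINT"]
    AUTH_TOKEN = os.environ["AUTH_TOKEN"]
    PIPELINE_NAME = os.environ["PIPELINE_NAME"]

    # json= serializes the runtime arguments; an empty dict is fine when
    # there are none.
    resp = requests.post(
        f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE_NAME}"
        "/workflows/DataPipelineWorkflow/start",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
        json={},
    )
    print(resp.status_code, resp.text)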
2 votes • 0 answers

Google Cloud Data Fusion JDBC Connection error with Google Compute Engine deployed MySQL

I have a MySQL database deployed to a Google Compute Engine instance and I'm trying to move the data to BigQuery for some analysis. I'm trying to get this working with Google Data Fusion, but I'm encountering the following error:…
2 votes • 2 answers

Connect BigQuery as a source to Data Fusion in another GCP project

I am trying to connect BigQuery in ProjectA to Data Fusion in ProjectB, and it's asking me to enter a service account key file. I have tried uploading the service key file to Cloud Storage in ProjectB and providing the link, but it's asking me to provide a…
2 votes • 1 answer

Cloud Data Fusion Preview environment

We can configure the compute profile to run the pipeline on a custom cluster that I create; however, for preview I cannot specify the compute profile. There are some custom transformations I need to use which require me to install some external jar…
Trishit Ghosh • 235
2 votes • 1 answer

Can't get Cloud Data Fusion run to stop

I have several Data Fusion pipelines that all do the same basic tasks: insert data into a table in BigQuery, load it into S3, and then truncate the BigQuery table. Everything looks OK until I get the 'pipeline xxx succeed' log, but then it goes into a…
EyalMech • 21
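
When the UI won't stop a run, the CDAP REST API exposes a per-run stop endpoint; a minimal sketch (the pipeline name and run ID are placeholders, with the run ID taken from the pipeline's run history):

    import os

    import requests

    CDAP_ENDPOINT = os.environ["CDAP_ENDPOINT"]  # Data Fusion instance API endpoint
    AUTH_TOKEN = os.environ["AUTH_TOKEN"]        # OAuth access token
    RUN_ID = "<run-id>"                          # from the pipeline's run history

    # Ask CDAP to stop the specific run that is stuck.
    resp = requests.post(
        f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/my-pipeline"
        f"/workflows/DataPipelineWorkflow/runs/{RUN_ID}/stop",
        headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    )
    resp.raise_for_status()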