At the moment, Cloudera Manager’s Backup and Disaster Recovery (BDR) does not support Google Cloud Storage; it is listed under limitations. Please check the full documentation through this link for Configuring Google Cloud Storage Connectivity.
The above approach will work. We just need to add a few steps to begin with:
- First, we need to establish a private link between the on-premises network and the Google network using Cloud Interconnect or Cloud VPN.
- A Dataproc cluster is needed for the data transfer.
- Use the Google Cloud CLI (gcloud) to connect to the cluster's master instance.
- Finally, you can run `DistCp` commands to move your data (a minimal sketch follows this list).
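As a rough sketch of those last two steps, the commands might look like the following; the project, zone, cluster name, bucket, and paths are placeholders for illustration, not values from your environment:

```
# Hypothetical names throughout -- replace with your own project, zone, cluster, bucket, and paths.

# Connect to the Dataproc cluster's master instance with the gcloud CLI.
gcloud compute ssh my-dataproc-cluster-m \
    --project=my-project \
    --zone=us-central1-a

# From the master node, copy a directory from HDFS into a Cloud Storage bucket with DistCp.
hadoop distcp \
    hdfs:///user/hive/warehouse/my_db \
    gs://my-backup-bucket/hive/my_db
```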
For more detailed information, you can check the full documentation on Using DistCp to copy your data to Cloud Storage.
Google also has its own BDR offering; you can check this Data Recovery planning guide.
Please be advised that Google Cloud Storage cannot be the default file system for the cluster.
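In practice, this means the cluster's default file system (for example, HDFS) stays as it is, and Cloud Storage is always addressed with fully qualified gs:// URIs. A small illustration, with a placeholder bucket name:

```
# Paths without a scheme resolve against the cluster's default file system (e.g. HDFS).
hadoop fs -ls /user/myuser

# Cloud Storage has to be referenced with a fully qualified gs:// URI instead.
hadoop fs -ls gs://my-backup-bucket/hive/my_db
```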
You can also check this link: Working with Google Cloud partners
You can use the connector in any of the following ways:
- In a Spark (or PySpark) or Hadoop application, using the `gs://` prefix.
- From the Hadoop shell: `hadoop fs -ls gs://bucket/dir/file`.
- In the Cloud Console Cloud Storage browser.
- With the `gsutil cp` or `gsutil rsync` commands (see the examples after this list).
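For instance, the gsutil commands could be used like this; the local paths and bucket name below are only placeholders:

```
# Copy a single local file into a bucket.
gsutil cp /data/export/part-00000.csv gs://my-backup-bucket/export/

# Recursively mirror a local directory to a bucket prefix
# (add -d if you also want remote files deleted when they disappear locally).
gsutil rsync -r /data/export gs://my-backup-bucket/export
```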
You can check the full documentation on using the connectors.
Let me know if you have questions.