
How can I execute HDFS copy commands on a Dataproc cluster using Airflow? After the cluster is created with Airflow, I have to copy a few jar files from Google Cloud Storage to an HDFS folder on the master node.

2 Answers


You can execute HDFS commands on a Dataproc cluster with something like this:

gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --region=europe-west1

The easiest way is [1] via

gcloud dataproc jobs submit pig --execute 'fs -ls /'

or otherwise [2] (Pig's sh command) as a catch-all for other shell commands.
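Since the cluster itself is created from Airflow, the same Pig fs trick can be wrapped in a Dataproc job operator rather than run by hand. Below is a minimal sketch, assuming a recent apache-airflow-providers-google release (whose DataprocSubmitJobOperator takes a region argument); the project, region, cluster name, bucket and HDFS paths are placeholders, not values from the question:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "my-project"      # placeholder
REGION = "europe-west1"        # placeholder
CLUSTER_NAME = "my-cluster"    # placeholder

# A Pig job whose only statement is a Grunt fs command (see [1]),
# copying a jar from GCS into HDFS on the cluster.
COPY_JAR_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pig_job": {
        "query_list": {"queries": ["fs -cp gs://my-bucket/jars/my-lib.jar /tmp/my-lib.jar"]}
    },
}

with DAG("copy_jars_to_hdfs", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    copy_jar = DataprocSubmitJobOperator(
        task_id="copy_jar_to_hdfs",
        project_id=PROJECT_ID,
        region=REGION,
        job=COPY_JAR_JOB,
    )

In a real DAG this task would typically be chained right after the cluster-creation task, e.g. create_cluster >> copy_jar.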

For a single small file

You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:

hdfs dfs -cp gs://<bucket>/<object> <hdfs path>

This works because

hdfs://<master node> 

is the default filesystem. You can explicitly specify the scheme and NameNode if desired:

hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>

For a large file or large directory of files

When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:

hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
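To drive DistCp from Airflow rather than from a shell on the master, one option is the sh catch-all from [2]: submit a Pig job whose only statement shells out to hadoop distcp. A sketch reusing the imports and placeholder names from the operator example above; only the job payload and task change:

# Pig's sh command (see [2]) runs an arbitrary shell command on the job's driver node,
# here hadoop distcp, which copies the whole directory in parallel across the cluster.
DISTCP_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pig_job": {
        "query_list": {"queries": ["sh hadoop distcp gs://my-bucket/jars/ hdfs:///tmp/jars/"]}
    },
}

distcp_jars = DataprocSubmitJobOperator(
    task_id="distcp_jars_to_hdfs",
    project_id=PROJECT_ID,
    region=REGION,
    job=DISTCP_JOB,
)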

Consider [3] for details.

[1] https://pig.apache.org/docs/latest/cmds.html#fs

[2] https://pig.apache.org/docs/latest/cmds.html#sh

[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

Pooja S

I am not sure about your use case for doing this via Airflow, because if it is a one-time setup then I think you can run the commands directly on the Dataproc cluster. But I found some links which might be of some help. As I understand it, you can use the BashOperator to run such commands over SSH; a rough sketch follows the links below.

https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/

Airflow Dataproc operator to run shell scripts
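As a rough illustration of the BashOperator approach from the first link: SSH to the cluster's master node (which Dataproc names <cluster>-m) and run the copy there. The zone, cluster name and paths below are assumptions, and the gcloud CLI must be installed and authenticated wherever the Airflow workers run:

from airflow.operators.bash import BashOperator

# Inside an existing DAG; master-node name, zone and paths are placeholders.
copy_jar_via_ssh = BashOperator(
    task_id="copy_jar_via_ssh",
    bash_command=(
        "gcloud compute ssh my-cluster-m --zone=europe-west1-b "
        "--command='hdfs dfs -cp gs://my-bucket/jars/my-lib.jar /tmp/my-lib.jar'"
    ),
)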