How can I execute HDFS copy commands on a Dataproc cluster using Airflow? After the cluster is created by Airflow, I need to copy a few JAR files from Google Cloud Storage to an HDFS folder on the master node.
2 Answers
You can execute HDFS commands on a Dataproc cluster using something like this:
gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --region=europe-west1
The easiest way is via Pig's fs command [1]:
gcloud dataproc jobs submit pig --execute 'fs -ls /'
Alternatively, Pig's sh command [2] works as a catch-all for other shell commands.
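Since the question is about Airflow, the Pig fs approach can also be written as a Dataproc job payload. The following is a minimal sketch, not a tested DAG: the helper name pig_fs_job and the project, cluster, and path values are hypothetical, and the dictionary is assumed to follow the Dataproc Job resource layout (reference, placement, pig_job.query_list.queries).

```python
def pig_fs_job(project_id, cluster_name, fs_args):
    """Build a Dataproc Pig job payload that runs a single `fs` command.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        # Pig's `fs` command forwards its arguments to the Hadoop FsShell.
        "pig_job": {"query_list": {"queries": ["fs " + " ".join(fs_args)]}},
    }


# Placeholder values; replace with your own project, cluster, and paths.
job = pig_fs_job(
    "my-project",
    "my-cluster",
    ["-cp", "gs://my-bucket/libs/app.jar", "/user/jars/"],
)
print(job["pig_job"]["query_list"]["queries"][0])
```

In a DAG, a payload like this would typically be passed as the job argument of DataprocSubmitJobOperator (from the apache-airflow-providers-google package), together with region and project_id, in a task that runs downstream of the cluster-creation task.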
For a single small file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
For a large file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
See [3] for details.
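The DistCp invocation above can also be submitted as a Dataproc Hadoop job instead of being typed on a cluster node. This is a minimal sketch under assumptions: the helper name distcp_job and all values are hypothetical, and it assumes the DistCp driver class org.apache.hadoop.tools.DistCp is available on the job classpath of your cluster image, which you should verify before relying on it.

```python
def distcp_job(project_id, cluster_name, source, target):
    """Build a Dataproc Hadoop job payload that runs DistCp.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "hadoop_job": {
            # Assumption: this driver class is on the cluster's classpath.
            "main_class": "org.apache.hadoop.tools.DistCp",
            # DistCp takes the source first, then the target directory.
            "args": [source, target],
        },
    }


# Placeholder values; replace with your own bucket and HDFS directory.
job = distcp_job(
    "my-project",
    "my-cluster",
    "gs://my-bucket/libs/",
    "hdfs:///user/jars/",
)
print(job["hadoop_job"]["args"])
```

As with any Dataproc job payload, this could be handed to an Airflow Dataproc operator so that the copy runs in parallel on the cluster rather than through a single node.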
[1] https://pig.apache.org/docs/latest/cmds.html#fs
[2] https://pig.apache.org/docs/latest/cmds.html#sh
[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

- Hi Pooja, thanks for your answer. – Surendranatha Reddy Chappidi May 06 '21 at 04:10
- How do I execute it using Airflow? – Surendranatha Reddy Chappidi May 06 '21 at 04:11
- Once the HDFS commands work on Dataproc as described in the answer above, use the Dataproc operators to run them from Airflow. For example, DataProcHadoopOperator starts a Hadoop job on a Cloud Dataproc cluster. – Pooja S May 06 '21 at 12:35
I am not sure why you need to do this via Airflow: if it is a one-time setup, I think you can run the commands directly on the Dataproc cluster. I did find some links that might be of some help. As I understand it, you can use BashOperator to run the commands.
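One thing to keep in mind with BashOperator is that the command runs on the Airflow worker, not on the cluster, so it has to reach HDFS through something like gcloud. Below is a minimal sketch of building such a command string with safe quoting. The helper name gcloud_pig_fs_command and the cluster, region, and path values are hypothetical, and it assumes gcloud is installed and authenticated on the worker.

```python
import shlex


def gcloud_pig_fs_command(cluster, region, fs_command):
    """Return a shell-safe gcloud command that runs a Pig `fs` command.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    args = [
        "gcloud", "dataproc", "jobs", "submit", "pig",
        f"--cluster={cluster}",
        f"--region={region}",
        "--execute", fs_command,
    ]
    # shlex.join quotes each argument so spaces in fs_command survive the shell.
    return shlex.join(args)


# Placeholder values; replace with your own cluster, region, and paths.
cmd = gcloud_pig_fs_command(
    "my-cluster",
    "europe-west1",
    "fs -cp gs://my-bucket/libs/app.jar /user/jars/",
)
print(cmd)
```

The resulting string could then be used as the bash_command of a BashOperator task placed after the cluster-creation task.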
