
I'm exploring the Python package mrjob to run MapReduce jobs in Python. I've tried running it in the local environment and it works perfectly.

I have Hadoop 3.3 running on a Kubernetes (GKE) cluster, and I also managed to run mrjob successfully from inside the name-node pod.

Now, I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace). I wonder whether I can run mrjob MapReduce jobs from the Jupyter Notebook.

The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment. So, based on the documentation, I created a config file called mrjob.conf as follows:

runners:
 hadoop:
  cmdenv:
    PATH: <pod name>:/opt/hadoop

However, mrjob is still unable to detect the Hadoop binary and gives the error below:

FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'

So, is there a way to configure mrjob to run against my existing Hadoop installation on the GKE cluster? I've searched for similar examples but was unable to find one.

Thisara Watawana
1 Answer


mrjob is a wrapper around Hadoop Streaming, and therefore requires the Hadoop binaries to be installed on every server where the code will run (pods, in this case), including the Jupyter pod that submits the application.
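If you do install the Hadoop client binaries into the Jupyter pod (e.g. by baking them into the image), note that `cmdenv` only sets environment variables inside the launched tasks; the binary itself is located on the submitting machine via `$HADOOP_HOME`/`$PATH`, or explicitly via the `hadoop_bin` option. A sketch of that config, with paths assumed from a typical `/opt/hadoop` install:

```yaml
runners:
  hadoop:
    # path to the hadoop executable inside the Jupyter pod (assumed install location)
    hadoop_bin: /opt/hadoop/bin/hadoop
    # optional: where the streaming jar lives; the version in the filename is an assumption
    hadoop_streaming_jar: /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar
```

Alternatively, exporting `HADOOP_HOME=/opt/hadoop` in the Jupyter pod's environment (not in `cmdenv`) lets mrjob find `bin/hadoop` on its own.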

IMO, it would be much easier for you to deploy PySpark/PyFlink/Beam applications on Kubernetes than Hadoop Streaming, since you don't "need" Hadoop in Kubernetes to run such distributed processes.

Beam would be the recommended option, since it is compatible with GCP Dataflow.

OneCricketeer
  • Thanks for the clarification. I do have a Spark instance running in the Kubernetes cluster, which works fine. I wanted to know whether the same setup can be applied to MapReduce as well. – Thisara Watawana Oct 25 '22 at 04:58
    Sure, it can. But you will need hadoop binaries in each pod, as mentioned. mrjob doesn't come with any Kubernetes functions itself... Or you can [use `mrjob` for Spark](https://mrjob.readthedocs.io/en/latest/guides/spark.html#why-use-mrjob-with-spark) – OneCricketeer Oct 25 '22 at 15:16
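For reference, if you take the mrjob-on-Spark route, pointing mrjob's Spark runner at a Kubernetes master is a small config change. A sketch, where `<api-server>` is a placeholder for your cluster's API server address:

```yaml
runners:
  spark:
    # spark-submit style master URL; <api-server> is a placeholder
    spark_master: k8s://https://<api-server>:443
```

Spark-on-Kubernetes still needs a container image with your job's Python dependencies, but no Hadoop installation.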