I'm exploring the Python package mrjob to run MapReduce jobs in Python. I've tried running it in the local environment and it works perfectly.
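For reference, the kind of job I'm testing with is essentially a word count along these lines (the file and class names are just placeholders):

from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit one (word, 1) pair per word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum up the counts for each word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()

Locally I just run it with python mr_word_count.py input.txt (the default inline runner) and it works as expected.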
I have Hadoop 3.3 running on a Kubernetes (GKE) cluster, and I've also managed to run mrjob successfully from inside the name-node pod.
Now I've got a Jupyter Notebook pod running in the same Kubernetes cluster (same namespace), and I'm wondering whether I can run mrjob MapReduce jobs from the Jupyter Notebook.
The problem seems to be that I don't have $HADOOP_HOME defined in the Jupyter Notebook environment, so, based on the documentation, I created a config file called mrjob.conf as follows:
runners:
  hadoop:
    cmdenv:
      PATH: <pod name>:/opt/hadoop
However, mrjob is still unable to find the hadoop binary and gives the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'hadoop'
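For context, this is roughly how I'm launching the job from a notebook cell; the job module name, config path, and input path below are just placeholders:

from mr_word_count import MRWordCount  # the word-count job sketched above, saved as its own file

mr_job = MRWordCount(args=[
    '-r', 'hadoop',                            # use the hadoop runner
    '--conf-path', '/home/jovyan/mrjob.conf',  # placeholder path to the config above
    'hdfs:///user/test/input.txt',             # placeholder input path
])
with mr_job.make_runner() as runner:
    runner.run()  # this is the call that raises the FileNotFoundError
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)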
So, is there a way I can configure mrjob to run against my existing Hadoop installation on the GKE cluster? I've tried searching for similar examples but wasn't able to find one.