I am running MapReduce jobs on Hive, and most of the code already resides in a git repo. I know I can include instructions in the bootstrap script when spinning up clusters, but is it possible to do both of the following:
- Adjust the Python path in the bash_profile so that the functions in the repo can be imported
- Pull the git repo and make all of its scripts available to the Hive scripts?
For the second point, how would I reference a script from the git repo in my Hive script, as in the sample below?
FROM (
    MAP
        table.values
    USING
        'python script_from_repo.py'
    AS params
    FROM
        big_table
) ..........;
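For context, here is a minimal sketch of what a script like script_from_repo.py does when called through MAP ... USING. Hive streams each input row to the script on stdin as one tab-separated line and reads tab-separated rows back from stdout. The function names transform and run, and the column-counting logic, are placeholders I made up for illustration, not the actual code in the repo.

```python
import sys


def transform(values):
    """Hypothetical per-row logic: return the number of input columns."""
    return [str(len(values))]


def run(lines, out):
    # Hive sends each input row as one line, with columns separated by tabs.
    for line in lines:
        values = line.rstrip("\n").split("\t")
        # Write one tab-separated output row per input row.
        out.write("\t".join(transform(values)) + "\n")


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

A script of this shape can be checked outside Hive by piping a few tab-separated lines into it, for example: printf 'a\tb\n' | python script_from_repo.py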
I would really appreciate any help.