We run a number of scripts for every release of our platform, and we want to automate running these scripts with Snakemake. The plan is to fire up a VM on Google Cloud and run Snakemake there, with the locations of the input/output files read from a YAML file.
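For context, the setup is roughly along these lines (rule names, keys, and paths below are made up for illustration, not our actual pipeline):

```
# Snakefile (sketch)
configfile: "config.yaml"  # the YAML holding the gs:// input/output locations

rule release_report:
    params:
        src=config["input_dir"],   # e.g. gs://my-bucket/release/input
        dst=config["output_dir"],  # e.g. gs://my-bucket/release/output
    output:
        touch("logs/release_report.done")  # local marker; the real outputs live in GCS
    shell:
        "python scripts/make_report.py {params.src} {params.dst}"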
Things work pretty well, except for the scripts that use PySpark and read source files from Google Cloud Storage buckets. It seems PySpark uses Hadoop to read files from gs:// locations, so Hadoop needs to be properly configured alongside Spark.
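To show the kind of read involved, here is a stripped-down sketch; the bucket path and connector version are placeholders, and the GCS connector settings are our guess at what is needed rather than a known-working configuration:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("release-scripts")
    # GCS connector pulled from Maven; version is a placeholder, check Maven Central
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.21")
    # Register the gs:// filesystem implementations with Hadoop
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Use the VM's service account credentials
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .getOrCreate()
)

df = spark.read.parquet("gs://my-bucket/release/input/")  # placeholder path
df.show(5)
```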
We haven't yet figured out how to set up the Hadoop environment properly; whenever we run the Spark script it prints:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
When using Dataproc we don't have these issues, but we couldn't reproduce that environment ourselves, and I'm not sure whether it is even possible to submit a full Snakemake pipeline to Dataproc with all of its dependencies.
Is it possible to set up a Hadoop environment without using Dataproc? Do you have any other tips on how to handle this workflow?