We run a number of scripts for every release of our platform, and we want to automate running these scripts with Snakemake. The plan is to fire up a VM on Google Cloud and run Snakemake there, with the locations of the input/output files read from a YAML file.
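For context, the setup is roughly along these lines (rule names, keys, and paths below are made up for illustration, not our actual pipeline):

```
# Snakefile (sketch)
configfile: "config.yaml"  # the YAML holding the gs:// input/output locations

rule release_report:
    params:
        src=config["input_dir"],   # e.g. gs://my-bucket/release/input
        dst=config["output_dir"],  # e.g. gs://my-bucket/release/output
    output:
        touch("logs/release_report.done")  # local marker; the real outputs live in GCS
    shell:
        "python scripts/make_report.py {params.src} {params.dst}"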
Things work pretty well, except for the scripts that use PySpark and read source files from Google Cloud Storage buckets. It seems PySpark uses Hadoop to read files from gs:// locations, so Hadoop needs to be properly configured alongside Spark.
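To show the kind of read involved, here is a stripped-down sketch; the bucket path and connector version are placeholders, and the GCS connector settings are our guess at what is needed rather than a known-working configuration:

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("release-scripts")
    # GCS connector pulled from Maven; version is a placeholder, check Maven Central
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.21")
    # Register the gs:// filesystem implementations with Hadoop
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Use the VM's service account credentials
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .getOrCreate()
)

df = spark.read.parquet("gs://my-bucket/release/input/")  # placeholder path
df.show(5)
```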
We haven't yet figured out how to set up the Hadoop environment properly; whenever we run the Spark script it prints:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
When using Dataproc we don't have these issues, but we couldn't reproduce that environment ourselves, and I'm not sure whether it is even possible to submit a full Snakemake pipeline to Dataproc with all of its dependencies.
Is it possible to set up a Hadoop environment without using Dataproc? Do you have any other tips on how to handle this workflow?