Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any question related to using/troubleshooting Google Cloud Dataproc.

1563 questions
11 votes · 1 answer

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input / Output…
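
A minimal sketch of the pattern this question is after, using the BigQuery Hadoop connector shipped on Dataproc clusters. The project, dataset, table, and bucket names are placeholders:

```python
import json
from pyspark import SparkContext

sc = SparkContext()

# Placeholder names; the conf keys and input format class are those of
# Dataproc's BigQuery Hadoop connector.
conf = {
    "mapred.bq.project.id": "my-project",           # project running the job
    "mapred.bq.gcs.bucket": "my-staging-bucket",    # bucket for the temp export
    "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq-tmp",
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "my_table",
}

# Each record arrives as (row key, JSON string of the row); parse the value.
rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
).map(lambda kv: json.loads(kv[1]))
```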
10 votes · 1 answer

PySpark print to console

When running a PySpark job on the dataproc server like this: gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster> …, my print statements don't show up in my terminal. Is there any way to output data…
Roman · 8,826
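
The short version of what answers to this question usually point out: `print()` goes to the job's driver output (visible in the Cloud Console or via `gcloud dataproc jobs wait <job-id>`), not to the terminal that submitted the job. A hedged sketch of the common logging workaround:

```python
import logging
import sys

# Write through an explicit stderr handler so messages land in the
# Dataproc driver output alongside Spark's own logs.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)
log.info("this line shows up in the Dataproc job driver output")
```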
10 votes · 3 answers

Guava version while using spark-shell

I'm trying to use the spark-cassandra-connector via spark-shell on Dataproc; however, I am unable to connect to my cluster. It appears that there is a version mismatch, since the classpath includes a much older Guava version from somewhere else,…
10 votes · 1 answer

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the…
bjorndv · 523
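
The usual CLI shape here is `gcloud dataproc jobs submit pyspark gs://bucket/main.py --cluster=<cluster> --py-files=gs://bucket/libs.zip`. An alternative that sidesteps the flag entirely is to distribute the zip from inside the driver; a minimal sketch, where the bucket path and module name are hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext()

# Ship the zip to driver and executors; the zip root must contain the
# importable packages. gs://my-bucket/libs.zip is a placeholder path.
sc.addPyFile("gs://my-bucket/libs.zip")

import mylib  # hypothetical module, importable after addPyFile
```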
9 votes · 1 answer

How does a Dataproc Spark operator return a value and how to capture and return it

How does a Dataproc Spark operator in Airflow return a value, and how can I capture it? I have a downstream job which captures this result; based on the returned value, I have to trigger another job via a branch operator.
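
A sketch of the branching half of this setup, assuming the Dataproc submit task pushes something useful to XCom (what it pushes depends on the operator and provider version; all task ids here are hypothetical):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import BranchPythonOperator

def choose_branch(**context):
    # Pull whatever the upstream Dataproc task returned via XCom.
    result = context["ti"].xcom_pull(task_ids="submit_spark_job")
    return "trigger_next_job" if result else "skip_job"

with DAG(
    "dataproc_branching",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="branch_on_result",
        python_callable=choose_branch,
    )
```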
9 votes · 3 answers

ModuleNotFoundError because PySpark serializer is not able to locate library folder

I have the following folder structure: a libfolder/ directory containing lib1.py and lib2.py, plus main.py. main.py calls libfolder.lib1, which then calls libfolder.lib2 and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc…
Golak Sarangi · 809
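
The root cause is typically that the package only exists on the driver's machine, so the serializer can't resolve it on executors. A minimal sketch of the common fix: make libfolder a proper package (with an __init__.py) and ship it as a zip:

```python
import shutil
from pyspark import SparkContext

# Package the local "libfolder" directory into libfolder.zip so that the
# zip's root contains libfolder/ itself.
shutil.make_archive("libfolder", "zip", ".", "libfolder")

sc = SparkContext()
sc.addPyFile("libfolder.zip")  # now resolvable on driver and executors

from libfolder import lib1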
9 votes · 4 answers

How to run python3 on google's dataproc pyspark

I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of the default 2.7. The best I've been able to find is adding these initialization commands. However, when I ssh…
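
The interpreter is chosen cluster-side (for example via the `spark.pyspark.python` property or a PYSPARK_PYTHON export in an initialization action), so a useful first step is verifying which interpreter the driver and executors actually run. A small diagnostic sketch:

```python
import sys
from pyspark import SparkContext

sc = SparkContext()

# Interpreter on the driver.
print("driver:", sys.version)

# Interpreters on the executors: run a trivial task per partition.
versions = (sc.parallelize(range(4), 4)
              .map(lambda _: sys.version)
              .distinct()
              .collect())
print("executors:", versions)
```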
9 votes · 2 answers

Request insufficient authentication scopes when running Spark-Job on dataproc

I am trying to run the Spark job on the Google Dataproc cluster as: gcloud dataproc jobs submit hadoop --cluster <cluster> \ --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \ --class org.apache.hadoop.examples.WordCount…
Vishal · 1,442
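
On Dataproc VMs the usable OAuth scopes are fixed at cluster creation time (the `--scopes` flag), which is usually where this error comes from. A hedged snippet for inspecting what the application default credentials on a node look like:

```python
import google.auth

# On a Dataproc VM these credentials come from the cluster's service
# account; the scopes attribute may be None depending on credential type.
credentials, project = google.auth.default()
print("project:", project)
print("scopes:", getattr(credentials, "scopes", None))
```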
9 votes · 1 answer

spark "basePath" option setting

When I do: allf = spark.read.parquet("gs://bucket/folder/*") I get: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths: ... And the following message after the list of paths: If provided…
jldupont · 93,734
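
The `basePath` option tells Spark where partition discovery should be rooted, which is the standard way out of the "Conflicting directory structures" assertion. A minimal sketch, with placeholder bucket and folder names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Anchor partition discovery at the folder root so Spark treats the
# subdirectories under it as partitions of one table.
allf = (spark.read
        .option("basePath", "gs://bucket/folder/")
        .parquet("gs://bucket/folder/*"))
```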
8 votes · 4 answers

How to run spark 3.2.0 on google dataproc?

Currently, Google Dataproc does not have Spark 3.2.0 as an image. The latest available is 3.1.2. I want to use the pandas-on-PySpark functionality that Spark released with 3.2.0. I am doing the following steps to use Spark 3.2.0: created an…
figs_and_nuts · 4,870
8 votes · 3 answers

Feature Selection in PySpark

I am working on a machine learning model over a dataset of shape 1,456,354 × 53. I wanted to do feature selection for my data set. I know how to do feature selection in Python using the following code: from sklearn.feature_selection import RFECV, RFE logreg =…
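
The scikit-learn recursive-elimination approach doesn't translate directly, but pyspark.ml ships its own selectors. A hedged sketch using ChiSqSelector, with a hypothetical DataFrame of 53 numeric columns plus a binary "label" column:

```python
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("gs://bucket/training_data")  # placeholder path
feature_cols = [c for c in df.columns if c != "label"]

# pyspark.ml selectors operate on a single vector column.
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(df)

# Keep the 20 features most associated with the label (chi-squared test).
selector = ChiSqSelector(numTopFeatures=20, featuresCol="features",
                         labelCol="label", outputCol="selected")
selected = selector.fit(assembled).transform(assembled)
```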
8 votes · 3 answers

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
8 votes · 2 answers

Why does Spark (on Google Dataproc) not use all vcores?

I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use…
borarak · 1,130
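
Part of the answer is that YARN on Dataproc schedules containers by memory by default, so the vcore count shown in the UI can be misleading; the other part is sizing executors explicitly. A sketch with placeholder numbers:

```python
from pyspark.sql import SparkSession

# Explicit executor sizing so Spark requests the core count you expect,
# rather than whatever the defaults work out to. Values are placeholders.
spark = (SparkSession.builder
         .config("spark.executor.cores", "4")
         .config("spark.executor.instances", "8")
         .getOrCreate())
```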
8 votes · 3 answers

How to read simple text file from Google Cloud Storage using Spark-Scala local Program

As described in the blog post https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview, I was trying to read a file from Google Cloud Storage using Spark-Scala. For that I have imported…
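
The blog's example targets Scala; a hedged PySpark equivalent for a *local* (non-Dataproc) run needs the GCS connector jar on the classpath and Hadoop configuration pointing at a service-account key. The jar and key paths and the bucket name are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # GCS connector jar must be on the classpath when running locally.
         .config("spark.jars", "/path/to/gcs-connector-hadoop3-latest.jar")
         .getOrCreate())

# Wire the gs:// filesystem and service-account auth into the Hadoop conf.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl",
          "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile",
          "/path/to/key.json")

lines = spark.read.text("gs://my-bucket/simple.txt")  # placeholder object
lines.show()
```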
8 votes · 1 answer

How to get path to the uploaded file

I am running a Spark cluster on Google Cloud and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command? In the example below, how can I read the file Configuration.properties before the…
orestis · 932
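
Files passed with `--files` (or added via SparkContext.addFile) are staged in each task's working directory, and SparkFiles resolves the local path. A minimal sketch:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext()

# Resolve the staged copy of a file that was shipped with the job
# (e.g. via `--files Configuration.properties` at submit time).
path = SparkFiles.get("Configuration.properties")
with open(path) as f:
    config_text = f.read()
```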