Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying and managing clusters and for submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1563 questions
8 votes · 3 answers

Read from BigQuery into Spark efficiently?

When using the BigQuery Connector to read data from BigQuery, I found that it first copies all the data to Google Cloud Storage and then reads it into Spark in parallel. With a big table, the copy stage takes a very long time. So is…
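The staging-to-GCS behavior described above is how the older Hadoop BigQuery connector works; the newer spark-bigquery-connector reads over the BigQuery Storage API without the intermediate copy. A minimal sketch, assuming the connector jar is available on the cluster and using a placeholder table name:

```python
# Sketch: read a BigQuery table directly into Spark via the
# spark-bigquery-connector (table name is a placeholder).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

df = (spark.read
      .format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())
df.printSchema()
```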
8 votes · 2 answers

What is the default password for Jupyter created on Google's Dataproc?

I set up Dataproc using the steps at https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook, but Jupyter keeps asking for a password. I didn't set any password, and my Google account password doesn't work. I ran ../root$…
Watt · 3,118
8 votes · 1 answer

Pausing a Dataproc cluster - Google Compute Engine

Is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs? The cluster management instructions at this link:…
femibyte · 3,317
8 votes · 1 answer

Google Cloud Dataproc configuration issues

I've been encountering various issues (mainly disassociation errors at seemingly random intervals) in some Spark LDA topic modeling I've been running, which I think mostly come down to insufficient memory allocation on my executors. This would…
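If executor memory really is the bottleneck, per-job Spark properties can be raised at submit time. A hedged sketch — cluster name, class, jar path, and memory values are all placeholders, not tuned recommendations:

```shell
# Submit the job with larger executor memory and overhead.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.LdaJob \
  --jars=gs://my-bucket/lda-job.jar \
  --properties=spark.executor.memory=6g,spark.executor.memoryOverhead=1g
```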
7 votes · 2 answers

Dataproc does not unpack files passed as Archive

I'm trying to submit a .NET Spark job to Dataproc. The command line looks like: gcloud dataproc jobs submit spark \ --cluster= \ --region= \ --class=org.apache.spark.deploy.dotnet.DotnetRunner \ …
dr11 · 5,166
7 votes · 3 answers

Submit a Python project as a Dataproc job

I have a Python project whose folder has the structure: main_directory - lib - lib.py - run - script.py. script.py starts with: from lib.lib import add_two spark = SparkSession \ .builder \ .master('yarn') \ .appName('script') \ …
Galuoises · 2,630
7 votes · 2 answers

How to save a spark DataFrame back into a Google BigQuery project using pyspark?

I am loading a dataset from BigQuery and after some transformations, I'd like to save the transformed DataFrame back into BigQuery. Is there a way of doing this? This is how I am loading the data: df = spark.read \ .format('bigquery') \ …
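With the spark-bigquery-connector, writing works symmetrically to reading, except that the connector stages output through a GCS bucket you must supply. A sketch — table and bucket names are placeholders, and `df` stands for the transformed DataFrame from the question:

```python
# Sketch: write a transformed DataFrame back to BigQuery
# (placeholder table and staging bucket).
(df.write
   .format("bigquery")
   .option("table", "my-project.my_dataset.out_table")
   .option("temporaryGcsBucket", "my-staging-bucket")
   .mode("overwrite")
   .save())
```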
7 votes · 1 answer

GCP Dataproc custom image Python environment

I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9, and with my initialisation script I install python3 and some packages from a requirements.txt file, then set the python3 env…
7 votes · 3 answers

Error while running PySpark Dataproc job due to Python version

I create a Dataproc cluster using the following command: gcloud dataproc clusters create datascience \ --initialization-actions \ gs://dataproc-initialization-actions/jupyter/jupyter.sh \ However, when I submit my PySpark job I get the following…
Kassem Shehady · 760
7 votes · 2 answers

How can I install a custom version of Apache Spark on Cloud Dataproc

For one reason or another, I want to install a version of Apache Spark different from the one available on Google Cloud Dataproc. How can I install a custom version of Spark but also maintain compatibility with the Cloud Dataproc tooling?
James · 2,321
7 votes · 5 answers

GCP: You do not have sufficient permissions to SSH into this instance

I have a (non-admin) account on one GCP project. When I start the Dataproc cluster, GCP spins up 3 VMs. When I try to access one of the VMs via SSH (in the browser) I get the following error: I tried to add the recommended permissions, but I cannot add the…
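If the project uses OS Login, in-browser SSH typically requires `roles/compute.osLogin` on the account (plus network access to the VM). A project admin would grant it roughly like this — project ID and account are placeholders:

```shell
# Grant the OS Login role needed for SSH into Compute Engine VMs.
gcloud projects add-iam-policy-binding my-project \
  --member=user:me@example.com \
  --role=roles/compute.osLogin
```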
7 votes · 1 answer

YARN applications cannot start when specifying YARN node labels

I'm trying to use YARN node labels to tag worker nodes, but when I run applications on YARN (Spark or a simple YARN app), those applications cannot start. With Spark, when specifying --conf spark.yarn.am.nodeLabelExpression="my-label", the job cannot…
norbjd · 10,166
7 votes · 1 answer

Passing multiple system properties to google dataproc cluster job

I am trying to submit a Spark job on a Dataproc cluster. The job needs multiple system properties, but I am able to pass just one, as follows: gcloud dataproc jobs submit spark \ --cluster \ --class…
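Multiple `-D` options can ride in a single `spark.driver.extraJavaOptions` value; the `^#^` prefix (see `gcloud topic escaping`) switches gcloud's list delimiter so the embedded spaces survive. A sketch with placeholder cluster, class, and property names:

```shell
# Pass two JVM system properties to the driver in one property value.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.Main \
  --properties='^#^spark.driver.extraJavaOptions=-Dprop1=a -Dprop2=b'
```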
7 votes · 0 answers

Facing an error when creating a Dataproc cluster on Google

When I try to create the cluster with 1 master and 2 data nodes, I get the error below: Cannot start master: Insufficient number of DataNodes reporting. Worker test-sparkjob-w-0 unable to register with master test-sparkjob-m. This could be…
Skumar · 71
7 votes · 4 answers

Spark UI appears with wrong format (broken CSS)

I am using Apache Spark for the first time. I run my application, and when I access localhost:4040 I get what is shown in the picture. I found that setting spark.ui.enabled to true might help, but I don't know how to do that. Thanks in…
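For what it's worth, `spark.ui.enabled` already defaults to `true`, so broken styling usually points to a proxy or browser issue rather than a missing setting; still, UI options can be set explicitly when building the session. A sketch, assuming a local PySpark install:

```python
# Sketch: set Spark UI options explicitly (both values shown are
# the defaults; spark.ui.enabled is already true out of the box).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ui-demo")
         .config("spark.ui.enabled", "true")
         .config("spark.ui.port", "4040")
         .getOrCreate())
print(spark.sparkContext.uiWebUrl)
```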