Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1563 questions
7 votes · 2 answers

How to connect with JMX remotely to Spark worker on Dataproc

I can connect to the driver just fine by adding the following: spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \ -Dcom.sun.management.jmxremote.port=9178 \ …
habitats
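For the worker side, the analogous setting is spark.executor.extraJavaOptions rather than the driver option shown above. A minimal sketch, assuming a fixed port would collide when several executors share a node, so an ephemeral port (0) is used; all values are illustrative:

```python
from pyspark.sql import SparkSession

# Sketch: enable JMX on executors, analogous to the driver options in the
# question. Port 0 lets each executor JVM pick a free port, which avoids
# collisions when several executors run on the same worker node.
jmx_opts = (
    "-Dcom.sun.management.jmxremote "
    "-Dcom.sun.management.jmxremote.port=0 "
    "-Dcom.sun.management.jmxremote.rmi.port=0 "
    "-Dcom.sun.management.jmxremote.authenticate=false "
    "-Dcom.sun.management.jmxremote.ssl=false"
)

spark = (
    SparkSession.builder
    .appName("executor-jmx-example")
    .config("spark.executor.extraJavaOptions", jmx_opts)
    .getOrCreate()
)
```
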
7 votes · 6 answers

Automatically shut down a Google Dataproc cluster after all jobs are completed

How can I programmatically shut down a Google Dataproc cluster automatically after all jobs have completed? Dataproc provides cluster creation, monitoring and management, but I can't seem to find out how to delete the cluster automatically once everything is done.
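Dataproc's scheduled-deletion settings (the idle_delete_ttl lifecycle field, exposed as --max-idle in gcloud) are the usual way to get this behaviour. A minimal sketch with the google-cloud-dataproc Python client; project, region and cluster names are placeholders:

```python
from google.cloud import dataproc_v1

# Sketch: create a cluster that deletes itself after 30 minutes of inactivity
# (the API equivalent of `gcloud dataproc clusters create ... --max-idle=30m`).
# Project, region and cluster names are placeholders.
project_id, region, cluster_name = "my-project", "us-central1", "my-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 1800}},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is up; it deletes itself after idling
```
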
7 votes · 1 answer

How to set the partition for a Window function in PySpark?

I'm running a PySpark job, and I'm getting the following message: WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. What…
cshin9
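The warning means the window specification has no partitioning, so every row is shuffled to a single partition before the function runs. A minimal sketch, using illustrative column names, of partitioning the window so each group is processed independently:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)], ["group_col", "value"]  # illustrative data
)

# Partitioning the window removes "No Partition Defined for Window operation!":
# rows are ranked within each group instead of being moved to one partition.
w = Window.partitionBy("group_col").orderBy("value")
df.withColumn("row_num", F.row_number().over(w)).show()
```
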
7 votes · 1 answer

Connecting an IPython notebook to a Spark master running on a different machine

I don't know if this has already been answered on SO, but I couldn't find a solution to my problem. I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the jupyter/all-spark-notebook image. I have…
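For a standalone Spark master (as in the jupyter/all-spark-notebook setup) the notebook only needs a SparkSession pointed at the master URL, plus two-way network reachability; note that a Dataproc cluster itself runs Spark on YARN, so this sketch assumes a standalone master and uses placeholder hostnames:

```python
from pyspark.sql import SparkSession

# Sketch: point the notebook's SparkSession at a remote standalone master.
# "spark-master-host" and "notebook-host" are placeholders; the notebook
# container must reach the master on 7077 and be reachable back from executors.
spark = (
    SparkSession.builder
    .master("spark://spark-master-host:7077")
    .appName("notebook-remote-master")
    .config("spark.driver.host", "notebook-host")
    .getOrCreate()
)
```
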
6 votes · 0 answers

GCS Connector Hadoop3 hadoop3-2.2.8 - Slow read, exists, rename and create operations

In my Java application I have an implementation of a file-system layer, where my file class is a wrapper for Hadoop filesystem methods. I am upgrading from hadoop3-1.9.17 to hadoop3-2.2.8 and I am using the shaded jar of the new version. My…
6 votes · 1 answer

Spark-submit options for gcs-connector to access google storage

I am running a Spark job on a self-managed cluster (similar to a local environment) while accessing buckets on Google Cloud Storage. ❯ spark-submit --version shows the Spark welcome banner…
uchiiii
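Outside Dataproc the gs:// filesystem is not preinstalled, so the connector jar and its Hadoop settings usually have to be supplied explicitly. A hedged sketch; the package version and key file path are assumptions, and the same values can be passed to spark-submit as --packages and --conf flags instead:

```python
from pyspark.sql import SparkSession

# Sketch: supply the GCS connector and its settings on a self-managed cluster.
# The package version, key file path and bucket name are placeholders.
spark = (
    SparkSession.builder
    .appName("gcs-connector-example")
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")
    .getOrCreate()
)

df = spark.read.text("gs://my-bucket/some/path")  # placeholder bucket
```
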
6 votes · 1 answer

delta lake - Insert into sql in pyspark is failing with java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias

The Dataproc cluster is created with image 2.0.x and the Delta Lake package io.delta:delta-core_2.12:0.7.0; the Spark version is 3.1.1. The Spark shell is started with: pyspark --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \ --conf…
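The NoSuchMethodError usually indicates a Spark/Delta version mismatch: delta-core 0.7.0 targets Spark 3.0, while Spark 3.1.1 (Dataproc image 2.0) needs a Delta 1.0.x release. A sketch of a matching setup; the package version shown is an assumption to verify against the Delta compatibility matrix:

```python
from pyspark.sql import SparkSession

# Sketch: pair Spark 3.1.x with a Delta build made for it (delta-core 1.0.x);
# delta-core 0.7.0 was built against Spark 3.0 and fails on 3.1 with
# NoSuchMethodError. The package version below is an assumption to verify.
spark = (
    SparkSession.builder
    .appName("delta-example")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT) USING DELTA")
spark.sql("INSERT INTO events VALUES (1)")
```
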
6 votes · 1 answer

How does Spark (2.3 or newer) determine the number of tasks to read Hive table files in a GCS bucket or HDFS?

Input data: a Hive table (T) with 35 files (~1.5 GB each, SequenceFile); the files are in a GCS bucket; the default fs.gs.block.size is ~128 MB; all other parameters are defaults. Experiment 1: create a Dataproc cluster with 2 workers (4 cores per worker) and run select…
dykw
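For a splittable format like SequenceFile, the read tasks roughly follow sum(ceil(file_size / split_size)) with the split size defaulting to the filesystem block size, i.e. about ceil(1.5 GB / 128 MB) = 12 splits per file and 35 × 12 ≈ 420 tasks in this setup. A small sketch to confirm the resulting partition count; the table name comes from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Roughly: splits = sum(ceil(file_size / split_size)) with split_size defaulting
# to the filesystem block size, so ceil(1.5 GB / 128 MB) = 12 splits per file
# and 35 files * 12 ≈ 420 read tasks for this table.
df = spark.table("T")  # table name taken from the question
print(df.rdd.getNumPartitions())
```
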
6 votes · 1 answer

How to debug a Spark job on Dataproc?

I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?
user9734434
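One common approach is to run the same code with a local[*] master so the IDE debugger attaches to an ordinary local process; cluster-only behaviour still needs the Spark UI or remote debugging. A minimal sketch:

```python
from pyspark.sql import SparkSession

# Sketch: switch the master to local[*] so the whole job runs in one local
# process and breakpoints in driver-side code work in the IDE.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("debug-locally")
    .getOrCreate()
)

df = spark.range(10)
print(df.selectExpr("sum(id) AS total").collect())
```
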
6 votes · 2 answers

NoClassDefFoundError: org/apache/spark/sql/internal/connector/SimpleTableProvider when running in Dataproc

I am able to run my program in standalone mode, but when I try to run it on Dataproc in cluster mode I get the following error. Please help. My build.sbt: name := "spark-kafka-streaming" version := "0.1" scalaVersion := "2.12.10" …
Amit Joshi
6 votes · 2 answers

How can I inspect per executor/node memory usage metrics of a pyspark job on Dataproc?

I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as: ...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ...…
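Besides the Dataproc monitoring graphs, Spark's own REST API exposes per-executor memory figures while the application runs. A sketch, assuming the UI is reachable on the driver's default port 4040 (on Dataproc it is typically accessed through the YARN proxy or Component Gateway); the exact fields returned depend on the Spark version:

```python
import requests
from pyspark.sql import SparkSession

# Sketch: read per-executor memory figures from the Spark monitoring REST API
# while the job runs. Host/port and field names depend on the Spark version
# and on how the UI is exposed.
spark = SparkSession.builder.getOrCreate()
app_id = spark.sparkContext.applicationId

resp = requests.get(f"http://localhost:4040/api/v1/applications/{app_id}/executors")
for executor in resp.json():
    print(executor["id"], executor.get("memoryUsed"), executor.get("maxMemory"))
```
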
6 votes · 2 answers

Component Gateway with DataprocOperator on Airflow

In GCP it is fairly simple to install and run a JupyterHub component from the UI or the gcloud command. I'm trying to script the process through Airflow and the DataprocClusterCreateOperator; here is an extract of the DAG from…
kwn
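With the newer google provider operator, Component Gateway can be enabled through the cluster_config dict rather than operator keyword arguments. A hedged sketch; project, region and cluster names are placeholders and the exact import path depends on the installed provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

# Field names follow the Dataproc ClusterConfig API; enable_http_port_access
# turns on Component Gateway, and JUPYTER is added as an optional component.
CLUSTER_CONFIG = {
    "software_config": {
        "image_version": "2.0",
        "optional_components": ["JUPYTER"],
    },
    "endpoint_config": {"enable_http_port_access": True},
}

with DAG(
    "dataproc_component_gateway",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id="my-project",      # placeholder
        region="us-central1",         # placeholder
        cluster_name="my-cluster",    # placeholder
        cluster_config=CLUSTER_CONFIG,
    )
```
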
6 votes · 2 answers

Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to…
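A common alternative to cron on the master node is to trigger the job from a scheduled environment (Cloud Scheduler plus a small function or VM) using the Dataproc API. A minimal sketch with the Python client; all project, cluster and GCS paths are placeholders:

```python
from google.cloud import dataproc_v1

# Sketch: submit the PySpark job through the Dataproc API from wherever the
# schedule runs. Project, region, cluster and GCS paths are placeholders.
project_id, region = "my-project", "us-central1"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/point_in_polygon.py"},
}

result = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()
print(result.status.state)
```
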
6 votes · 1 answer

GCP Dataproc has Druid available in alpha. How to load segments?

The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some…
radialmind
6 votes · 1 answer

Cross account GCS access using Spark on Dataproc

I am trying to ingest data from GCS in account A into BigQuery in account B using Spark running on Dataproc in account B. I have tried setting GOOGLE_APPLICATION_CREDENTIALS to a service account key file which allows access to the necessary bucket in account…
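One hedged option is to point the GCS connector itself at account A's service-account key through Spark's Hadoop configuration, so executors authenticate with it for gs:// reads while the cluster's own service account keeps handling everything else. The key path and bucket below are placeholders, and the key file must exist on every node:

```python
from pyspark.sql import SparkSession

# Sketch: have the GCS connector authenticate as account A's service account
# for gs:// access. The key file path and bucket are placeholders; the key
# must be distributed to all nodes (for example via an initialization action).
spark = (
    SparkSession.builder
    .appName("cross-account-gcs")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/etc/keys/account-a.json")
    .getOrCreate()
)

df = spark.read.parquet("gs://account-a-bucket/data/")
```
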