Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1563 questions
7 votes · 2 answers

How to connect with JMX remotely to Spark worker on Dataproc

I can connect to the driver just fine by adding the following: spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote \ -Dcom.sun.management.jmxremote.port=9178 \ …
habitats
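For the worker side, the analogous setting is spark.executor.extraJavaOptions rather than the driver option shown above. A minimal sketch, assuming a fixed port would collide when several executors share a node, so an ephemeral port (0) is used; all values are illustrative:

```python
from pyspark.sql import SparkSession

# Sketch: enable JMX on executors, analogous to the driver options in the
# question. Port 0 lets each executor JVM pick a free port, which avoids
# collisions when several executors run on the same worker node.
jmx_opts = (
    "-Dcom.sun.management.jmxremote "
    "-Dcom.sun.management.jmxremote.port=0 "
    "-Dcom.sun.management.jmxremote.rmi.port=0 "
    "-Dcom.sun.management.jmxremote.authenticate=false "
    "-Dcom.sun.management.jmxremote.ssl=false"
)

spark = (
    SparkSession.builder
    .appName("executor-jmx-example")
    .config("spark.executor.extraJavaOptions", jmx_opts)
    .getOrCreate()
)
```
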
7 votes · 6 answers

Automatically shut down a Google Dataproc cluster after all jobs are completed

How can I programmatically shut down a Google Dataproc cluster automatically after all jobs have completed? Dataproc provides cluster creation, monitoring and management, but I can't seem to find out how to delete the cluster automatically once everything is done.
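Dataproc's scheduled-deletion settings (the idle_delete_ttl lifecycle field, exposed as --max-idle in gcloud) are the usual way to get this behaviour. A minimal sketch with the google-cloud-dataproc Python client; project, region and cluster names are placeholders:

```python
from google.cloud import dataproc_v1

# Sketch: create a cluster that deletes itself after 30 minutes of inactivity
# (the API equivalent of `gcloud dataproc clusters create ... --max-idle=30m`).
# Project, region and cluster names are placeholders.
project_id, region, cluster_name = "my-project", "us-central1", "my-cluster"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 1800}},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is up; it deletes itself after idling
```
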
7 votes · 1 answer

How to set the partition for a Window function in PySpark?

I'm running a PySpark job, and I'm getting the following message: WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. What…
cshin9
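The warning means the window specification has no partitioning, so every row is shuffled to a single partition before the function runs. A minimal sketch, using illustrative column names, of partitioning the window so each group is processed independently:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)], ["group_col", "value"]  # illustrative data
)

# Partitioning the window removes "No Partition Defined for Window operation!":
# rows are ranked within each group instead of being moved to one partition.
w = Window.partitionBy("group_col").orderBy("value")
df.withColumn("row_num", F.row_number().over(w)).show()
```
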
7 votes · 1 answer

Connecting an IPython notebook to a Spark master running on a different machine

I don't know if this has already been answered on SO, but I couldn't find a solution to my problem. I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the jupyter/all-spark-notebook image. I have…
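For a standalone Spark master (as in the jupyter/all-spark-notebook setup) the notebook only needs a SparkSession pointed at the master URL, plus two-way network reachability; note that a Dataproc cluster itself runs Spark on YARN, so this sketch assumes a standalone master and uses placeholder hostnames:

```python
from pyspark.sql import SparkSession

# Sketch: point the notebook's SparkSession at a remote standalone master.
# "spark-master-host" and "notebook-host" are placeholders; the notebook
# container must reach the master on 7077 and be reachable back from executors.
spark = (
    SparkSession.builder
    .master("spark://spark-master-host:7077")
    .appName("notebook-remote-master")
    .config("spark.driver.host", "notebook-host")
    .getOrCreate()
)
```
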
6 votes · 0 answers

GCS Connector Hadoop3 hadoop3-2.2.8 - Slow read, exists, rename and create operations

In my Java application I have an implementation of a file-system layer, where my file class is a wrapper for Hadoop filesystem methods. I am upgrading from hadoop3-1.9.17 to hadoop3-2.2.8 and I am using the shaded jar of the new version. My…
6 votes · 1 answer

Spark-submit options for gcs-connector to access google storage

I am running a Spark job on a self-managed cluster (similar to a local environment) while accessing buckets on Google Cloud Storage. ❯ spark-submit --version shows the Spark welcome banner…
uchiiii
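Outside Dataproc the gs:// filesystem is not preinstalled, so the connector jar and its Hadoop settings usually have to be supplied explicitly. A hedged sketch; the package version and key file path are assumptions, and the same values can be passed to spark-submit as --packages and --conf flags instead:

```python
from pyspark.sql import SparkSession

# Sketch: supply the GCS connector and its settings on a self-managed cluster.
# The package version, key file path and bucket name are placeholders.
spark = (
    SparkSession.builder
    .appName("gcs-connector-example")
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.8")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/key.json")
    .getOrCreate()
)

df = spark.read.text("gs://my-bucket/some/path")  # placeholder bucket
```
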
6 votes · 1 answer

delta lake - Insert into sql in pyspark is failing with java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias

The Dataproc cluster is created with image 2.0.x and the Delta Lake package io.delta:delta-core_2.12:0.7.0; the Spark version is 3.1.1. The Spark shell is started with: pyspark --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \ --conf…
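The NoSuchMethodError usually indicates a Spark/Delta version mismatch: delta-core 0.7.0 targets Spark 3.0, while Spark 3.1.1 (Dataproc image 2.0) needs a Delta 1.0.x release. A sketch of a matching setup; the package version shown is an assumption to verify against the Delta compatibility matrix:

```python
from pyspark.sql import SparkSession

# Sketch: pair Spark 3.1.x with a Delta build made for it (delta-core 1.0.x);
# delta-core 0.7.0 was built against Spark 3.0 and fails on 3.1 with
# NoSuchMethodError. The package version below is an assumption to verify.
spark = (
    SparkSession.builder
    .appName("delta-example")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT) USING DELTA")
spark.sql("INSERT INTO events VALUES (1)")
```
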
6 votes · 1 answer

How does Spark (2.3 or newer) determine the number of tasks to read Hive table files in a GCS bucket or HDFS?

Input data: a Hive table (T) with 35 files (~1.5 GB each, SequenceFile); the files are in a GCS bucket; the default fs.gs.block.size is ~128 MB; all other parameters are defaults. Experiment 1: create a Dataproc cluster with 2 workers (4 cores per worker) and run select…
dykw
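For a splittable format like SequenceFile, the read tasks roughly follow sum(ceil(file_size / split_size)) with the split size defaulting to the filesystem block size, i.e. about ceil(1.5 GB / 128 MB) = 12 splits per file and 35 × 12 ≈ 420 tasks in this setup. A small sketch to confirm the resulting partition count; the table name comes from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Roughly: splits = sum(ceil(file_size / split_size)) with split_size defaulting
# to the filesystem block size, so ceil(1.5 GB / 128 MB) = 12 splits per file
# and 35 files * 12 ≈ 420 read tasks for this table.
df = spark.table("T")  # table name taken from the question
print(df.rdd.getNumPartitions())
```
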
6 votes · 1 answer

How to debug a Spark job on Dataproc?

I have a Spark job running on a Dataproc cluster. How do I configure the environment to debug it on my local machine with my IDE?
user9734434
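One common approach is to run the same code with a local[*] master so the IDE debugger attaches to an ordinary local process; cluster-only behaviour still needs the Spark UI or remote debugging. A minimal sketch:

```python
from pyspark.sql import SparkSession

# Sketch: switch the master to local[*] so the whole job runs in one local
# process and breakpoints in driver-side code work in the IDE.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("debug-locally")
    .getOrCreate()
)

df = spark.range(10)
print(df.selectExpr("sum(id) AS total").collect())
```
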
6 votes · 2 answers

NoClassDefFoundError: org/apache/spark/sql/internal/connector/SimpleTableProvider when running in Dataproc

I am able to run my program in standalone mode, but when I try to run it on Dataproc in cluster mode I get the following error. Please help. My build.sbt: name := "spark-kafka-streaming" version := "0.1" scalaVersion := "2.12.10" …
Amit Joshi
6 votes · 2 answers

How can I inspect per executor/node memory usage metrics of a pyspark job on Dataproc?

I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as: ...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ...…
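Besides the Dataproc monitoring graphs, Spark's own REST API exposes per-executor memory figures while the application runs. A sketch, assuming the UI is reachable on the driver's default port 4040 (on Dataproc it is typically accessed through the YARN proxy or Component Gateway); the exact fields returned depend on the Spark version:

```python
import requests
from pyspark.sql import SparkSession

# Sketch: read per-executor memory figures from the Spark monitoring REST API
# while the job runs. Host/port and field names depend on the Spark version
# and on how the UI is exposed.
spark = SparkSession.builder.getOrCreate()
app_id = spark.sparkContext.applicationId

resp = requests.get(f"http://localhost:4040/api/v1/applications/{app_id}/executors")
for executor in resp.json():
    print(executor["id"], executor.get("memoryUsed"), executor.get("maxMemory"))
```
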
6 votes · 2 answers

Component Gateway with DataprocOperator on Airflow

In GCP it is fairly simple to install and run a JupyterHub component from the UI or the gcloud command. I'm trying to script the process through Airflow and the DataprocClusterCreateOperator; here is an extract of the DAG from…
kwn
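With the newer google provider operator, Component Gateway can be enabled through the cluster_config dict rather than operator keyword arguments. A hedged sketch; project, region and cluster names are placeholders and the exact import path depends on the installed provider version:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

# Field names follow the Dataproc ClusterConfig API; enable_http_port_access
# turns on Component Gateway, and JUPYTER is added as an optional component.
CLUSTER_CONFIG = {
    "software_config": {
        "image_version": "2.0",
        "optional_components": ["JUPYTER"],
    },
    "endpoint_config": {"enable_http_port_access": True},
}

with DAG(
    "dataproc_component_gateway",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        project_id="my-project",      # placeholder
        region="us-central1",         # placeholder
        cluster_name="my-cluster",    # placeholder
        cluster_config=CLUSTER_CONFIG,
    )
```
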
6 votes · 2 answers

Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to…
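A common alternative to cron on the master node is to trigger the job from a scheduled environment (Cloud Scheduler plus a small function or VM) using the Dataproc API. A minimal sketch with the Python client; all project, cluster and GCS paths are placeholders:

```python
from google.cloud import dataproc_v1

# Sketch: submit the PySpark job through the Dataproc API from wherever the
# schedule runs. Project, region, cluster and GCS paths are placeholders.
project_id, region = "my-project", "us-central1"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/point_in_polygon.py"},
}

result = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()
print(result.status.state)
```
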
6 votes · 1 answer

GCP Dataproc has Druid available in alpha. How to load segments?

The Dataproc page describing Druid support has no section on how to load data into the cluster. I've been trying to do this using Google Cloud Storage, but don't know how to set up a spec for it that works. I'd expect the "firehose" section to have some…
radialmind
6 votes · 1 answer

Cross account GCS access using Spark on Dataproc

I am trying to ingest data from GCS in account A into BigQuery in account B using Spark running on Dataproc in account B. I have tried setting GOOGLE_APPLICATION_CREDENTIALS to a service account key file which allows access to the necessary bucket in account…
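One hedged option is to point the GCS connector itself at account A's service-account key through Spark's Hadoop configuration, so executors authenticate with it for gs:// reads while the cluster's own service account keeps handling everything else. The key path and bucket below are placeholders, and the key file must exist on every node:

```python
from pyspark.sql import SparkSession

# Sketch: have the GCS connector authenticate as account A's service account
# for gs:// access. The key file path and bucket are placeholders; the key
# must be distributed to all nodes (for example via an initialization action).
spark = (
    SparkSession.builder
    .appName("cross-account-gcs")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/etc/keys/account-a.json")
    .getOrCreate()
)

df = spark.read.parquet("gs://account-a-bucket/data/")
```
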