Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access for deploying and managing clusters and for submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1563 questions
5 votes · 1 answer

Aggregating large Datasets in Spark SQL

Consider the following code: case class Person( personId: Long, name: String, ageGroup: String, gender: String, relationshipStatus: String, country: String, state: String ) case class PerPersonPower(personId: Long, power: Double) val people:…
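
A rough PySpark sketch of a join-and-aggregate over datasets shaped like the ones in the excerpt (the question itself uses Scala case classes; the input paths and the grouping columns chosen here are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aggregate-person-power").getOrCreate()

    # Hypothetical input locations for the two datasets from the excerpt.
    people = spark.read.parquet("gs://my-bucket/people")
    powers = spark.read.parquet("gs://my-bucket/per_person_power")

    # Join on personId, then aggregate power per demographic group.
    result = (
        people.join(powers, "personId")
        .groupBy("ageGroup", "gender", "country")
        .agg(F.sum("power").alias("totalPower"),
             F.avg("power").alias("avgPower"))
    )

    result.write.mode("overwrite").parquet("gs://my-bucket/aggregated")
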
5 votes · 4 answers

How to stop or shut down a Google Dataproc cluster?

The Dataproc clusters I created always show status as "running" on web portal. Is there a way to stop/deprovision a cluster when it is not in use so that it does not burn resources and $$?
asked by sermolin (161)
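
A running cluster keeps billing until it is deleted; below is a minimal sketch of tearing one down with the google-cloud-dataproc Python client (project, region, and cluster names are placeholders, and the flattened call shape assumes the v1 client library). Later Dataproc releases also added gcloud dataproc clusters stop for pausing a cluster without deleting it.

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # CLI equivalent: gcloud dataproc clusters delete my-cluster --region=us-central1
    operation = client.delete_cluster(
        project_id="my-project", region=region, cluster_name="my-cluster"
    )
    operation.result()  # blocks until the cluster is gone
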
5 votes · 3 answers

Hadoop security GroupMappingServiceProvider exception for Spark job via Dataproc API

I am trying to run a Spark job on a google dataproc cluster, but get the following error: Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMapping not…
asked by MRR (397)
5 votes · 2 answers

Passing parameters into dataproc pyspark job

How do you pass parameters into the python script being called in a dataproc pyspark job submit? Here is a cmd I've been mucking with: gcloud dataproc jobs submit pyspark --cluster my-dataproc \ file:///usr/test-pyspark.py \ …
asked by Melissa (75)
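
A minimal sketch of the usual pattern, assuming the argument names shown: job arguments go after a bare -- on the gcloud command line and arrive in sys.argv inside the script:

    # Submitted with something like:
    #   gcloud dataproc jobs submit pyspark file:///usr/test-pyspark.py \
    #       --cluster=my-dataproc -- gs://my-bucket/input 2017-05-24
    import sys
    from pyspark.sql import SparkSession

    # Everything after the bare "--" on the gcloud command line lands in sys.argv.
    input_path, run_date = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("param-demo").getOrCreate()
    df = spark.read.json(input_path)
    print(f"Loaded {df.count()} rows for {run_date}")
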
5 votes · 1 answer

How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?

I am following the instructions for starting a Google DataProc cluster with an initialization script to start a jupyter…
asked by seandavi (2,818)
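
One common approach is to have Spark fetch the jars itself through spark.jars.packages, set either as a cluster property at creation time or from the notebook before the first Spark action; the Maven coordinates below are only an example:

    # At cluster creation the property can be set once for every session, e.g.:
    #   gcloud dataproc clusters create my-cluster \
    #       --properties=spark:spark.jars.packages=org.apache.spark:spark-avro_2.12:3.1.2
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Example Maven coordinates only; replace with the jars you actually need.
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
        # Jars already uploaded to GCS can be listed directly instead:
        # .config("spark.jars", "gs://my-bucket/libs/my-lib.jar")
        .getOrCreate()
    )

Note that the config only takes effect if it is set before the notebook kernel creates its Spark session.
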
5 votes · 2 answers

How to write a file using FileWriter to google dataproc?

I have a java spark application where the output from the spark job needs to be collected and then saved into a csv file. This is my code below: fileWriter = new FileWriter("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv",…
asked by Vishnu P N (415)
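
java.io.FileWriter only understands local paths, so a gs:// URI cannot be opened with it; the usual alternative is to write through Spark (or the Hadoop FileSystem API), since Dataproc ships the GCS connector. A minimal PySpark sketch, with a made-up DataFrame standing in for the collected results:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-csv-to-gcs").getOrCreate()

    # Stand-in for the job's collected output.
    results = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

    # The GCS connector on Dataproc resolves gs:// paths, so writing through
    # Spark works where a plain java.io.FileWriter cannot.
    (results.coalesce(1)
        .write.mode("overwrite")
        .option("header", True)
        .csv("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder"))
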
5 votes · 1 answer

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

I am working on a scala Spark job which needs to use java library (youtube/vitess) which is dependent upon newer versions of GRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than currently provided on the DataProc 1.1 image. When running the project…
5 votes · 1 answer

How to read and write data in Google Cloud Bigtable in PySpark application?

I am using Spark on a Google Cloud Dataproc cluster and I would like to access Bigtable in a PySpark job. Do we have any Bigtable connector for Spark like Google BigQuery connector? How can we access Bigtable from a PySpark application?
5 votes · 1 answer

How can I connect to a Google Dataproc cluster from Sparklyr?

I'm new to Spark and GCP. I've tried to connect to it with sc <- spark_connect(master = "IP address") but it obviously couldn't work (e.g. there is no authentication). How should I do that? Is it possible to connect to it from outside Google Cloud?
5 votes · 3 answers

Dataproc: configure Spark driver and executor log4j properties

As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is changing the corresponding log4j.properties. However, on dataproc Spark runs on Yarn, therefore we have to adjust the global configuration and not…
asked by Frank (406)
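
Two lighter-weight options often suffice and are sketched below: pass --driver-log-levels at submit time, or lower the level from inside the application; the WARN threshold is just an example:

    # At submit time (per job, no cluster-wide log4j.properties change needed):
    #   gcloud dataproc jobs submit pyspark my_job.py --cluster=my-cluster \
    #       --driver-log-levels root=WARN
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quiet-logs").getOrCreate()
    # Applies to this application's JVM-side loggers only.
    spark.sparkContext.setLogLevel("WARN")
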
5 votes · 1 answer

How do you use the Google DataProc Java Client to submit spark jobs using jar files and classes in associated GS bucket?

I need to trigger Spark jobs to aggregate data from a JSON file using an API call. I use spring-boot to create the resources. Thus, the steps for the solution are the following: the user makes a POST request with a JSON file as the input; the JSON file…
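
The question concerns the Java client; for orientation, here is a sketch of the same submission shape using the Python client library (bucket, jar, and cluster names are placeholders, and the request layout assumes the v1 SparkJob message):

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "spark_job": {
            # Jar previously uploaded to the GCS bucket associated with the cluster.
            "main_jar_file_uri": "gs://my-bucket/jars/aggregator.jar",
            "args": ["gs://my-bucket/input/data.json"],
        },
    }

    operation = job_client.submit_job_as_operation(
        project_id="my-project", region=region, job=job
    )
    response = operation.result()  # wait for the job to finish
    print(response.status.state)
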
4 votes · 1 answer

Google Cloud Dataproc serverless (batch) pyspark reads parquet file from Google Cloud Storage (GCS) very slowly

I have an inverse frequency parquet file of the wiki corpus on Google Cloud Storage (GCS). I want to load it from GCS to dataproc serverless (batch). However, the time to load the parquet with pyspark.read on dataproc batch is much slower than my…
4 votes · 2 answers

Interval 30 days getting converted to Interval 4 weeks 2 days

I am getting an error in PySpark code when I use the below query in a spark.sql call, although it works when I run it in BigQuery directly. df = spark.sql('''SELECT h.src, AVG(h.norm_los) AS mbr_avg FROM UM h WHERE h.orig_dt <…
4 votes · 1 answer

How to force delete dataproc serverless batch

I am running a pyspark dataproc serverless batch. It has been running for too long so I decided to delete it. But neither the GCP console nor the CLI allow me to delete the batch. The command I tried is gcloud dataproc batches delete
asked by Afaq (1,146)
4 votes · 1 answer

SparkJob on GCP dataproc failing with error - java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.&lt;init&gt;(ZIIIIIIZ)V

I'm running a spark job on GCP dataproc using below command, gcloud dataproc workflow-templates instantiate-from-file --file=job_config.yaml --region us-east1 Below is my job_config.yaml jobs: - sparkJob: args: - filepath mainJarFileUri:…