Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access for deploying and managing clusters and for submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1563 questions
5 votes · 1 answer

Aggregating large Datasets in Spark SQL

Consider the following code: case class Person( personId: Long, name: String, ageGroup: String, gender: String, relationshipStatus: String, country: String, state: String ) case class PerPersonPower(personId: Long, power: Double) val people:…
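
A rough PySpark sketch of a join-and-aggregate over datasets shaped like the ones in the excerpt (the question itself uses Scala case classes; the input paths and the grouping columns chosen here are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aggregate-person-power").getOrCreate()

    # Hypothetical input locations for the two datasets from the excerpt.
    people = spark.read.parquet("gs://my-bucket/people")
    powers = spark.read.parquet("gs://my-bucket/per_person_power")

    # Join on personId, then aggregate power per demographic group.
    result = (
        people.join(powers, "personId")
        .groupBy("ageGroup", "gender", "country")
        .agg(F.sum("power").alias("totalPower"),
             F.avg("power").alias("avgPower"))
    )

    result.write.mode("overwrite").parquet("gs://my-bucket/aggregated")
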
5 votes · 4 answers

How to stop or shut down a Google Dataproc cluster?

The Dataproc clusters I created always show status as "running" on web portal. Is there a way to stop/deprovision a cluster when it is not in use so that it does not burn resources and $$?
asked by sermolin (161)
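
A running cluster keeps billing until it is deleted; below is a minimal sketch of tearing one down with the google-cloud-dataproc Python client (project, region, and cluster names are placeholders, and the flattened call shape assumes the v1 client library). Later Dataproc releases also added gcloud dataproc clusters stop for pausing a cluster without deleting it.

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # CLI equivalent: gcloud dataproc clusters delete my-cluster --region=us-central1
    operation = client.delete_cluster(
        project_id="my-project", region=region, cluster_name="my-cluster"
    )
    operation.result()  # blocks until the cluster is gone
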
5 votes · 3 answers

Hadoop security GroupMappingServiceProvider exception for Spark job via Dataproc API

I am trying to run a Spark job on a google dataproc cluster, but get the following error: Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMapping not…
asked by MRR (397)
5 votes · 2 answers

Passing parameters into dataproc pyspark job

How do you pass parameters into the python script being called in a dataproc pyspark job submit? Here is a cmd I've been mucking with: gcloud dataproc jobs submit pyspark --cluster my-dataproc \ file:///usr/test-pyspark.py \ …
asked by Melissa (75)
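
A minimal sketch of the usual pattern, assuming the argument names shown: job arguments go after a bare -- on the gcloud command line and arrive in sys.argv inside the script:

    # Submitted with something like:
    #   gcloud dataproc jobs submit pyspark file:///usr/test-pyspark.py \
    #       --cluster=my-dataproc -- gs://my-bucket/input 2017-05-24
    import sys
    from pyspark.sql import SparkSession

    # Everything after the bare "--" on the gcloud command line lands in sys.argv.
    input_path, run_date = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("param-demo").getOrCreate()
    df = spark.read.json(input_path)
    print(f"Loaded {df.count()} rows for {run_date}")
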
5 votes · 1 answer

How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?

I am following the instructions for starting a Google DataProc cluster with an initialization script to start a jupyter…
asked by seandavi (2,818)
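
One common approach is to have Spark fetch the jars itself through spark.jars.packages, set either as a cluster property at creation time or from the notebook before the first Spark action; the Maven coordinates below are only an example:

    # At cluster creation the property can be set once for every session, e.g.:
    #   gcloud dataproc clusters create my-cluster \
    #       --properties=spark:spark.jars.packages=org.apache.spark:spark-avro_2.12:3.1.2
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Example Maven coordinates only; replace with the jars you actually need.
        .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
        # Jars already uploaded to GCS can be listed directly instead:
        # .config("spark.jars", "gs://my-bucket/libs/my-lib.jar")
        .getOrCreate()
    )

Note that the config only takes effect if it is set before the notebook kernel creates its Spark session.
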
5 votes · 2 answers

How to write a file using FileWriter to google dataproc?

I have a java spark application where the output from the spark job needs to be collected and then saved into a csv file. This is my code below: fileWriter = new FileWriter("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder/Test.csv",…
asked by Vishnu P N (415)
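
java.io.FileWriter only understands local paths, so a gs:// URI cannot be opened with it; the usual alternative is to write through Spark (or the Hadoop FileSystem API), since Dataproc ships the GCS connector. A minimal PySpark sketch, with a made-up DataFrame standing in for the collected results:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-csv-to-gcs").getOrCreate()

    # Stand-in for the job's collected output.
    results = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

    # The GCS connector on Dataproc resolves gs:// paths, so writing through
    # Spark works where a plain java.io.FileWriter cannot.
    (results.coalesce(1)
        .write.mode("overwrite")
        .option("header", True)
        .csv("gs://dataflow-exp1/google_storage_tests/20170524/outputfolder"))
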
5 votes · 1 answer

How to manage conflicting DataProc Guava, Protobuf, and GRPC dependencies

I am working on a scala Spark job which needs to use java library (youtube/vitess) which is dependent upon newer versions of GRPC (1.01), Guava (19.0), and Protobuf (3.0.0) than currently provided on the DataProc 1.1 image. When running the project…
5 votes · 1 answer

How to read and write data in Google Cloud Bigtable in PySpark application?

I am using Spark on a Google Cloud Dataproc cluster and I would like to access Bigtable in a PySpark job. Do we have any Bigtable connector for Spark like Google BigQuery connector? How can we access Bigtable from a PySpark application?
5 votes · 1 answer

How can I connect to a Google Dataproc cluster from Sparklyr?

I'm new to Spark and GCP. I've tried to connect to it with sc <- spark_connect(master = "IP address") but it obviously couldn't work (e.g. there is no authentication). How should I do that? Is it possible to connect to it from outside Google Cloud?
5 votes · 3 answers

Dataproc: configure Spark driver and executor log4j properties

As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is changing the corresponding log4j.properties. However, on dataproc Spark runs on Yarn, therefore we have to adjust the global configuration and not…
asked by Frank (406)
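
Two lighter-weight options often suffice and are sketched below: pass --driver-log-levels at submit time, or lower the level from inside the application; the WARN threshold is just an example:

    # At submit time (per job, no cluster-wide log4j.properties change needed):
    #   gcloud dataproc jobs submit pyspark my_job.py --cluster=my-cluster \
    #       --driver-log-levels root=WARN
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quiet-logs").getOrCreate()
    # Applies to this application's JVM-side loggers only.
    spark.sparkContext.setLogLevel("WARN")
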
5 votes · 1 answer

How do you use the Google DataProc Java Client to submit spark jobs using jar files and classes in associated GS bucket?

I need to trigger Spark jobs to aggregate data from a JSON file using an API call. I use spring-boot to create the resources. Thus, the steps for the solution are the following: the user makes a POST request with a JSON file as the input; the JSON file…
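
The question concerns the Java client; for orientation, here is a sketch of the same submission shape using the Python client library (bucket, jar, and cluster names are placeholders, and the request layout assumes the v1 SparkJob message):

    from google.cloud import dataproc_v1

    region = "us-central1"  # placeholder region
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "spark_job": {
            # Jar previously uploaded to the GCS bucket associated with the cluster.
            "main_jar_file_uri": "gs://my-bucket/jars/aggregator.jar",
            "args": ["gs://my-bucket/input/data.json"],
        },
    }

    operation = job_client.submit_job_as_operation(
        project_id="my-project", region=region, job=job
    )
    response = operation.result()  # wait for the job to finish
    print(response.status.state)
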
4 votes · 1 answer

Google Cloud Dataproc serverless (batch) pyspark reads parquet file from Google Cloud Storage (GCS) very slowly

I have an inverse frequency parquet file of the wiki corpus on Google Cloud Storage (GCS). I want to load it from GCS to dataproc serverless (batch). However, the time to load the parquet with pyspark.read on dataproc batch is much slower than my…
4 votes · 2 answers

Interval 30 days getting converted to Interval 4 weeks 2 days

I am getting an error in PySpark code when I use the below query in a spark.sql call, although it works when I run it in BigQuery directly. df = spark.sql('''SELECT h.src, AVG(h.norm_los) AS mbr_avg FROM UM h WHERE h.orig_dt <…
4 votes · 1 answer

How to force delete dataproc serverless batch

I am running a pyspark dataproc serverless batch. It has been running for too long so I decided to delete it. But neither the GCP console nor the CLI allow me to delete the batch. The command I tried is gcloud dataproc batches delete
asked by Afaq (1,146)
4 votes · 1 answer

SparkJob on GCP dataproc failing with error - java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.&lt;init&gt;(ZIIIIIIZ)V

I'm running a spark job on GCP dataproc using below command, gcloud dataproc workflow-templates instantiate-from-file --file=job_config.yaml --region us-east1 Below is my job_config.yaml jobs: - sparkJob: args: - filepath mainJarFileUri:…