Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
5 votes • 1 answer

Component Gateway activation on Dataproc does not work with the Composer (Airflow) operator airflow.providers.google.cloud.operators.dataproc

I'm trying to execute the DAG below. It seems that the operator creating a Dataproc cluster does not enable the optional components needed for Jupyter notebook and Anaconda. I found this code here: Component Gateway with DataprocOperator on…
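
For the question above, a minimal hedged sketch of the kind of DAG being described, using the current airflow.providers.google Dataproc operator; the project, region, cluster name and machine types are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

# Placeholder cluster config; "optional_components" and "endpoint_config" are the
# Dataproc API fields that turn on Jupyter/Anaconda and the Component Gateway.
CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "software_config": {"optional_components": ["ANACONDA", "JUPYTER"]},
    "endpoint_config": {"enable_http_port_access": True},  # Component Gateway
}

with DAG(
    "dataproc_component_gateway",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",      # placeholder
        region="europe-west1",        # placeholder
        cluster_name="demo-cluster",  # placeholder
        cluster_config=CLUSTER_CONFIG,
    )
```
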
5 votes • 1 answer

Custom Container Image for Google Dataproc pyspark Batch Job

I am exploring the newly introduced Google Dataproc Serverless. While submitting a job, I want to use a custom image (via the --container-image argument) so that all my Python libraries and related files are already present on the server, such that the job…
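
For reference, a hedged sketch of submitting a serverless (batch) PySpark job with a custom container through the google-cloud-dataproc Python client; the project, region, GCS path and image URI are placeholders:

```python
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/main.py",  # placeholder
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        # Equivalent of `--container-image` on `gcloud dataproc batches submit pyspark`.
        container_image=f"{region}-docker.pkg.dev/{project_id}/my-repo/pyspark-deps:latest",
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="custom-image-batch",
)
operation.result()  # block until the batch resource is created
```
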
5 votes • 0 answers

What is the recommended cluster size for a Spark job with 35,000 partitions

I'm using Dataproc 1.4 and I have a Spark job with 35,000 partitions (the input size is 3.4 TB). I'm using a 120-node cluster of n1-standard-4 machines (so 480 CPUs). The problem is that I ran into network errors during shuffles (same results with…
Yann Moisan • 8,161 • 8 • 47 • 91
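
Back-of-the-envelope arithmetic from the numbers quoted in the question above (3.4 TB input, 35,000 partitions, 120 × n1-standard-4 = 480 vCPUs), purely illustrative:

```python
input_mb = 3.4 * 1024 * 1024   # 3.4 TB expressed in MB
partitions = 35_000
vcpus = 120 * 4                # 120 n1-standard-4 nodes

print(f"{input_mb / partitions:.0f} MB per partition")   # ~100 MB per partition
print(f"{partitions / vcpus:.0f} task waves per stage")  # ~73 waves across 480 cores
```
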
5 votes • 0 answers

SparkR code fails if Apache Arrow is enabled

I am running the gapply function on a SparkR DataFrame which looks like the below: df<-gapply(sp_Stack, function(key,e) { Sys.setlocale('LC_COLLATE','C') suppressPackageStartupMessages({ library(Rcpp) library(Matrix) …
5 votes • 1 answer

Dataproc cluster fails to initialize

With the standard Dataproc image 1.5 (Debian 10, Hadoop 2.10, Spark 2.4), a Dataproc cluster cannot be created. The region is set to europe-west-2. The Stackdriver log says: "Failed to initialize node -m: Component hdfs failed to…
tak • 85 • 6
5 votes • 1 answer

Unable to import airflow providers package

I am unable to import the Airflow providers package for Google. The command I used was pip3 install apache-airflow-backport-providers-google, and it gives me the error ERROR: Could not find a version that satisfies the requirement…
5 votes • 1 answer

Access Google Cloud Kubernetes services from Dataproc

I have a Kubernetes service that collects models. The system that builds these models is a Python Dataproc job, so I need a way to push the result of the Dataproc job to the model collection service. Question: how do I access the service in the…
simsi • 533 • 3 • 16
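
A hedged sketch of the "push" direction asked about above, assuming the model-collection service is reachable from the Dataproc VMs (for example through a Kubernetes Service exposed on an internal load-balancer IP in the same VPC); the URL and payload shape are hypothetical:

```python
import requests

# Hypothetical internal address of the model-collection service; in practice this
# would be the internal load-balancer IP (or DNS name) of the Kubernetes Service.
MODEL_SERVICE_URL = "http://10.128.0.42:8080/models"

def push_model(name: str, model_bytes: bytes) -> None:
    """Upload a serialized model produced by the Dataproc job."""
    resp = requests.post(
        MODEL_SERVICE_URL,
        files={"model": (name, model_bytes)},
        timeout=60,
    )
    resp.raise_for_status()
```
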
5 votes • 1 answer

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL,…
Dennis Huo • 10,517 • 27 • 43
5 votes • 1 answer

Can I use Cloud Dataproc with an external Hive Metastore?

By default, Cloud Dataproc runs a Hive metastore local to the Dataproc cluster. This means the metastore is ephemeral with the cluster, and it can be a pain to have multiple clusters share a single metastore. Is it possible to point Dataproc clusters…
James • 2,321 • 14 • 30
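
A hedged sketch of one documented pattern for the question above: pointing the cluster's Hive properties at an external metastore database at creation time via the google-cloud-dataproc client. The JDBC URL, credentials, bucket and project are placeholders, and they assume the database is reachable from the cluster (for example through the Cloud SQL proxy initialization action):

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "shared-metastore-cluster",  # placeholder
    "config": {
        "software_config": {
            "properties": {
                # Hive settings pointing at an external MySQL metastore (placeholders).
                "hive:javax.jdo.option.ConnectionURL": "jdbc:mysql://10.0.0.5/hive_metastore",
                "hive:javax.jdo.option.ConnectionUserName": "hive",
                "hive:javax.jdo.option.ConnectionPassword": "hive-password",
                "hive:hive.metastore.warehouse.dir": "gs://my-bucket/hive-warehouse",
            }
        }
    },
}

client.create_cluster(project_id="my-project", region=region, cluster=cluster)
```
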
5 votes • 2 answers

Can Google Cloud Data Catalog be used as a metadata repository for Dataproc (Spark/Hive/Presto) and also GCS files?

We are using MySQL (Cloud SQL) as the metadata repository for Dataproc. This doesn't store any information about GCS files that are not part of Hive external tables. Can anyone suggest the best way to store all the file/data details in one…
5 votes • 1 answer

Connecting to remote Dataproc master in SparkSession

I created a 3-node (1 master, 2 workers) Apache Spark cluster on Google Cloud Dataproc. I'm able to submit jobs to the cluster when connecting through SSH to the master; however, I can't get it to work remotely. I can't find any documentation about…
Juta • 411 • 1 • 5 • 12
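
For context, a sketch of the kind of remote connection usually attempted in situations like the question above (the IP is a placeholder). Note that Dataproc runs Spark on YARN rather than a standalone master, which is a common reason this approach fails and why jobs are normally sent through gcloud dataproc jobs submit or the Dataproc Jobs API instead:

```python
from pyspark.sql import SparkSession

# Hypothetical external IP of the Dataproc master VM; a standalone-style master URL
# like this generally does not work against Dataproc, since Spark runs on YARN there.
spark = (
    SparkSession.builder
    .appName("remote-connection-test")
    .master("spark://35.200.12.34:7077")
    .getOrCreate()
)
```
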
5 votes • 2 answers

Run Bash script on GCP Dataproc

I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script. Currently these scripts run on AWS with the help of script-runner.jar. I am not…
Foram Shah • 77 • 1 • 5
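
A hedged sketch of one commonly used workaround for the question above, comparable to AWS's script-runner.jar: submit a Pig job whose queries copy the shell script from GCS and run it with Pig's sh command. The project, cluster, bucket and script names are placeholders:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},  # placeholder
    "pig_job": {
        "query_list": {
            "queries": [
                # Copy the script locally, make it executable, then run it with
                # whatever (dynamically computed) arguments are needed.
                "fs -cp -f gs://my-bucket/scripts/run_pig.sh file:///tmp/run_pig.sh",
                "sh chmod 750 /tmp/run_pig.sh",
                "sh /tmp/run_pig.sh arg1 arg2",
            ]
        }
    },
}

client.submit_job(project_id="my-project", region=region, job=job)
```
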
5 votes • 2 answers

dataproc job submission failing with 'Not authorized to requested resource', what permission is missing?

We have an existing Dataproc estate and we control access using Dataproc's predefined roles. We would like to limit the permissions that our user base has across our GCP projects, hence we are replacing the use of predefined roles with custom roles. I…
jamiet • 10,501 • 14 • 80 • 159
5 votes • 1 answer

Spark set to read from earliest offset - throws error on attempting to consume an offset no longer available on Kafka

I am currently running a Spark job on Dataproc and am getting errors trying to re-join a group and read data from a Kafka topic. I have done some digging and am not sure what the issue is. I have auto.offset.reset set to earliest, so it should be…
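
A minimal sketch of the relevant options for the question above, assuming the job uses Spark Structured Streaming's Kafka source (broker and topic names are placeholders). startingOffsets only applies to the first run of a query; after that the checkpointed offsets win, and failOnDataLoss governs what happens when those offsets have already been aged out of the topic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-earliest").getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder
    .option("subscribe", "my-topic")                      # placeholder
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")  # don't abort when requested offsets were purged
    .load()
)
```
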
5 votes • 1 answer

Deadlock in Google Storage API

I'm running a Spark job on Dataproc which reads lots of files from a bucket and consolidates them into one big file. I'm using google-api-services-storage 1.29.0 by shading it. Until now it worked fine, consolidating ~20-30K files. Today I tried it…