Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
5 votes • 1 answer

Component Gateway activation on Dataproc does not work with the Composer (Airflow) operator airflow.providers.google.cloud.operators.dataproc

I'm trying to execute the DAG below. It seems that the operator creating a Dataproc cluster does not enable the optional components needed for Jupyter notebook and Anaconda. I found this code here: Component Gateway with DataprocOperator on…
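
For the question above, a minimal hedged sketch of the kind of DAG being described, using the current airflow.providers.google Dataproc operator; the project, region, cluster name and machine types are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

# Placeholder cluster config; "optional_components" and "endpoint_config" are the
# Dataproc API fields that turn on Jupyter/Anaconda and the Component Gateway.
CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "software_config": {"optional_components": ["ANACONDA", "JUPYTER"]},
    "endpoint_config": {"enable_http_port_access": True},  # Component Gateway
}

with DAG(
    "dataproc_component_gateway",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id="my-project",      # placeholder
        region="europe-west1",        # placeholder
        cluster_name="demo-cluster",  # placeholder
        cluster_config=CLUSTER_CONFIG,
    )
```
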
5 votes • 1 answer

Custom Container Image for Google Dataproc pyspark Batch Job

I am exploring the newly introduced Google Dataproc Serverless. While submitting a job, I want to use a custom image (via the --container-image argument) so that all my Python libraries and related files are already present on the server, such that the job…
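
For reference, a hedged sketch of submitting a serverless (batch) PySpark job with a custom container through the google-cloud-dataproc Python client; the project, region, GCS path and image URI are placeholders:

```python
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/main.py",  # placeholder
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        # Equivalent of `--container-image` on `gcloud dataproc batches submit pyspark`.
        container_image=f"{region}-docker.pkg.dev/{project_id}/my-repo/pyspark-deps:latest",
    ),
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="custom-image-batch",
)
operation.result()  # block until the batch resource is created
```
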
5 votes • 0 answers

What is the recommended cluster size for a Spark job with 35,000 partitions

I'm using Dataproc 1.4 and I have a Spark job with 35,000 partitions (the input size is 3.4 TB). I'm using a 120-node cluster of n1-standard-4 machines (so 480 CPUs). The problem is that I ran into network errors during shuffles (same results with…
Yann Moisan • 8,161 • 8 • 47 • 91
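
Back-of-the-envelope arithmetic from the numbers quoted in the question above (3.4 TB input, 35,000 partitions, 120 × n1-standard-4 = 480 vCPUs), purely illustrative:

```python
input_mb = 3.4 * 1024 * 1024   # 3.4 TB expressed in MB
partitions = 35_000
vcpus = 120 * 4                # 120 n1-standard-4 nodes

print(f"{input_mb / partitions:.0f} MB per partition")   # ~100 MB per partition
print(f"{partitions / vcpus:.0f} task waves per stage")  # ~73 waves across 480 cores
```
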
5 votes • 0 answers

SparkR code fails if Apache Arrow is enabled

I am running the gapply function on a SparkR DataFrame which looks like the below: df<-gapply(sp_Stack, function(key,e) { Sys.setlocale('LC_COLLATE','C') suppressPackageStartupMessages({ library(Rcpp) library(Matrix) …
5 votes • 1 answer

Dataproc cluster fails to initialize

With the standard Dataproc image 1.5 (Debian 10, Hadoop 2.10, Spark 2.4), a Dataproc cluster cannot be created. The region is set to europe-west-2. The Stackdriver log says: "Failed to initialize node -m: Component hdfs failed to…
tak • 85 • 6
5 votes • 1 answer

Unable to import airflow providers package

I am unable to import the Airflow providers package for Google. The command I used was pip3 install apache-airflow-backport-providers-google, and it gives me the error ERROR: Could not find a version that satisfies the requirement…
5 votes • 1 answer

Access Google Cloud Kubernetes services from Dataproc

I have a Kubernetes service that collects models. The system that builds these models is a Python Dataproc job, so I need a way to push the result of the Dataproc job to the model collection service. Question: how do I access the service in the…
simsi • 533 • 3 • 16
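
A hedged sketch of the "push" direction asked about above, assuming the model-collection service is reachable from the Dataproc VMs (for example through a Kubernetes Service exposed on an internal load-balancer IP in the same VPC); the URL and payload shape are hypothetical:

```python
import requests

# Hypothetical internal address of the model-collection service; in practice this
# would be the internal load-balancer IP (or DNS name) of the Kubernetes Service.
MODEL_SERVICE_URL = "http://10.128.0.42:8080/models"

def push_model(name: str, model_bytes: bytes) -> None:
    """Upload a serialized model produced by the Dataproc job."""
    resp = requests.post(
        MODEL_SERVICE_URL,
        files={"model": (name, model_bytes)},
        timeout=60,
    )
    resp.raise_for_status()
```
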
5 votes • 1 answer

How can I run uncompiled Spark Scala/spark-shell code as a Dataproc job?

Normally, if I'm using Scala for Spark jobs I'll compile a jarfile and submit it with gcloud dataproc jobs submit spark, but sometimes for very lightweight jobs I might be using uncompiled Scala code in a notebook or using the spark-shell REPL,…
Dennis Huo • 10,517 • 27 • 43
5 votes • 1 answer

Can I use Cloud Dataproc with an external Hive Metastore?

By default, Cloud Dataproc runs a Hive metastore local to the Dataproc cluster. This means the metastore is ephemeral with the cluster, and it can be a pain to have multiple clusters share a single metastore. Is it possible to point Dataproc clusters…
James • 2,321 • 14 • 30
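
A hedged sketch of one documented pattern for the question above: pointing the cluster's Hive properties at an external metastore database at creation time via the google-cloud-dataproc client. The JDBC URL, credentials, bucket and project are placeholders, and they assume the database is reachable from the cluster (for example through the Cloud SQL proxy initialization action):

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "cluster_name": "shared-metastore-cluster",  # placeholder
    "config": {
        "software_config": {
            "properties": {
                # Hive settings pointing at an external MySQL metastore (placeholders).
                "hive:javax.jdo.option.ConnectionURL": "jdbc:mysql://10.0.0.5/hive_metastore",
                "hive:javax.jdo.option.ConnectionUserName": "hive",
                "hive:javax.jdo.option.ConnectionPassword": "hive-password",
                "hive:hive.metastore.warehouse.dir": "gs://my-bucket/hive-warehouse",
            }
        }
    },
}

client.create_cluster(project_id="my-project", region=region, cluster=cluster)
```
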
5 votes • 2 answers

Can Google Cloud Data Catalog be used as a metadata repository for Dataproc (Spark/Hive/Presto) and also GCS files?

We are using MySQL (Cloud SQL) as the metadata repository for Dataproc. This doesn't store any information about GCS files that are not part of Hive external tables. Can anyone suggest the best way to store all the file/data details in one…
5 votes • 1 answer

Connecting to remote Dataproc master in SparkSession

I created a 3-node (1 master, 2 workers) Apache Spark cluster on Google Cloud Dataproc. I'm able to submit jobs to the cluster when connecting through SSH to the master; however, I can't get it to work remotely. I can't find any documentation about…
Juta • 411 • 1 • 5 • 12
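
For context, a sketch of the kind of remote connection usually attempted in situations like the question above (the IP is a placeholder). Note that Dataproc runs Spark on YARN rather than a standalone master, which is a common reason this approach fails and why jobs are normally sent through gcloud dataproc jobs submit or the Dataproc Jobs API instead:

```python
from pyspark.sql import SparkSession

# Hypothetical external IP of the Dataproc master VM; a standalone-style master URL
# like this generally does not work against Dataproc, since Spark runs on YARN there.
spark = (
    SparkSession.builder
    .appName("remote-connection-test")
    .master("spark://35.200.12.34:7077")
    .getOrCreate()
)
```
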
5 votes • 2 answers

Run Bash script on GCP Dataproc

I want to run a shell script on Dataproc which will execute my Pig scripts with arguments. These arguments are always dynamic and are calculated by the shell script. Currently these scripts run on AWS with the help of script-runner.jar. I am not…
Foram Shah • 77 • 1 • 5
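
A hedged sketch of one commonly used workaround for the question above, comparable to AWS's script-runner.jar: submit a Pig job whose queries copy the shell script from GCS and run it with Pig's sh command. The project, cluster, bucket and script names are placeholders:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},  # placeholder
    "pig_job": {
        "query_list": {
            "queries": [
                # Copy the script locally, make it executable, then run it with
                # whatever (dynamically computed) arguments are needed.
                "fs -cp -f gs://my-bucket/scripts/run_pig.sh file:///tmp/run_pig.sh",
                "sh chmod 750 /tmp/run_pig.sh",
                "sh /tmp/run_pig.sh arg1 arg2",
            ]
        }
    },
}

client.submit_job(project_id="my-project", region=region, job=job)
```
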
5 votes • 2 answers

dataproc job submission failing with 'Not authorized to requested resource', what permission is missing?

We have an existing Dataproc estate and we control access using Dataproc's predefined roles. We would like to limit the permissions that our user base has across our GCP projects, hence we are replacing the use of predefined roles with custom roles. I…
jamiet • 10,501 • 14 • 80 • 159
5 votes • 1 answer

Spark set to read from earliest offset - throws error on attempting to consume an offset no longer available on Kafka

I am currently running a Spark job on Dataproc and am getting errors trying to re-join a group and read data from a Kafka topic. I have done some digging and am not sure what the issue is. I have auto.offset.reset set to earliest, so it should be…
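
A minimal sketch of the relevant options for the question above, assuming the job uses Spark Structured Streaming's Kafka source (broker and topic names are placeholders). startingOffsets only applies to the first run of a query; after that the checkpointed offsets win, and failOnDataLoss governs what happens when those offsets have already been aged out of the topic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-earliest").getOrCreate()

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder
    .option("subscribe", "my-topic")                      # placeholder
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")  # don't abort when requested offsets were purged
    .load()
)
```
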
5 votes • 1 answer

Deadlock in Google Storage API

I'm running a Spark job on Dataproc which reads lots of files from a bucket and consolidates them into one big file. I'm using google-api-services-storage 1.29.0 by shading it. Until now it worked fine, consolidating ~20-30K files. Today I tried it…