Questions tagged [dataproc]
130 questions
3
votes
1 answer
Autoscaling metrics on GCP Dataproc on YARN
Why is GCP Dataproc's cluster autoscaling, when using YARN as the resource manager, based on memory requests and NOT cores? Is this a limitation of Dataproc or of YARN, or am I missing something?
Reference:…

Jiten Savla
- 33
- 4
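For context on the question above, here is a minimal sketch of an autoscaling policy created with the Python client; the project, region, and threshold values are illustrative. The scale_up_factor/scale_down_factor knobs act on pending and available YARN memory, which is exactly the memory-based behaviour the question asks about.

from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

# Illustrative values throughout; Dataproc's basic algorithm scales on
# pending/available YARN *memory*, and these knobs tune that behaviour.
client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
policy = dataproc_v1.AutoscalingPolicy(
    id="memory-based-policy",
    basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
        yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
            scale_up_factor=0.5,
            scale_down_factor=1.0,
            graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
        ),
    ),
    worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
        min_instances=2, max_instances=10
    ),
)
client.create_autoscaling_policy(
    parent="projects/my-project/regions/us-central1", policy=policy
)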
3
votes
1 answer
Install PIP packages on an existing Dataproc cluster
Is there a way to use
pip install
or something like that to install packages on an existing Dataproc cluster? Or will I need to re-create the cluster and set the packages in PIP_PACKAGES?

Danilo
- 123
- 7
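There is no managed way to pip-install onto an already-running cluster through the API; the common workaround is to SSH to each node and run pip manually, which does not survive autoscaling. A sketch of the re-creation route with the Python client, assuming the public pip-install initialization action; all names and package pins are illustrative:

from google.cloud import dataproc_v1

# PIP_PACKAGES is read by the public pip-install initialization action
# when each node boots; names and versions here are illustrative.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
cluster = {
    "project_id": "my-project",
    "cluster_name": "my-cluster",
    "config": {
        "gce_cluster_config": {
            "metadata": {"PIP_PACKAGES": "pandas==1.5.3 requests==2.31.0"}
        },
        "initialization_actions": [{
            "executable_file": "gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh"
        }],
    },
}
operation = client.create_cluster(
    project_id="my-project", region="us-central1", cluster=cluster
)
operation.result()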
2
votes
1 answer
YARN CPU usage and the result of htop on workers are inconsistent. I am running a Spark cluster on Dataproc
I am on a Dataproc-managed Spark cluster
OS = Ubuntu 18.04
Spark version = 3.3.0
My cluster configuration is as follows:
Master
Memory = 7.5 GiB
Cores = 2
Primary disk size = 32 GB
Workers
Cores = 16
RAM = 16 GiB
Available to YARN = 13536…

figs_and_nuts
- 4,870
- 2
- 31
- 56
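One hedged explanation for the question above: with YARN's default memory-only resource calculator, the vcore figures in the YARN UI are allocation bookkeeping rather than measured CPU, while htop shows what the JVMs actually consume, so the two can legitimately disagree. A sketch of pinning executor sizes so the two views are at least comparable; the sizing values are illustrative:

from pyspark.sql import SparkSession

# spark.executor.cores is what YARN *records* per container; htop
# reports real CPU burn, so they can diverge under the memory-only
# DefaultResourceCalculator. Values below are illustrative.
spark = (
    SparkSession.builder
    .appName("vcore-accounting-demo")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.instances", "3")
    .getOrCreate()
)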
2
votes
0 answers
Server error: Internal server error: module 'google.auth.credentials' has no attribute 'CredentialsWithTokenUri'
I am trying to create a Dataproc cluster with the following Python packages.
"PIP_PACKAGES": "google-cloud-bigquery==3.10.0 google-resumable-media[requests]==2.5.0 google-cloud-storage==2.8.0 google-cloud-secret-manager==2.16.1 google-ads==21.0.0"
The…

Dhomse N
- 23
- 2
2
votes
0 answers
Why are PySpark jobs not running in parallel even though the cluster has enough memory in a GCP Dataproc cluster?
I have a .yaml file with 5 independent PySpark jobs, meaning all 5 should run concurrently on GCP Dataproc, and I have scheduled this .yaml file in crontab for every 30 minutes.
I have enough memory in the cluster as well to run all these jobs in…

Subhash bhat
- 21
- 2
2
votes
1 answer
Get console job output text from Dataproc using the REST API
I need to retrieve the Dataproc job output text using the REST API. I am only able to find logs through Cloud Logging. Can someone let me know whether it is possible to retrieve the job output text through the REST API? If yes, how?

Help_me_a_bit
- 103
- 5
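The jobs.get response carries a driverOutputResourceUri field pointing at the GCS prefix where Dataproc streams the driver's console output, so fetching those objects yields the text. A sketch with the Python clients; project, region, and job id are illustrative:

from google.cloud import dataproc_v1, storage

jobs = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
job = jobs.get_job(project_id="my-project", region="us-central1", job_id="my-job-id")

# driver_output_resource_uri looks like gs://<bucket>/<...>/driveroutput
bucket_name, _, prefix = job.driver_output_resource_uri[len("gs://"):].partition("/")
gcs = storage.Client()
output = "".join(
    blob.download_as_text()
    for blob in gcs.list_blobs(bucket_name, prefix=prefix)
)
print(output)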
2
votes
0 answers
Is there any way to get the error code and error message directly from the Dataproc API
We are currently creating Dataproc clusters using the sample code below:
from google.cloud import dataproc_v1

def sample_create_cluster():
    # Create a client
    client = dataproc_v1.ClusterControllerClient()
    # Initialize request argument(s)
    …

ash_ketchum12
- 73
- 6
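For the question above: create_cluster returns a long-running operation, and calling result() re-raises any failure as a GoogleAPICallError, which carries the canonical code and message. A minimal sketch; the cluster spec is elided and the project and region are illustrative:

from google.api_core.exceptions import GoogleAPICallError
from google.cloud import dataproc_v1

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)
try:
    operation = client.create_cluster(
        project_id="my-project",
        region="us-central1",
        cluster=my_cluster_spec,  # cluster spec defined elsewhere
    )
    operation.result()  # blocks; raises if creation fails
except GoogleAPICallError as err:
    print(err.code)     # canonical error code
    print(err.message)  # human-readable error message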
2
votes
1 answer
ERROR when setting precision and scale for BIGNUMERIC data type in BigQuery schema using Python
I am running my Python code on a GCP Dataproc cluster, using the spark-bigquery-with-dependencies_2.12-0.24.2.jar file. I am trying to create a table in BigQuery using the Python client library as below:
from google.cloud import bigquery
client…

Ethan Alberto
- 21
- 1
2
votes
0 answers
Dataproc YARN UI - wrong number of vcores
I'm running a Spark application on a Dataproc cluster (n1-standard-16) with 4 machines (3 primary and 1 secondary).
In the idle scenario I can see 16 vcores available, which is expected.
But when my Spark application is running, it goes above 16, i.e. 32… like…

Aravind
- 55
- 5
2
votes
1 answer
Suppressing INFO logs from BigQuery when using PySpark
I'm using Dataproc to fetch data from some BigQuery tables and I'm being inundated with INFO log messages from what I think is the BigQuery connector. I want to shut these off unless I hit an error. For example, this is what I get:
22/07/15 14:24:04…

Frank Pinto
- 134
- 12
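One workaround, assuming an image where Spark still routes through the log4j 1.x API: raise the log level globally, or just for the connector's package. The logger name below is an assumption about where the noise originates:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Blunt option: silence everything below ERROR on the driver.
spark.sparkContext.setLogLevel("ERROR")

# Narrower option (assumes the log4j 1.x bridge is available and that
# the noisy messages come from the connector's package):
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger("com.google.cloud.spark.bigquery") \
    .setLevel(log4j.Level.ERROR)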
2
votes
1 answer
Issue with Dataproc PySpark job not exiting even though logs have errors
I can see errors in the logs multiple times for my PySpark job in Dataproc, but the job doesn't exit and keeps running for multiple hours.
Any help to solve this is much appreciated.
The data the job is running on is also very small.
Sometimes…

Help_me_a_bit
- 103
- 5
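A common pattern when this happens: an exception is logged by an executor or a library but never propagates to the driver's exit code, so YARN keeps the application alive. A defensive sketch that forces the driver to terminate with a nonzero status; the job body is elided:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def main():
    ...  # the actual job body goes here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Ensure the driver dies with a nonzero code so Dataproc marks
        # the job failed instead of letting it hang.
        spark.stop()
        sys.exit(1)
    spark.stop()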
2
votes
1 answer
Change Java version on the master node of Dataproc
I have created a Dataproc cluster in Google Cloud, and on the master node I can see that the Java version is 8.
I need to use Java version 11. How can we do that?
Can we edit the existing cluster, or can we specify it while creating a new cluster?

maxmadroad
- 53
- 6
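The Java major version is baked into the Dataproc image rather than being a per-node setting, so the usual route is a new cluster on an image that ships the JDK you need; my understanding, worth verifying against the image release notes, is that 2.1 images ship Java 11. A sketch of the relevant fragment, other fields elided:

# Fragment of a cluster spec; "2.1-debian11" is an assumption based on
# Dataproc 2.1 images shipping Java 11 - check the image release notes.
cluster_config_fragment = {
    "software_config": {
        "image_version": "2.1-debian11",
    },
}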
2
votes
0 answers
REST to equivalent Unix command for dataproc jobs submit spark
I have the configuration and cluster set up in GCP and I can submit a Spark job, but I am trying to run gcloud dataproc jobs submit spark from my CLI with the same configuration.
I've set up the service account locally; I am just unable to build the…

Sumit Kumar
- 21
- 3
2
votes
2 answers
How to retrieve the jobId of a job submitted via Dataproc from within the Spark job
I want to get the jobId of the running Spark job from within the Spark context.
Does Dataproc store this info in the Spark context?

Shreya Singhal
- 41
- 4
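A widely cited workaround, though not a stable documented API: Dataproc tags the YARN application with the job id, and the tag is visible through the spark.yarn.tags property. A sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dataproc attaches a "dataproc_job_<id>" tag to the YARN application;
# parsing spark.yarn.tags is a workaround, not a documented contract.
tags = spark.sparkContext.getConf().get("spark.yarn.tags", "")
job_id = next(
    (t[len("dataproc_job_"):] for t in tags.split(",")
     if t.startswith("dataproc_job_")),
    None,
)
print(job_id)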
2
votes
0 answers
Overriding Java security properties on a Dataproc cluster to overcome an SSL handshake issue with MS SQL 2019
I am stuck on a problem where a Java security setting prevents my Dataproc cluster (image 2.0.32-debian10) running PySpark from connecting to SQL Server 2019 with the Spark/JDBC connector…

user18456448
- 41
- 1
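The usual workaround for this class of failure is to point the JVMs at a java.security override that relaxes jdk.tls.disabledAlgorithms. A heavily hedged sketch, assuming an initialization action has already written /etc/spark/custom.java.security to every node; the path, file, and job names are assumptions:

# Assumes an init action wrote /etc/spark/custom.java.security on every
# node, e.g. redefining jdk.tls.disabledAlgorithms without TLSv1/TLSv1.1.
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/etl.py",
        "properties": {
            "spark.driver.extraJavaOptions":
                "-Djava.security.properties=/etc/spark/custom.java.security",
            "spark.executor.extraJavaOptions":
                "-Djava.security.properties=/etc/spark/custom.java.security",
        },
    },
}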