Questions tagged [dataproc]
130 questions
0
votes
0 answers
Physical mem used % and Physical Vcores Used % in spark 3 yarn
I'd like to understand what "Physical mem used %" and "Physical Vcores Used %" mean in Spark 3 on YARN.
I don't see these metrics in Spark 2.4, but I can see these new metrics in Spark 3 on YARN.
What is Physical mem used % ?
What is Physical Vcores Used %?
Even…

Mac
0
votes
1 answer
dataproc hadoop/spark job can not connect to cloudSQL via Private IP
I am facing an issue setting up private IP access between Dataproc and Cloud SQL, with a VPC network and peering configured. I would really appreciate help, since I have not been able to figure this out after two days of debugging and after following pretty much all…

Jay99
0
votes
1 answer
GCP Serverless pyspark : Illegal character in path at index
I'm trying to run a simple hello-world Python script on Serverless PySpark on GCP using gcloud (from a local Windows machine).
if __name__ == '__main__':
    print("Hello")
This always results in the error
=========== Cloud Dataproc Agent Error…
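A common trigger for "Illegal character in path" when submitting from Windows is a backslash or drive-letter path leaking into the batch configuration. A minimal submission sketch (region and bucket names are placeholders, not from the question; `--deps-bucket` is the staging-bucket flag of `gcloud dataproc batches submit pyspark`):

```shell
# Run from the directory containing hello.py and use forward slashes;
# Windows-style paths (C:\...) in file arguments can produce
# "Illegal character in path at index" errors in the Dataproc agent.
gcloud dataproc batches submit pyspark hello.py \
    --region=us-central1 \
    --deps-bucket=gs://my-staging-bucket
```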

Pankaj
0
votes
0 answers
How to get dask job logs on GCP logs explorer?
I'm using Dataproc to run my Dask job, written in Python. I can catch every log except those from the distributed (lazy) computation. I'm using Google Cloud Logging. The way I catch logs:
logging.getLogger(__name__).exception(e,…
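One likely cause (an assumption, since the question is truncated): Dask workers are separate processes, so a logging handler configured in the driver never sees their records. A minimal stdlib sketch of the pattern — on Dataproc you would run the setup function on every worker (e.g. `client.run(setup_worker_logging)`) and attach google-cloud-logging's handler instead of the `StreamHandler` used here for illustration:

```python
import io
import logging

def setup_worker_logging(stream=None):
    # Attach a handler to the root logger of the *current* process.
    # In a Dask cluster this must run inside each worker process,
    # not only in the driver, or lazily-executed task logs are lost.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)
    return handler

# Demonstration: an exception logged inside a "task" is captured once
# the process-local root logger has a handler.
buf = io.StringIO()
setup_worker_logging(buf)
try:
    1 / 0
except ZeroDivisionError as e:
    logging.getLogger(__name__).exception(e)
print("ERROR" in buf.getvalue())  # → True
```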

Jahd Jabre
0
votes
1 answer
Dataproc pyspark job total bytes billed
I have a PySpark job that I submitted via Dataproc. I would like to know how much data my job used, or in other words, how much GCP is going to bill me.
I looked at the INFORMATION_SCHEMA tables, but those don't show jobs run via Dataproc.
I am…

SomeRandomUser
0
votes
0 answers
How to run SHOW PARTITIONS on hive table using pyspark?
I am trying to run SHOW PARTITIONS on a Hive table using PySpark, but it fails with the error below. I am using a Dataproc cluster on GCP to run the PySpark job.
ivysettings.xml file not found in HIVE_HOME or…
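Inside PySpark the usual call is `spark.sql("SHOW PARTITIONS db.table")`. One hedged workaround sketch, assuming the ivysettings.xml error comes from the Hive client that PySpark spins up: submit the statement as a Dataproc Hive job instead (cluster, region and table names below are placeholders):

```shell
# Run the statement through Dataproc's Hive job type, bypassing the
# PySpark session's embedded Hive client entirely.
gcloud dataproc jobs submit hive \
    --cluster=my-cluster \
    --region=us-central1 \
    --execute="SHOW PARTITIONS my_db.my_table"
```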

majain
0
votes
0 answers
Pyspark jobs on dataproc using documents from Firestore
I need to run some simple PySpark jobs on big data stored in Google's Firestore.
The dataset contains 42 million documents regarding Instagram posts. I want to do some simple aggregations, like summing the number of likes per country…
0
votes
0 answers
GCP Dataproc : Scala SSH Tunnel Oracle database
I'm trying to run a Spark job from GCP Dataproc. It mainly reads data from AWS Oracle databases, so it connects to them via SSH. It works fine locally, but not in the GCP Dataproc cluster.
import com.jcraft.jsch.Session
import…

Jayachandran Nachimuthu
0
votes
0 answers
Dataproc PHS Yarn RM UI not able to read logs from remote-app-log-dir
I am working on setting up a Dataproc PHS (Persistent History Server) for my Spark and Hive applications. I was able to set up the Spark History Server in a standalone Dataproc cluster (PHS) by setting up the following…
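For the YARN side specifically, the PHS cluster has to point at the same GCS location that the job clusters aggregate their logs to. A hedged sketch of creating such a cluster (bucket and cluster names are placeholders; the properties are the documented Spark/YARN history settings, not taken from the question):

```shell
# The job clusters must aggregate YARN logs to the same
# remote-app-log-dir for the PHS RM UI to find them.
gcloud dataproc clusters create phs-cluster \
    --region=us-central1 \
    --single-node \
    --enable-component-gateway \
    --properties="yarn:yarn.log-aggregation-enable=true,\
yarn:yarn.nodemanager.remote-app-log-dir=gs://my-phs-bucket/yarn-logs,\
spark:spark.history.fs.logDirectory=gs://my-phs-bucket/*/spark-job-history"
```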

Vanshaj Bhatia
0
votes
0 answers
configuration of yml file for cloud workflow
I want to write a YAML file that creates a workflow to schedule and run my Dataflow and Dataproc Serverless jobs. Can you help me?
I tried:
# This is a sample cloud workflow YAML file for scheduling Dataflow jobs.
name: Scheduled Dataflow Job
#…
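A hedged sketch of what a Cloud Workflows definition for a Dataproc Serverless batch could look like. The project, region, bucket and file names are placeholders, and the `googleapis.dataproc` connector call is an assumption based on the Workflows connector naming scheme; scheduling would then be done by triggering the workflow from Cloud Scheduler.

```yaml
# Sketch only -- not a verified, complete workflow.
main:
  steps:
    - createBatch:
        call: googleapis.dataproc.v1.projects.locations.batches.create
        args:
          parent: projects/my-project/locations/us-central1
          body:
            pysparkBatch:
              mainPythonFileUri: gs://my-bucket/job.py
        result: batch
    - done:
        return: ${batch.name}
```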

sahar
0
votes
0 answers
Setup local docker Image for google Dataproc service
I am having trouble setting up the Google Docker image for the Dataproc service. I tried the steps in the Stack Overflow answer below:
https://stackoverflow.com/questions/69555415/gcp-dataproc-base-docker-image/74715158#74715158
but getting an error as below
PS…

Piyush Namra
0
votes
0 answers
How to choose preview-debian11 (dataproc-release-2.1) image to create dataproc cluster
How can I get the Debian 11 (dataproc-release-2.1) image to create a Dataproc cluster?
I found that Dataproc provides a preview-debian11 version, per the link below:
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters
I need…

Pongthorn Sa
0
votes
0 answers
Sqoop AVRO Dependency Issue in GCP dataproc 2.0.49 Image
I am facing a JAR dependency issue while connecting to an Oracle database using Sqoop. I am able to connect to the database, but not able to get the data from Oracle in Avro format.
The error message is:
[2022-11-22 05:43:40,031] {subprocess.py:92} INFO - Exception in…

Siri Vali
0
votes
1 answer
How to use new Spark Context
I am currently running a Jupyter notebook on GCP Dataproc and hoping to increase the memory available via my config.
I first stopped my Spark context:
import pyspark
sc = spark.sparkContext
sc.stop()
Waited until running the next code block so…

Curl
0
votes
2 answers
How to configure an alerting policy for failed Dataproc Batch?
I want to alert on the failure of any Dataproc Serverless job. I think I may need to create a log-based metric and then an alerting policy based on that metric.
I tried creating an alerting policy with the filter below:
filter =…
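A hedged sketch of what such a Cloud Logging filter could look like. The `resource.type` value is an assumption (it is the monitored-resource name usually associated with Dataproc batches), and gating on severity is only one possible way to detect failure:

```
resource.type="cloud_dataproc_batch"
severity>=ERROR
```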

Daniel Fletemier