Questions tagged [dataproc]

130 questions
0
votes
0 answers

Physical mem used % and Physical Vcores Used % in Spark 3 on YARN

I would like to understand what "Physical mem used %" and "Physical Vcores Used %" are in Spark 3 on YARN. I don't see these metrics in Spark 2.4, but I can see these new metrics in Spark 3 on YARN. What is Physical mem used %? What is Physical Vcores Used %? Even…
Mac
  • 1
0
votes
1 answer

Dataproc Hadoop/Spark job cannot connect to Cloud SQL via private IP

I am facing an issue setting up private IP access between Dataproc and Cloud SQL with a VPC network and peering configured. I would really appreciate help, since I have not been able to figure this out over the last two days of debugging, after following pretty much all…
0
votes
1 answer

GCP serverless PySpark: Illegal character in path at index

I'm trying to run a simple hello-world Python script on serverless PySpark on GCP using gcloud (from a local Windows machine). if __name__ == '__main__': print("Hello") This always results in the error =========== Cloud Dataproc Agent Error…
Pankaj
  • 2,220
  • 1
  • 19
  • 31
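
For reference, a hedged sketch of the submit command (the file name, region, and bucket are placeholders). One plausible cause of the "Illegal character in path" URI error on Windows is a backslash-style local path, so passing a forward-slash path or a gs:// URI is worth trying, though that is an assumption:

```
gcloud dataproc batches submit pyspark hello.py \
    --region=us-central1 \
    --deps-bucket=gs://my-staging-bucket
```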
0
votes
0 answers

How to get Dask job logs into the GCP Logs Explorer?

I'm using Dataproc to run my Dask job, written in Python. I can catch every log except those from the distributed (lazy) computation. I'm using Google Cloud Logging. The way I catch logs: logging.getLogger(__name__).exception(e,…
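
Logs emitted inside Dask tasks are produced on the workers, not in the client process, so client-side logging setup never sees them. A minimal sketch, assuming the standard `dask.distributed` `Client.run` API to configure logging on every worker, so that worker stdout/stderr (which Dataproc forwards to Cloud Logging) carries the task-side records:

```python
import logging

def setup_worker_logging(level=logging.INFO):
    # Runs once per worker: route task-side log records to stdout/stderr,
    # which Dataproc collects and ships to Cloud Logging.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # replace any handlers the worker already installed
    )
    return logging.getLogger().getEffectiveLevel()

# On the client (scheduler address is a placeholder):
# from dask.distributed import Client
# client = Client("scheduler-host:8786")
# client.run(setup_worker_logging)  # executes on every worker
```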
0
votes
1 answer

Dataproc PySpark job: total bytes billed

I have a PySpark job that I submitted via Dataproc. I would like to know how much data my job used, or in other words, how much GCP is going to bill me. I looked at the INFORMATION_SCHEMA tables, but those don't show jobs run via Dataproc. I am…
0
votes
0 answers

How to run SHOW PARTITIONS on a Hive table using PySpark?

I am trying to run SHOW PARTITIONS on a Hive table using PySpark, but it fails with the error below. I am using a Dataproc cluster on GCP to run the PySpark job. ivysettings.xml file not found in HIVE_HOME or…
majain
  • 31
  • 5
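
A minimal sketch of the usual approach, assuming a Hive-enabled SparkSession (the table name is a placeholder; on a Dataproc cluster the Hive metastore is preconfigured). The ivysettings.xml message itself may be only a warning from Hive's dependency resolver, with the real failure appearing later in the stack trace — that is an assumption to verify against the full error output:

```python
def partitions_sql(table):
    # Build the statement separately so it can be reused and tested.
    return f"SHOW PARTITIONS {table}"

def show_partitions(table):
    # Imported inside the function so the helper above stays usable
    # on machines without PySpark installed.
    from pyspark.sql import SparkSession
    spark = (
        SparkSession.builder
        .appName("show-partitions")
        .enableHiveSupport()  # required to reach the Hive metastore
        .getOrCreate()
    )
    spark.sql(partitions_sql(table)).show(truncate=False)

# On the cluster (placeholder table name):
# show_partitions("mydb.mytable")
```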
0
votes
0 answers

PySpark jobs on Dataproc using documents from Firestore

I need to run some simple PySpark jobs on big data stored in Google's Firestore. The dataset contains 42 million documents regarding Instagram posts. I want to do some simple aggregations, like summing the number of likes per country…
0
votes
0 answers

GCP Dataproc: Scala SSH tunnel to an Oracle database

I'm trying to run a Spark job from GCP Dataproc. It mainly reads data from AWS Oracle databases, so it connects to them via SSH. It works fine locally, but not in the GCP Dataproc cluster. import com.jcraft.jsch.Session import…
0
votes
0 answers

Dataproc PHS: YARN RM UI unable to read logs from remote-app-log-dir

I am working on setting up a Dataproc Persistent History Server (PHS) for my Spark and Hive applications. I was able to successfully set up the Spark History Server in a standalone Dataproc cluster (PHS) with the following…
0
votes
0 answers

Configuring a YAML file for Cloud Workflows

I want to write a YAML file that creates a workflow to schedule and run my Dataflow and Dataproc Serverless jobs. Can you help me? I tried: # This is a sample cloud workflow YAML file for scheduling Dataflow jobs. name: Scheduled Dataflow Job #…
0
votes
0 answers

Set up a local Docker image for the Google Dataproc service

I am having trouble setting up the Google Docker image for the Dataproc service. I tried the steps in the Stack Overflow answer below: https://stackoverflow.com/questions/69555415/gcp-dataproc-base-docker-image/74715158#74715158 but I am getting the error below. PS…
0
votes
0 answers

How to choose the preview-debian11 (dataproc-release-2.1) image when creating a Dataproc cluster

How can I get the Debian 11 (dataproc-release-2.1) image to create a Dataproc cluster? I found that Dataproc provides a preview-debian11 version, per the link below: https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters I need…
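
A hedged sketch of the create command, assuming the preview track is selected via the `--image-version` flag (cluster name and region are placeholders):

```
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --image-version=preview-debian11
```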
0
votes
0 answers

Sqoop Avro dependency issue in the GCP Dataproc 2.0.49 image

I am facing a JAR dependency issue while connecting to an Oracle database using Sqoop. I am able to connect to the database, but not able to get the data from Oracle in Avro format. The error message is: [2022-11-22 05:43:40,031] {subprocess.py:92} INFO - Exception in…
Siri Vali
  • 1
  • 1
0
votes
1 answer

How to use a new Spark context

I am currently running a Jupyter notebook on GCP Dataproc and hoping to increase the memory available via my config. I first stopped my Spark context: import pyspark sc = spark.sparkContext sc.stop() Waited until running the next code block so…
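
A hedged sketch of the stop-and-rebuild pattern using Spark 3's SparkSession API (the memory values are illustrative). Note that on YARN some properties, notably driver memory, generally only take effect when the application starts, so a full kernel restart may still be required — that caveat is an assumption to verify for the specific deploy mode:

```python
def memory_conf(executor_mem="8g", driver_mem="8g"):
    # Illustrative values; size them to the cluster's machine types.
    return {
        "spark.executor.memory": executor_mem,
        "spark.driver.memory": driver_mem,
    }

def restart_spark(conf):
    # Imported here so memory_conf stays usable without PySpark.
    from pyspark.sql import SparkSession
    active = SparkSession.getActiveSession()
    if active is not None:
        active.stop()  # stop the old context before rebuilding
    builder = SparkSession.builder.appName("resized-session")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# In the notebook:
# spark = restart_spark(memory_conf(executor_mem="12g"))
```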
0
votes
2 answers

How to configure an alerting policy for a failed Dataproc batch?

I want to alert on the failure of any serverless Dataproc job. I think I may need to create a log-based metric and then an alerting policy based on that metric. I tried creating an alerting policy with the filter below: filter =…
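
One common pattern is a log-based metric over batch failure entries, with the alerting policy defined on that metric. A hedged sketch of a possible Logs Explorer filter — the resource type and severity threshold are assumptions that should be checked against the actual entries a failed batch produces in Logs Explorer:

```
resource.type="cloud_dataproc_batch"
severity>=ERROR
```

The metric itself can then be created with `gcloud logging metrics create`, and the alerting policy pointed at it in Cloud Monitoring.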