Questions tagged [dataproc]

130 questions
0
votes
0 answers

Unable to read multiple files from Azure Blob with HTTPS signed URLs from Dataproc PySpark

I only have access to signed HTTPS URLs for the CSV files (a separate URL for each file), e.g. https://.blob.core.windows.net//.csv?sig=****st=****&se=****&sv=****&sp=r&sr=b Below is the code I am using: for…
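Spark has no built-in "https" filesystem, so a signed URL cannot be passed straight to spark.read.csv. One common workaround is a minimal sketch like the one below: fetch each signed URL on the driver with pandas, convert to Spark DataFrames, and union them. The URL placeholders and column layout are assumptions for illustration.

```python
# Minimal sketch: fetch each signed CSV URL with pandas (which can read
# directly from HTTPS), then union the resulting Spark DataFrames.
import pandas as pd
from functools import reduce
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("signed-url-read").getOrCreate()

# Placeholder SAS URLs; substitute the real signed URLs.
signed_urls = [
    "https://<account>.blob.core.windows.net/<container>/file1.csv?<sas-token>",
    "https://<account>.blob.core.windows.net/<container>/file2.csv?<sas-token>",
]

# pandas downloads each file on the driver; fine for modest file sizes.
dfs = [spark.createDataFrame(pd.read_csv(url)) for url in signed_urls]

# Union all per-file DataFrames into one (schemas must match).
combined = reduce(DataFrame.unionByName, dfs)
combined.show()
```

Since the files are downloaded on the driver, this pattern suits many small-to-medium CSVs; for very large files, copying the data into GCS first and reading it with gs:// paths scales better.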
0
votes
1 answer

How to submit a PySpark job from Google Cloud Shell, pass files and arguments to the submit command, and read them in the PySpark code

Can anyone please explain how to submit a PySpark job from Google Cloud Shell, pass files and arguments to the submit command, and read those files and arguments in the PySpark code?
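A minimal sketch of the usual pattern, with placeholder cluster, bucket, and file names: arguments after the bare `--` in the gcloud command arrive in sys.argv, and files passed with --files are staged alongside the job (exact staging can vary by deploy mode, so treat the relative-path read as a common convention rather than a guarantee).

```python
# job.py -- reading submitted files and arguments in a Dataproc PySpark job.
# Submit from Cloud Shell (names below are placeholders):
#
#   gcloud dataproc jobs submit pyspark job.py \
#       --cluster=my-cluster --region=us-central1 \
#       --files=gs://my-bucket/config.json \
#       -- arg1 arg2
#
import sys
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("args-demo").getOrCreate()

# Positional arguments passed after `--` on the gcloud command line.
first_arg, second_arg = sys.argv[1], sys.argv[2]

# Files passed with --files are typically staged into the job's working
# directory, so a relative path usually resolves.
with open("config.json") as f:
    config = json.load(f)

print(first_arg, second_arg, config)
```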
0
votes
0 answers

Dataproc Serverless: Spark driver has 0 cores and won't finish even though all tasks have completed successfully

I am running CatBoost with PySpark on Dataproc Serverless. Everything works perfectly, except that the batch job runs indefinitely even after all tasks have completed. I have tried os._exit(0) and spark.stop() to stop Spark manually, but it didn't…
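One hedged diagnostic, based on standard Python process semantics rather than anything Dataproc-specific: a Python process only exits once all non-daemon threads finish, so a lingering thread (for example from a training library's thread pool) can keep the driver alive after spark.stop(). Listing the surviving threads can reveal the culprit; everything below is illustrative.

```python
# Diagnostic sketch: enumerate threads still alive after stopping Spark.
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# ... the CatBoost training work would run here ...
spark.stop()

# A non-daemon thread in this listing is the likely reason the driver
# process never exits and the batch stays "running".
for t in threading.enumerate():
    print(t.name, "daemon:", t.daemon)
```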
0
votes
0 answers

"Item not found" while performing a PySpark DataFrame union and writing it to a CSV file

In the code I'm developing, I create a DataFrame and need to union it with an existing one stored in a Google Cloud Storage bucket. If the existing DataFrame is present (count > 0), I perform a union between the DataFrame I obtained and the…
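A minimal sketch of this read-union-write pattern, with placeholder bucket and column names. One design point worth noting: overwriting the same GCS path you are still lazily reading from is a classic cause of "item not found" errors, so the sketch writes to a separate location.

```python
# Read the existing CSV from GCS if it exists, union with the new
# DataFrame, and write the result elsewhere.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("csv-union").getOrCreate()

new_df = spark.createDataFrame([("a", "1"), ("b", "2")], ["key", "value"])
path = "gs://my-bucket/existing-data"

try:
    existing_df = spark.read.option("header", True).csv(path)
    combined = existing_df.unionByName(new_df) if existing_df.count() > 0 else new_df
except AnalysisException:
    # Path does not exist yet; start from the new data alone.
    combined = new_df

# Write to a different path: Spark reads lazily, so overwriting the
# source path mid-job deletes files the read still needs.
combined.write.mode("overwrite").option("header", True).csv(path + "_out")
```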
0
votes
0 answers

Structured Streaming fails but the Dataproc job continues in Running status

We are migrating two Spark Structured Streaming jobs from on-prem to GCP. One streams messages from Kafka and saves them to GCS; the other streams from GCS and saves to BigQuery. Sometimes these jobs fail for some reason,…
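A common cause of this symptom is that the streaming query fails while the driver process stays alive, so the Dataproc job never transitions out of Running. A minimal sketch, assuming the Kafka-to-GCS job with placeholder broker, topic, and bucket names: awaiting the query lets a streaming failure propagate, crash the driver, and fail the job visibly.

```python
# Kafka -> GCS Structured Streaming with failure propagation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-gcs").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load())

query = (df.writeStream.format("parquet")
         .option("path", "gs://my-bucket/output")
         .option("checkpointLocation", "gs://my-bucket/checkpoints")
         .start())

# awaitTermination() raises StreamingQueryException if the query fails,
# which terminates the driver and moves the Dataproc job out of Running.
query.awaitTermination()
```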
0
votes
1 answer

How to use the --properties-file flag in Dataproc?

When submitting a Spark job, gcloud offers a --properties-file option to pass cluster properties and Spark configurations. I am not sure how to use it when running the job.
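A minimal sketch, assuming the file uses one property per line in key=value form (check `gcloud dataproc jobs submit pyspark --help` for the exact accepted format); cluster and file names are placeholders:

```python
# Assumed properties file (job.properties):
#
#   spark.executor.memory=4g
#   spark.sql.shuffle.partitions=64
#
# Submitted with:
#
#   gcloud dataproc jobs submit pyspark job.py \
#       --cluster=my-cluster --region=us-central1 \
#       --properties-file=job.properties
#
# Inside the job, confirm the properties were applied:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.conf.get("spark.executor.memory"))
print(spark.conf.get("spark.sql.shuffle.partitions"))
```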
0
votes
2 answers

Dataproc Workflow (ephemeral cluster) or Dataproc Serverless for batch processing?

GCP Dataproc offers both serverless (Dataproc Serverless) and ephemeral-cluster (Dataproc Workflow Template) options for Spark batch processing. If Dataproc Serverless can hide infrastructure complexity, I wonder what the business use case could be for using…
0
votes
0 answers

Dataproc Spark Data Reads from SQL Server Are Very Slow When Writing Output as Parquet Files

I am reading data from SQL Server tables containing 5M rows and upwards, which takes about an hour to read and write to Parquet using Spark on Dataproc. I increased the number of Dataproc workers to 10, and increased fetchsize and batchsize to 500k, and the…
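Without partitioning options, a JDBC read runs through a single connection and task no matter how many workers the cluster has, which is a frequent cause of hour-long reads. A minimal sketch with placeholder connection details and an assumed numeric partition column:

```python
# Parallel JDBC read from SQL Server using Spark's partitioned read options.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-parallel-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .option("dbtable", "dbo.big_table")
      .option("user", "<user>")
      .option("password", "<password>")
      # Split the read into 16 parallel queries over a numeric column;
      # bounds should roughly match the column's actual min/max.
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "5000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")
      .load())

df.write.mode("overwrite").parquet("gs://my-bucket/output")
```

The SQL Server JDBC driver jar must be on the job's classpath for this to run.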
0
votes
1 answer

Couldn't connect to DPMS while creating a Dataproc cluster using an Airflow operator

I have a Dataproc Metastore service created (in the same project as Composer) and am trying to use it instead of my Hive warehouse. I can run this successfully with gcloud commands, but when I try to use any Airflow operators…
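A minimal sketch, assuming the apache-airflow-providers-google package: the metastore service's resource name goes into cluster_config, mirroring gcloud's metastore flag. Project, region, and service names are placeholders.

```python
# Create a Dataproc cluster attached to a Dataproc Metastore (DPMS) service.
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
)

create_cluster = DataprocCreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    cluster_config={
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Attach the DPMS service by its full resource name.
        "metastore_config": {
            "dataproc_metastore_service": (
                "projects/my-project/locations/us-central1/services/my-dpms"
            )
        },
    },
)
```

If the operator-created cluster still cannot reach the metastore, it is worth checking that the service account the cluster runs as has IAM access to the DPMS service.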
0
votes
1 answer

gcloud dataproc jobs submit with a local keytab / ticket cache file

I am trying to submit a Dataproc job that consumes data from a Kerberized Kafka cluster. The current working solution is to have the JAAS config file and keytab on the machine that runs the dataproc jobs submit command: gcloud dataproc jobs…
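One hedged alternative to a machine-local jaas.conf: ship the keytab with --files so it lands in each container's working directory, and pass the JAAS configuration inline through Kafka's sasl.jaas.config option. Principal, realm, broker, and topic names below are placeholders.

```python
# Submit (placeholders):
#
#   gcloud dataproc jobs submit pyspark job.py \
#       --cluster=my-cluster --region=us-central1 \
#       --files=gs://my-bucket/client.keytab
#
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kerberized-kafka").getOrCreate()

# Keytab shipped via --files is referenced by its base name, since it is
# staged into the container working directory.
jaas = (
    'com.sun.security.auth.module.Krb5LoginModule required '
    'useKeyTab=true keyTab="client.keytab" '
    'principal="svc-account@EXAMPLE.COM";'
)

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9093")
      .option("subscribe", "events")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "GSSAPI")
      .option("kafka.sasl.kerberos.service.name", "kafka")
      # Inline JAAS avoids needing a jaas.conf file on the submit host.
      .option("kafka.sasl.jaas.config", jaas)
      .load())
```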
0
votes
1 answer

Spark job fails on a Dataproc cluster but runs locally

I have a JAR file generated via a Maven project that works fine when I run it locally via java -jar JARFILENAME.jar. However, when I try to run the same JAR file on Dataproc I get the following error: 22/06/27 13:13:45 INFO…
0
votes
1 answer

GCP Dataproc: adding multiple packages (Kafka, MongoDB) while submitting jobs does not work

I'm trying to add the Kafka and MongoDB packages while submitting Dataproc PySpark jobs, but that is failing. So far I've been using only the Kafka package, and that works fine; however, when I try to add the MongoDB package in the command below, it…
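A likely cause: gcloud's --properties flag itself uses commas as its separator, so a comma-separated spark.jars.packages value gets split apart unless you use gcloud's alternate-delimiter syntax (see `gcloud topic escaping`). A minimal sketch; the package coordinates are examples, not the asker's exact versions.

```python
# Submit with both connectors, protecting the inner commas with ^#^:
#
#   gcloud dataproc jobs submit pyspark job.py \
#       --cluster=my-cluster --region=us-central1 \
#       --properties='^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.mongodb.spark:mongo-spark-connector_2.12:3.0.1'
#
# With both packages on the classpath, each connector is usable:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

kafka_df = (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())
kafka_df.printSchema()
```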
0
votes
1 answer

Trigger spark-submit jobs from Airflow on a Dataproc cluster without SSH

Currently I am executing my spark-submit commands in Airflow over SSH using BashOperator, but our client does not allow us to SSH into the cluster. Is it possible to execute the spark-submit command without SSHing into the cluster from…
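Yes: Dataproc exposes a jobs API, so Airflow can submit the job without touching the master node. A minimal sketch, assuming the apache-airflow-providers-google package, with placeholder project, region, cluster, and file names:

```python
# Submit a PySpark job through the Dataproc API instead of SSH.
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

submit_job = DataprocSubmitJobOperator(
    task_id="run_spark_job",
    project_id="my-project",
    region="us-central1",
    job={
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/job.py",
            "args": ["--date", "{{ ds }}"],
        },
    },
)
```

This only requires the Airflow connection's service account to have Dataproc job-submission permissions, not network access to the cluster nodes.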
0
votes
1 answer

Where does GCP Dataproc store notebooks?

I created a Spark cluster using Dataproc with a Jupyter Notebook attached. Then I deleted the cluster and assumed the notebooks were gone. However, after creating another cluster (connected to the same bucket), I can see my old notebooks. Does it…
0
votes
1 answer

Cannot open Jupyter Notebook on Dataproc

I created a GCP Dataproc cluster with Component Gateway enabled and the Anaconda and Jupyter components selected, but when I try to open a Jupyter notebook, the following message pops up: What can I do?