Questions tagged [dataproc]
130 questions
0
votes
1 answer
Connect PySpark session to DataProc
I'm trying to connect a PySpark session running locally to a Dataproc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm…

oneextrafact
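For the local-session question above, a minimal sketch of the Spark properties that let a local PySpark session read gs:// paths. Assumptions not stated in the question: the GCS connector jar is already on the driver classpath, and the key-file path is a placeholder.

```python
# Sketch only: Spark properties for reading gs:// paths from a local session.
# Assumes the gcs-connector jar is on the driver classpath and that
# /path/to/key.json (placeholder) is a valid service-account key file.
gcs_conf = {
    # Register the GCS connector as the handler for gs:// URIs.
    "spark.hadoop.fs.gs.impl":
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.AbstractFileSystem.gs.impl":
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    # Authenticate with a service-account JSON key.
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
        "/path/to/key.json",
}

# Applying the properties when building the session (requires pyspark):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("local-gcs")
# for key, value in gcs_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# df = spark.read.text("gs://my-bucket/some/file.txt")
```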
0
votes
1 answer
How do I set up sparkmagic to work with DataProc through Livy?
I have a DataProc cluster running in GCP. I ran the Livy initialization script for it, and I can access the livy/sessions link through the gateway interface. I have the following set up for my sparkmagic config.json:
{
…

oneextrafact
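For the sparkmagic question, a sketch of the relevant section of `~/.sparkmagic/config.json`, built as a Python dict. The URL is a placeholder: it assumes Livy's default port 8998 is reachable from the machine running Jupyter, e.g. through an SSH tunnel to the cluster.

```python
import json

# Sketch of the kernel_python_credentials section of sparkmagic's config.json.
# The URL is a placeholder; it assumes Livy (default port 8998) is reachable
# locally, e.g. via an SSH tunnel to the Dataproc master.
sparkmagic_config = {
    "kernel_python_credentials": {
        "username": "",
        "password": "",
        "url": "http://localhost:8998",  # tunnel endpoint to Livy
        "auth": "None",
    },
    # Give slow clusters more time before sparkmagic gives up on the session.
    "livy_session_startup_timeout_seconds": 120,
}
config_json = json.dumps(sparkmagic_config, indent=2)
```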
0
votes
1 answer
How to view output files from Dataproc job on Google Cloud Platform
How can I view the contents of the output files from my dataproc job?
Is this something I need to change in the code I've written for the dataproc .jar file?
This is my storage bucket for the output of the job

dvb
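For the output-files question, a sketch of where Spark/Hadoop job output lands and how to read it. The bucket name and prefix are placeholders; a job typically writes one `part-*` file per task plus a `_SUCCESS` marker.

```python
# Sketch of inspecting Dataproc job output in GCS (assumed layout: the job
# wrote its results under gs://<bucket>/output/). The "output" is the set of
# part-* files, not a single file.
bucket = "my-bucket"   # placeholder bucket name
prefix = "output/"     # placeholder output directory

# From a shell, the quickest check is:
#   gsutil ls  gs://my-bucket/output/          # list the part files
#   gsutil cat gs://my-bucket/output/part-*    # print their contents
#
# Or with the (non-stdlib) google-cloud-storage client:
# from google.cloud import storage
# for blob in storage.Client().list_blobs(bucket, prefix=prefix):
#     print(blob.name)

output_glob = f"gs://{bucket}/{prefix}part-*"
```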
0
votes
1 answer
How to add bigquery-connector to an existing cluster on dataproc
I've just started using Dataproc to do machine learning on big data in BigQuery. When I try to run this code:
df = spark.read.format('bigquery').load('bigquery-public-data.samples.shakespeare')
I get an error that looks something like this…

Kerem Tatlıcı
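For the connector question above, a sketch of attaching the published spark-bigquery connector jar per job instead of recreating the cluster. The jar path is Google's public one; cluster and region names would be your own, and the Scala suffix (`_2.12`) has to match the cluster's Spark build.

```python
# Sketch: attach the spark-bigquery connector jar at job-submission time
# rather than reinstalling the cluster. The gs://spark-lib path is Google's
# published location; the _2.12 suffix must match the cluster's Scala version.
bq_jar = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"

# gcloud side (cluster/region are placeholders):
#   gcloud dataproc jobs submit pyspark my_job.py \
#       --cluster=my-cluster --region=us-east1 \
#       --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
#
# Or inside the session, before any read (requires pyspark):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.config("spark.jars", bq_jar).getOrCreate()
# df = spark.read.format("bigquery").load(
#     "bigquery-public-data.samples.shakespeare")
```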
0
votes
1 answer
Google Dataproc pySpark slow on public BigQuery table
I am trying to work with PySpark on this Google public BigQuery table (table size: 268.42 GB, number of rows: 611,647,042). I set the region of the cluster to US (the same as the BigQuery table), but the code is extremely slow even when using…

frebls
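For the slow-read question, a sketch of pushing work into BigQuery rather than Spark: the connector's `filter` option and a narrow `select` let row filtering and column pruning happen server-side instead of after a full 268 GB scan. The table and column names below are placeholders, not taken from the question.

```python
# Sketch (assumes the spark-bigquery connector is available on the cluster).
# Scanning the whole table and filtering in Spark is the usual cause of
# slowness; the connector can push a row filter and column pruning down to
# BigQuery. Table and column names here are placeholders.
read_options = {
    "table": "bigquery-public-data.samples.wikipedia",
    "filter": "language = 'en'",  # evaluated by BigQuery, not by Spark
}

# df = (spark.read.format("bigquery")
#       .option("filter", read_options["filter"])
#       .load(read_options["table"])
#       .select("title", "views"))  # narrow select -> column pruning
```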
0
votes
0 answers
pyspark - how to run and schedule streaming jobs in dataproc hosted on GCP
I am trying to write PySpark code that streams data from a Delta table and continuously merges it into the final Delta target, with an interval of 10-15 minutes between each cycle.
I have written a simple PySpark script and am submitting the job in…

Rak
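For the scheduling question, a sketch of running the merge on a fixed cadence from inside one long-running job, using a `processingTime` trigger plus `foreachBatch`. The paths and the `upsert_to_target` merge function are hypothetical, and the sketch assumes Delta Lake is installed on the cluster.

```python
# Sketch of a fixed-interval streaming merge. Assumptions: Delta Lake is on
# the cluster, and upsert_to_target / all gs:// paths are placeholders.
trigger_interval = "15 minutes"  # micro-batch cadence

# def upsert_to_target(batch_df, batch_id):
#     # MERGE each micro-batch into the final Delta target.
#     target = DeltaTable.forPath(spark, "gs://bucket/final_target")
#     (target.alias("t")
#            .merge(batch_df.alias("s"), "t.id = s.id")
#            .whenMatchedUpdateAll()
#            .whenNotMatchedInsertAll()
#            .execute())
#
# (spark.readStream.format("delta").load("gs://bucket/source_table")
#       .writeStream
#       .foreachBatch(upsert_to_target)
#       .trigger(processingTime=trigger_interval)
#       .option("checkpointLocation", "gs://bucket/checkpoints/merge")
#       .start())
```

A `processingTime` trigger keeps one job alive and re-runs the batch on the chosen interval, which avoids the cold-start cost of scheduling a fresh Dataproc job every cycle.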
0
votes
0 answers
Problems running Spark on GCP
We run a number of scripts for every release of our platform, and we want to automate these runs with Snakemake. The plan is to fire up a VM on Google Cloud and run Snakemake there, where the locations of the input/output files are read…

irenels
0
votes
1 answer
gcloud dataproc clusters list filter by !=
How do I filter dataproc clusters using a != (not equal to)? I've tried:
gcloud dataproc clusters list --region=us-east4 --project= --filter="labels.disposition!=permanent"
ERROR: (gcloud.dataproc.clusters.list) INVALID_ARGUMENT:…

schirayu
0
votes
1 answer
dataproc create cluster gcloud equivalent command in python
How do I replicate the following gcloud command in python?
gcloud beta dataproc clusters create spark-nlp-cluster \
--region global \
--metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \
--worker-machine-type…

Machine Learning
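For the question above, a sketch of the gcloud command's Python equivalent using the google-cloud-dataproc client library (`pip install google-cloud-dataproc`; not stdlib). Field names follow the v1 `Cluster` message; the project, region, worker count, and machine type are placeholders, since the original command is truncated.

```python
# Sketch of the cluster spec as a plain dict, mirroring the gcloud flags in
# the question. Worker count and machine type are placeholders because the
# original command is cut off.
cluster_spec = {
    "project_id": "my-project",  # placeholder
    "cluster_name": "spark-nlp-cluster",
    "config": {
        "gce_cluster_config": {
            "metadata": {
                "PIP_PACKAGES": "google-cloud-storage spark-nlp==2.5.3",
            },
        },
        "worker_config": {
            "num_instances": 2,                   # placeholder
            "machine_type_uri": "n1-standard-4",  # placeholder
        },
    },
}

# Submitting it with the client library (requires google-cloud-dataproc):
# from google.cloud import dataproc_v1
# client = dataproc_v1.ClusterControllerClient(
#     client_options={"api_endpoint": "us-east1-dataproc.googleapis.com:443"})
# operation = client.create_cluster(
#     project_id=cluster_spec["project_id"], region="us-east1",
#     cluster=cluster_spec)
# operation.result()  # blocks until the cluster is running
```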
-1
votes
1 answer
What's the difference between a Dataproc cluster on GKE vs Compute Engine?
We can now create Dataproc clusters using Compute Engine or GKE. What are the major advantages of creating a cluster on GKE vs Compute Engine? We have hit an "insufficient resources in zone" error multiple times while creating clusters on…

Nishit patel