Questions tagged [dataproc]

130 questions
0 votes, 1 answer

Connect PySpark session to DataProc

I'm trying to connect a PySpark session running locally to a Dataproc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm…
asked by oneextrafact
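A minimal sketch of what a working local setup could look like: the GCS connector jar lets a local SparkSession resolve gs:// paths directly. The connector version, key-file path, and bucket below are assumptions, not details from the question.

```python
# Sketch: configure a local SparkSession so it can read gs:// paths directly.
# The connector coordinate and the key-file path are assumptions; adjust both.
GCS_CONNECTOR = "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.21"

spark_conf = {
    "spark.jars.packages": GCS_CONNECTOR,
    # The GCS connector registers this filesystem implementation for gs:// URIs.
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    # Authenticate with a service-account key file (path is hypothetical).
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/key.json",
}

def build_session(conf):
    """Apply the conf to a SparkSession builder (requires pyspark installed)."""
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.master("local[*]").appName("gcs-adhoc")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# spark = build_session(spark_conf)
# df = spark.read.parquet("gs://my-bucket/path/data.parquet")  # bucket is hypothetical
```

With this, the same read code works unchanged when the script is later submitted to the cluster, since Dataproc ships the GCS connector preinstalled.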
0 votes, 1 answer

How do I set up sparkmagic to work with DataProc through Livy?

I have a DataProc cluster running in GCP. I ran the Livy initialization script for it, and I can access the livy/sessions link through the gateway interface. I have the following set up for my sparkmagic config.json: { …
asked by oneextrafact
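For reference, a sketch of the config.json fragment sparkmagic reads: the kernel credentials block points at Livy's endpoint. The URL below assumes an SSH tunnel forwarding local port 8998 (Livy's default) to the cluster master; your host will differ.

```python
import json

# Sketch of the sparkmagic config.json fragment pointing at Livy on the
# Dataproc master. The localhost URL assumes an SSH tunnel to the master
# node; substitute your own reachable Livy endpoint.
sparkmagic_config = {
    "kernel_python_credentials": {
        "username": "",
        "password": "",
        "url": "http://localhost:8998",  # Livy listens on 8998 by default
        "auth": "None",
    },
    "livy_session_startup_timeout_seconds": 120,
}

print(json.dumps(sparkmagic_config, indent=2))
```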
0 votes, 1 answer

How to view output files from Dataproc job on Google Cloud Platform

How can I view the contents of the output files from my Dataproc job? Is this something I need to change in the code I've written for the Dataproc .jar file? This is my storage bucket for the output of the job
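One thing worth knowing here: Spark and Hadoop jobs usually write a directory of part files rather than a single output file. A small sketch, with a hypothetical bucket and prefix:

```python
# Sketch: a job's output directory typically holds files named part-00000,
# part-00001, etc. Bucket and prefix names here are hypothetical.
def part_file_pattern(bucket, prefix):
    """Return the gs:// glob that matches a job's part files."""
    return f"gs://{bucket}/{prefix}/part-*"

pattern = part_file_pattern("my-output-bucket", "job-output")
print(pattern)  # gs://my-output-bucket/job-output/part-*

# From a terminal you could then inspect them with, e.g.:
#   gsutil ls  gs://my-output-bucket/job-output/
#   gsutil cat gs://my-output-bucket/job-output/part-*
```

So nothing needs to change in the jar itself; the contents are readable straight out of the bucket.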
0 votes, 1 answer

How to add bigquery-connector to an existing cluster on Dataproc

I've just started to use Dataproc for doing machine learning on big data in BigQuery. When I try to run this code: df = spark.read.format('bigquery').load('bigquery-public-data.samples.shakespeare') I get an error with some part like this…
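An error on format('bigquery') usually means the spark-bigquery connector jar isn't on the classpath. One hedged sketch: pull the connector in at session start rather than reinstalling the cluster. The connector version below is an assumption; pick one matching your Spark/Scala version.

```python
# Sketch: attach the spark-bigquery connector when the session is built,
# instead of adding it to an existing cluster. Version is an assumption.
BQ_CONNECTOR = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2"

def build_session():
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .appName("bq-read")
        .config("spark.jars.packages", BQ_CONNECTOR)
        .getOrCreate()
    )

# spark = build_session()
# df = spark.read.format("bigquery") \
#          .load("bigquery-public-data.samples.shakespeare")
```

Alternatively, the jar can be attached per job at submit time (for example via the --jars flag of gcloud dataproc jobs submit pyspark), which also avoids touching the existing cluster.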
0 votes, 1 answer

Google Dataproc pySpark slow on public BigQuery table

I am trying to work with pySpark on this public BigQuery table (table size: 268.42 GB; number of rows: 611,647,042). I set the region of the cluster to US (the same as the BigQuery table), but the code is extremely slow even when using…
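On a table of that size, the amount of data streamed to Spark matters more than the region. The connector can prune columns and push filters down to BigQuery, so reading only what is needed can shrink the scan dramatically. A sketch with hypothetical column names:

```python
# Sketch: push a row filter down to BigQuery and select only needed columns.
# Column names below are hypothetical, not from the actual public table.
read_options = {
    # The connector forwards this predicate to BigQuery, so only matching
    # rows are streamed into Spark at all.
    "filter": "created_date >= '2020-01-01'",
}

# df = (spark.read.format("bigquery")
#         .options(**read_options)
#         .load("bigquery-public-data.some_dataset.some_table")
#         .select("id", "created_date"))   # column pruning
print(read_options["filter"])
```

Without the filter option and the narrow select, Spark may pull most of the 268 GB across the wire regardless of cluster region.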
0 votes, 0 answers

pyspark - how to run and schedule streaming jobs in dataproc hosted on GCP

I am trying to write a PySpark job that streams data from a Delta table and continuously merges it into a final Delta target, with an interval of 10-15 minutes between cycles. I have written a simple PySpark script and am submitting the job in…
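Rather than externally rescheduling a batch job every cycle, one option is a single Structured Streaming query with a processing-time trigger, doing the merge inside foreachBatch. A sketch with hypothetical paths and table layout:

```python
# Sketch: a long-running streaming query that fires every 15 minutes and
# merges each micro-batch into the final target. Paths are hypothetical.
TRIGGER_INTERVAL = "15 minutes"

def merge_batch(batch_df, batch_id):
    # Placeholder for the Delta MERGE against the final target, e.g.:
    #   DeltaTable.forPath(spark, "gs://bucket/final").alias("t") \
    #       .merge(batch_df.alias("s"), "t.id = s.id") \
    #       .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
    pass

# query = (spark.readStream.format("delta").load("gs://bucket/source")
#            .writeStream
#            .foreachBatch(merge_batch)
#            .trigger(processingTime=TRIGGER_INTERVAL)
#            .option("checkpointLocation", "gs://bucket/checkpoints/merge")
#            .start())
print(TRIGGER_INTERVAL)
```

The checkpoint location gives exactly-once bookkeeping across restarts, so the Dataproc job only needs to be resubmitted if the cluster itself goes away.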
0 votes, 0 answers

Problems running Spark on GCP

We run a number of scripts for every release of our platform, and we want to automate the running of these scripts with Snakemake. The plan is to fire up a VM on Google Cloud and run Snakemake there, where the locations of the input/output files are read…
0 votes, 1 answer

gcloud dataproc clusters list filter by !=

How do I filter dataproc clusters using a != (not equal to)? I've tried: gcloud dataproc clusters list --region=us-east4 --project= --filter="labels.disposition!=permanent" ERROR: (gcloud.dataproc.clusters.list) INVALID_ARGUMENT:…
asked by schirayu
0 votes, 1 answer

dataproc create cluster gcloud equivalent command in python

How do I replicate the following gcloud command in python? gcloud beta dataproc clusters create spark-nlp-cluster \ --region global \ --metadata 'PIP_PACKAGES=google-cloud-storage spark-nlp==2.5.3' \ --worker-machine-type…
asked by Machine Learning
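A sketch of what the gcloud command could translate to with the google-cloud-dataproc client library. Field names follow the v1 API; the project id, region, and machine type below are assumptions standing in for the truncated flags.

```python
# Sketch: the cluster spec as a plain dict, mirroring the gcloud flags from
# the question. Project id and machine type are hypothetical placeholders.
def build_cluster_spec(project_id):
    return {
        "project_id": project_id,
        "cluster_name": "spark-nlp-cluster",
        "config": {
            "gce_cluster_config": {
                "metadata": {
                    # Mirrors --metadata 'PIP_PACKAGES=...' from the question.
                    "PIP_PACKAGES": "google-cloud-storage spark-nlp==2.5.3",
                },
            },
            "worker_config": {
                "machine_type_uri": "n1-standard-4",  # assumed machine type
            },
        },
    }

cluster = build_cluster_spec("my-project")

# Submitting it (requires google-cloud-dataproc and credentials):
# from google.cloud import dataproc_v1
# client = dataproc_v1.ClusterControllerClient(
#     client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"})
# op = client.create_cluster(request={
#     "project_id": "my-project", "region": "us-central1", "cluster": cluster})
# op.result()  # blocks until the cluster is up
```

create_cluster returns a long-running operation, so op.result() is what actually waits for cluster creation, matching gcloud's synchronous behavior.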
-1 votes, 1 answer

What's the difference between a Dataproc cluster on GKE vs Compute Engine?

We can now create Dataproc clusters using Compute Engine or GKE. What are the major advantages of creating a cluster on GKE vs Compute Engine? We have hit the "insufficient resources in zone" error multiple times while creating a cluster on…