Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
0
votes
1 answer

Read data from BigQuery and/or Cloud Storage (GCS) into Dataproc

I am reading data from BigQuery into a Dataproc Spark cluster. If the data in the BigQuery table was, in my case, originally loaded from GCS, is it better to read the data from GCS directly into the Spark cluster, since the BigQuery connector for Dataproc…
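
A minimal PySpark sketch of the two read paths being compared, assuming the Cloud Storage connector (present on Dataproc by default) and the spark-bigquery connector are available on the cluster; the bucket, dataset and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-vs-gcs-read").getOrCreate()

# Path 1: read the original files straight from Cloud Storage.
gcs_df = spark.read.csv("gs://my-bucket/exports/*.csv",
                        header=True, inferSchema=True)

# Path 2: read the same data back out of the BigQuery table via the connector.
bq_df = (spark.read.format("bigquery")
         .option("table", "my-project.my_dataset.my_table")
         .load())

print(gcs_df.count(), bq_df.count())
```

If the GCS files are already in a Spark-friendly format, reading them directly skips the extra hop through the BigQuery connector; the connector path is mainly worthwhile when the table has been transformed since it was loaded.
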
0
votes
1 answer

"ImportError: no module named pandas" when trying to submit a job on Dataproc

I'm running a script using the Python Client Library for Google Cloud Dataproc that automatically provisions clusters, submits jobs, etc. But while trying to submit a job, it returns with ImportError: no module named pandas. I import pandas, as well…
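
A small diagnostic sketch, assuming nothing beyond PySpark itself: it checks whether pandas is importable on the driver and on each executor, which helps confirm that the missing module is on the worker nodes rather than on the machine submitting the job.

```python
import importlib.util

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-availability-check").getOrCreate()
sc = spark.sparkContext

def has_pandas(_):
    # Runs inside an executor: report whether pandas is importable there.
    try:
        import pandas  # noqa: F401
        return ["pandas available"]
    except ImportError as exc:
        return ["missing: {}".format(exc)]

print("driver pandas:", "ok" if importlib.util.find_spec("pandas") else "missing")
results = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
             .mapPartitions(has_pandas)
             .collect())
print(sorted(set(results)))
```

The usual remedy is to install pandas on every node of the cluster (for example with an initialization action that runs pip install pandas), not only in the environment that submits the job.
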
0
votes
1 answer

PySpark via Dataproc + SSL Connection to Cloud SQL

I have a Cloud SQL instance storing data in a database, and I have checked the option for this Cloud SQL instance to block all unencrypted connections. When I select this option, I am given three SSL certificates - a server certificate, a client…
charlesreid1 • 4,360 • 4 • 30 • 52
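
A hedged sketch of reading the Cloud SQL (MySQL) database over an enforced-SSL connection from PySpark. It assumes the MySQL JDBC driver is on the cluster's classpath and that the downloaded certificates have been converted into Java keystores present on every node; the connection options are standard MySQL Connector/J settings, and all hosts, paths and passwords are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloudsql-ssl-read").getOrCreate()

jdbc_url = (
    "jdbc:mysql://10.0.0.5:3306/mydb"
    "?useSSL=true&requireSSL=true"
    "&clientCertificateKeyStoreUrl=file:/etc/ssl/cloudsql/client-keystore.jks"
    "&clientCertificateKeyStorePassword=changeit"
    "&trustCertificateKeyStoreUrl=file:/etc/ssl/cloudsql/server-truststore.jks"
    "&trustCertificateKeyStorePassword=changeit"
)

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "my_table")
      .option("user", "sql_user")
      .option("password", "sql_password")
      .load())
df.show(5)
```

An alternative worth considering is the Cloud SQL proxy initialization action from the Dataproc initialization-actions repository, which tunnels the connection so the certificates do not have to be distributed by hand.
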
0
votes
1 answer

Error when submitting a PySpark job to a Dataproc cluster (job not found)

I have a script based on a Python client library from GCP that is meant to provision clusters and submit jobs to them. When I run the script, it successfully uploads files to Google Cloud Storage, creates a cluster, and submits a job. The error comes in…
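
A hedged sketch of the submit-then-poll flow with the REST-style Python client (googleapiclient); the project, region, cluster and file names are placeholders. A detail worth double-checking in "job not found" situations is that the submit call and the later get call use exactly the same region value.

```python
import time

import googleapiclient.discovery

project_id = "my-project"
region = "global"
cluster_name = "my-cluster"

dataproc = googleapiclient.discovery.build("dataproc", "v1")

job_body = {
    "job": {
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {"mainPythonFileUri": "gs://my-bucket/jobs/my_job.py"},
    },
}
result = dataproc.projects().regions().jobs().submit(
    projectId=project_id, region=region, body=job_body).execute()
job_id = result["reference"]["jobId"]

# Poll the same job (same project and region) until it reaches a terminal state.
while True:
    job = dataproc.projects().regions().jobs().get(
        projectId=project_id, region=region, jobId=job_id).execute()
    state = job["status"]["state"]
    print("job {} is {}".format(job_id, state))
    if state in ("DONE", "ERROR", "CANCELLED"):
        break
    time.sleep(10)
```
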
0
votes
1 answer

Adding machine-type parameters in Google Cloud Python SDK create_cluster() function

Google Cloud's Python docs have a script (python-docs-samples/dataproc/submit_job_to_cluster.py) that has the following function: def create_cluster(dataproc, project, zone, region, cluster_name): print('Creating cluster...') zone_uri =…
claudiadast • 419 • 1 • 9 • 18
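
A hedged sketch of how that sample's create_cluster() could be extended to take machine types; the masterConfig/workerConfig fields are part of the Dataproc v1 cluster config, while the specific machine types and worker count below are illustrative defaults.

```python
def create_cluster(dataproc, project, zone, region, cluster_name,
                   master_type="n1-standard-4", worker_type="n1-standard-4",
                   num_workers=2):
    print('Creating cluster...')
    zone_uri = (
        'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(
            project, zone))
    cluster_data = {
        'projectId': project,
        'clusterName': cluster_name,
        'config': {
            'gceClusterConfig': {'zoneUri': zone_uri},
            'masterConfig': {
                'numInstances': 1,
                'machineTypeUri': master_type,
            },
            'workerConfig': {
                'numInstances': num_workers,
                'machineTypeUri': worker_type,
            },
        },
    }
    return dataproc.projects().regions().clusters().create(
        projectId=project, region=region, body=cluster_data).execute()
```
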
0
votes
1 answer

How can I securely transfer my data from on-prem HDFS to Google Cloud Storage?

I have a bunch of data in an on-prem HDFS installation. I want to move some of it to Google Cloud (Cloud Storage), but I have a few concerns: How do I actually move the data? I am worried about moving it over the public internet. What is the best…
James • 2,321 • 14 • 30
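
A hedged sketch of one common approach: install the Cloud Storage connector on the on-prem cluster and push the data with Hadoop DistCp, which travels over TLS to the gs:// endpoint. The paths, bucket and mapper count are placeholders, and the network question (VPN/Interconnect versus the public internet) is a separate decision from the copy mechanism.

```python
import subprocess

# Equivalent to running `hadoop distcp ...` from an edge node of the on-prem
# cluster once the GCS connector and its credentials are configured.
cmd = [
    "hadoop", "distcp",
    "-m", "50",                        # number of parallel copy tasks
    "hdfs:///data/warehouse/events",   # source directory in on-prem HDFS
    "gs://my-landing-bucket/events",   # destination bucket/prefix in GCS
]
subprocess.check_call(cmd)
```
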
0
votes
1 answer

Running more than one Spark streaming job in Google Dataproc

How do I run more than one Spark streaming job in a Dataproc cluster? I created multiple queues using capacity-scheduler.xml, but now I will need 12 queues if I want to run 12 different streaming-aggregation applications. Any idea?
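
A hedged sketch of pinning each streaming application to its own YARN queue with spark.yarn.queue; in Dataproc's default client deploy mode this takes effect when the application starts, and the same property can instead be passed at submission time (for example via --properties on gcloud dataproc jobs submit). The app and queue names are placeholders.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setAppName("stream-aggregate-01")
        .set("spark.yarn.queue", "streaming-q1"))   # one queue per streaming app

spark = SparkSession.builder.config(conf=conf).getOrCreate()
# ... define and start this application's streaming logic here ...
```
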
0
votes
2 answers

Why is the Hadoop job slower in the cloud (with multi-node clustering) than on a normal PC?

I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform (cloud) is a bit slower than running the same jobs on a lower-capacity virtual machine. I am running my Hadoop job on 3-node…
0
votes
1 answer

What is the solution for the error, "JBlas is not a member of package org.apache"?

I tried to solve it from both of these threads (this and this), and it worked for me on my own virtual machine but didn't work in Cloud Dataproc. I did the same process for both of them. But there is still an error in the cloud which is the same as the…
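
A hedged sketch of the usual remedy when a jblas symbol cannot be resolved on the cluster: make the jblas jar available to the job, for example by pointing spark.jars at a copy staged in GCS. The bucket path and jar version are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jblas-classpath-example")
         .config("spark.jars", "gs://my-bucket/jars/jblas-1.2.4.jar")
         .getOrCreate())
# With the jar on the classpath, Scala/Java code that imports org.jblas
# (e.g. DoubleMatrix) can be resolved by the driver and executors.
```
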
0
votes
1 answer

Dataproc PySpark Workers Have no Permission to Use gsutil

Under Dataproc I set up a PySpark cluster with 1 master node and 2 workers. In a bucket I have directories of sub-directories of files. In a Datalab notebook I run import subprocess all_parent_direcotry = subprocess.Popen("gsutil ls…
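
A hedged sketch of listing the "sub-directories" without shelling out to gsutil, using the google-cloud-storage client instead; it assumes the library is installed where the code runs and that the cluster's service account has storage read access. The bucket and prefix are placeholders.

```python
from google.cloud import storage

def list_parent_directories(bucket_name, prefix):
    # Return the immediate "sub-directory" prefixes under the given prefix.
    bucket = storage.Client().get_bucket(bucket_name)
    iterator = bucket.list_blobs(prefix=prefix, delimiter="/")
    list(iterator)            # consume the pages so iterator.prefixes is filled
    return sorted(iterator.prefixes)

print(list_parent_directories("my-bucket", "data/"))
```
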
0
votes
1 answer

YARN Reserved Memory Issue

When using the FIFO scheduler with YARN (FIFO is the default, right?), I found out that YARN reserves some memory/CPU to run the application. Our application doesn't need to reserve any of these, since we want a fixed number of cores to do the tasks depending on…
Yong Hyun Kwon • 359 • 1 • 3 • 15
0
votes
2 answers

PySpark RDD sparse matrix multiplication from Scala to Python

I previously posted a question on coordinate matrix multiplication with 9 million rows and 85K columns (Errors for block matrix multiplication in Spark). However, I ran into an out-of-memory issue on Dataproc. I have tried to configure the cluster…
0
votes
1 answer

Is there a way to specify all three resource properties (executor instances, cores and memory) in Spark on YARN (Dataproc)?

I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems. Specifically, I've been struggling to find a way to let Spark application submitters have the freedom to specify the…
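
A hedged sketch of fixing all three resource settings; on Dataproc, dynamic allocation is enabled by default, so a fixed executor count only sticks when it is switched off alongside the other properties. The values below are illustrative for a small cluster, and the same keys can be passed per job with --properties instead of in code.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fixed-resources")
         .config("spark.dynamicAllocation.enabled", "false")
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.executor.instances"))
```
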
0
votes
1 answer

Errors for block matrix multiplication in Spark

I have created a coordinate matrix cmat with 9 million rows and 85K columns. I would like to perform cmat.T * cmat operations. I first converted cmat to block matrix bmat: bmat = cmat.toBlockMatrix(1000, 1000) However, I got errors when performing…
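
A small sketch of the cmat.T * cmat pattern with pyspark.mllib's distributed matrices, on a tiny example; for a 9M x 85K matrix the block sizes and executor memory usually need tuning, but the call sequence is the same.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.appName("block-matrix-multiply").getOrCreate()
sc = spark.sparkContext

entries = sc.parallelize([
    MatrixEntry(0, 0, 1.0),
    MatrixEntry(1, 1, 2.0),
    MatrixEntry(2, 0, 3.0),
])
cmat = CoordinateMatrix(entries)

# multiply() is only defined on BlockMatrix, and the block sizes control the
# memory footprint of each task.
bmat = cmat.toBlockMatrix(rowsPerBlock=2, colsPerBlock=2)
gram = bmat.transpose().multiply(bmat)   # computes cmat.T * cmat
print(gram.numRows(), gram.numCols())
```
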
0
votes
1 answer

Cannot create a cluster with properties using the Dataproc API

I'm trying to create a cluster programmatically in python: import googleapiclient.discovery dataproc = googleapiclient.discovery.build('dataproc', 'v1') zone_uri ='https://www.googleapis.com/compute/v1/projects/{project_id}/zone/{zone}'.format( …
Ajr • 186 • 5
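
A hedged sketch of where cluster properties go in the clusters().create() body: under config.softwareConfig.properties, keyed with a file prefix such as spark: or yarn:. The project, zone and example property values are placeholders.

```python
import googleapiclient.discovery

project_id = "my-project"
region = "global"
zone = "us-central1-a"
zone_uri = ("https://www.googleapis.com/compute/v1/projects/{}/zones/{}"
            .format(project_id, zone))

dataproc = googleapiclient.discovery.build("dataproc", "v1")

cluster_data = {
    "projectId": project_id,
    "clusterName": "my-cluster",
    "config": {
        "gceClusterConfig": {"zoneUri": zone_uri},
        "softwareConfig": {
            "properties": {
                "spark:spark.executor.memory": "4g",
                "yarn:yarn.scheduler.maximum-allocation-mb": "8192",
            },
        },
    },
}

dataproc.projects().regions().clusters().create(
    projectId=project_id, region=region, body=cluster_data).execute()
```
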