Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
0
votes
1 answer

Read data from BigQuery and/or Cloud Storage (GCS) into Dataproc

I am reading data from BigQuery into a Dataproc Spark cluster. If the data in the BigQuery table was, in my case, originally loaded from GCS, is it better to read the data from GCS directly into the Spark cluster, since the BigQuery connector for Dataproc…
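
A minimal PySpark sketch of the two read paths being compared, assuming the Cloud Storage connector (present on Dataproc by default) and the spark-bigquery connector are available on the cluster; the bucket, dataset and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-vs-gcs-read").getOrCreate()

# Path 1: read the original files straight from Cloud Storage.
gcs_df = spark.read.csv("gs://my-bucket/exports/*.csv",
                        header=True, inferSchema=True)

# Path 2: read the same data back out of the BigQuery table via the connector.
bq_df = (spark.read.format("bigquery")
         .option("table", "my-project.my_dataset.my_table")
         .load())

print(gcs_df.count(), bq_df.count())
```

If the GCS files are already in a Spark-friendly format, reading them directly skips the extra hop through the BigQuery connector; the connector path is mainly worthwhile when the table has been transformed since it was loaded.
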
0
votes
1 answer

"ImportError: no module named pandas" when trying to submit a job on Dataproc

I'm running a script using the Python Client Library for Google Cloud Dataproc that automatically provisions clusters, submits jobs, etc. But while trying to submit a job, it returns with ImportError: no module named pandas. I import pandas, as well…
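
A small diagnostic sketch, assuming nothing beyond PySpark itself: it checks whether pandas is importable on the driver and on each executor, which helps confirm that the missing module is on the worker nodes rather than on the machine submitting the job.

```python
import importlib.util

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-availability-check").getOrCreate()
sc = spark.sparkContext

def has_pandas(_):
    # Runs inside an executor: report whether pandas is importable there.
    try:
        import pandas  # noqa: F401
        return ["pandas available"]
    except ImportError as exc:
        return ["missing: {}".format(exc)]

print("driver pandas:", "ok" if importlib.util.find_spec("pandas") else "missing")
results = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
             .mapPartitions(has_pandas)
             .collect())
print(sorted(set(results)))
```

The usual remedy is to install pandas on every node of the cluster (for example with an initialization action that runs pip install pandas), not only in the environment that submits the job.
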
0
votes
1 answer

PySpark via Dataproc + SSL Connection to Cloud SQL

I have a Cloud SQL instance storing data in a database, and I have checked the option for this Cloud SQL instance to block all unencrypted connections. When I select this option, I am given three SSL certificates - a server certificate, a client…
charlesreid1 • 4,360 • 4 • 30 • 52
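
A hedged sketch of reading the Cloud SQL (MySQL) database over an enforced-SSL connection from PySpark. It assumes the MySQL JDBC driver is on the cluster's classpath and that the downloaded certificates have been converted into Java keystores present on every node; the connection options are standard MySQL Connector/J settings, and all hosts, paths and passwords are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloudsql-ssl-read").getOrCreate()

jdbc_url = (
    "jdbc:mysql://10.0.0.5:3306/mydb"
    "?useSSL=true&requireSSL=true"
    "&clientCertificateKeyStoreUrl=file:/etc/ssl/cloudsql/client-keystore.jks"
    "&clientCertificateKeyStorePassword=changeit"
    "&trustCertificateKeyStoreUrl=file:/etc/ssl/cloudsql/server-truststore.jks"
    "&trustCertificateKeyStorePassword=changeit"
)

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "my_table")
      .option("user", "sql_user")
      .option("password", "sql_password")
      .load())
df.show(5)
```

An alternative worth considering is the Cloud SQL proxy initialization action from the Dataproc initialization-actions repository, which tunnels the connection so the certificates do not have to be distributed by hand.
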
0
votes
1 answer

Error when submitting a PySpark job to a Dataproc cluster (job not found)

I have a script based on a Python client library from GCP that is meant to provision clusters and submit jobs to them. When I run the script, it successfully uploads files to Google Cloud Storage, creates a cluster, and submits a job. The error comes in…
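
A hedged sketch of the submit-then-poll flow with the REST-style Python client (googleapiclient); the project, region, cluster and file names are placeholders. A detail worth double-checking in "job not found" situations is that the submit call and the later get call use exactly the same region value.

```python
import time

import googleapiclient.discovery

project_id = "my-project"
region = "global"
cluster_name = "my-cluster"

dataproc = googleapiclient.discovery.build("dataproc", "v1")

job_body = {
    "job": {
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {"mainPythonFileUri": "gs://my-bucket/jobs/my_job.py"},
    },
}
result = dataproc.projects().regions().jobs().submit(
    projectId=project_id, region=region, body=job_body).execute()
job_id = result["reference"]["jobId"]

# Poll the same job (same project and region) until it reaches a terminal state.
while True:
    job = dataproc.projects().regions().jobs().get(
        projectId=project_id, region=region, jobId=job_id).execute()
    state = job["status"]["state"]
    print("job {} is {}".format(job_id, state))
    if state in ("DONE", "ERROR", "CANCELLED"):
        break
    time.sleep(10)
```
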
0
votes
1 answer

Adding machine-type parameters in Google Cloud Python SDK create_cluster() function

Google Cloud's Python docs have a script (python-docs-samples/dataproc/submit_job_to_cluster.py) that has the following function: def create_cluster(dataproc, project, zone, region, cluster_name): print('Creating cluster...') zone_uri =…
claudiadast • 419 • 1 • 9 • 18
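
A hedged sketch of how that sample's create_cluster() could be extended to take machine types; the masterConfig/workerConfig fields are part of the Dataproc v1 cluster config, while the specific machine types and worker count below are illustrative defaults.

```python
def create_cluster(dataproc, project, zone, region, cluster_name,
                   master_type="n1-standard-4", worker_type="n1-standard-4",
                   num_workers=2):
    print('Creating cluster...')
    zone_uri = (
        'https://www.googleapis.com/compute/v1/projects/{}/zones/{}'.format(
            project, zone))
    cluster_data = {
        'projectId': project,
        'clusterName': cluster_name,
        'config': {
            'gceClusterConfig': {'zoneUri': zone_uri},
            'masterConfig': {
                'numInstances': 1,
                'machineTypeUri': master_type,
            },
            'workerConfig': {
                'numInstances': num_workers,
                'machineTypeUri': worker_type,
            },
        },
    }
    return dataproc.projects().regions().clusters().create(
        projectId=project, region=region, body=cluster_data).execute()
```
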
0
votes
1 answer

How can I securely transfer my data from on-prem HDFS to Google Cloud Storage?

I have a bunch of data in an on-prem HDFS installation. I want to move some of it to Google Cloud (Cloud Storage), but I have a few concerns: How do I actually move the data? I am worried about moving it over the public internet. What is the best…
James • 2,321 • 14 • 30
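
A hedged sketch of one common approach: install the Cloud Storage connector on the on-prem cluster and push the data with Hadoop DistCp, which travels over TLS to the gs:// endpoint. The paths, bucket and mapper count are placeholders, and the network question (VPN/Interconnect versus the public internet) is a separate decision from the copy mechanism.

```python
import subprocess

# Equivalent to running `hadoop distcp ...` from an edge node of the on-prem
# cluster once the GCS connector and its credentials are configured.
cmd = [
    "hadoop", "distcp",
    "-m", "50",                        # number of parallel copy tasks
    "hdfs:///data/warehouse/events",   # source directory in on-prem HDFS
    "gs://my-landing-bucket/events",   # destination bucket/prefix in GCS
]
subprocess.check_call(cmd)
```
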
0
votes
1 answer

Running more than one Spark streaming job in Google Dataproc

How do I run more than one Spark streaming job in a Dataproc cluster? I created multiple queues using capacity-scheduler.xml, but now I will need 12 queues if I want to run 12 different streaming-aggregation applications. Any idea?
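
A hedged sketch of pinning each streaming application to its own YARN queue with spark.yarn.queue; in Dataproc's default client deploy mode this takes effect when the application starts, and the same property can instead be passed at submission time (for example via --properties on gcloud dataproc jobs submit). The app and queue names are placeholders.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setAppName("stream-aggregate-01")
        .set("spark.yarn.queue", "streaming-q1"))   # one queue per streaming app

spark = SparkSession.builder.config(conf=conf).getOrCreate()
# ... define and start this application's streaming logic here ...
```
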
0
votes
2 answers

Why is the Hadoop job slower in the cloud (with multi-node clustering) than on a normal PC?

I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform (cloud) is a bit slower than running the same jobs on a lower-capacity virtual machine. I am running my Hadoop job on 3-node…
0
votes
1 answer

What is the solution for the error, "JBlas is not a member of package org.apache"?

I tried to solve it from both of these threads (this and this), and it worked for me on my own virtual machine but didn't work in Cloud Dataproc. I did the same process for both of them. But there is still an error in the cloud which is the same as the…
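
A hedged sketch of the usual remedy when a jblas symbol cannot be resolved on the cluster: make the jblas jar available to the job, for example by pointing spark.jars at a copy staged in GCS. The bucket path and jar version are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jblas-classpath-example")
         .config("spark.jars", "gs://my-bucket/jars/jblas-1.2.4.jar")
         .getOrCreate())
# With the jar on the classpath, Scala/Java code that imports org.jblas
# (e.g. DoubleMatrix) can be resolved by the driver and executors.
```
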
0
votes
1 answer

Dataproc PySpark Workers Have no Permission to Use gsutil

Under Dataproc I set up a PySpark cluster with 1 master node and 2 workers. In a bucket I have directories of sub-directories of files. In a Datalab notebook I run import subprocess all_parent_direcotry = subprocess.Popen("gsutil ls…
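
A hedged sketch of listing the "sub-directories" without shelling out to gsutil, using the google-cloud-storage client instead; it assumes the library is installed where the code runs and that the cluster's service account has storage read access. The bucket and prefix are placeholders.

```python
from google.cloud import storage

def list_parent_directories(bucket_name, prefix):
    # Return the immediate "sub-directory" prefixes under the given prefix.
    bucket = storage.Client().get_bucket(bucket_name)
    iterator = bucket.list_blobs(prefix=prefix, delimiter="/")
    list(iterator)            # consume the pages so iterator.prefixes is filled
    return sorted(iterator.prefixes)

print(list_parent_directories("my-bucket", "data/"))
```
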
0
votes
1 answer

YARN Reserved Memory Issue

When using the FIFO scheduler with YARN (FIFO is the default, right?), I found out that YARN reserves some memory/CPU to run the application. Our application doesn't need to reserve any of these, since we want a fixed number of cores to do the tasks depending on…
Yong Hyun Kwon • 359 • 1 • 3 • 15
0
votes
2 answers

PySpark RDD sparse matrix multiplication from Scala to Python

I previously posted a question on coordinate matrix multiplication with 9 million rows and 85K columns (Errors for block matrix multiplication in Spark). However, I ran into an out-of-memory issue on Dataproc. I have tried to configure the cluster…
0
votes
1 answer

Is there a way to specify all three resource properties (executor instances, cores and memory) in Spark on YARN (Dataproc)?

I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems. Specifically, I've been struggling to find a way to let Spark application submitters have the freedom to specify the…
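
A hedged sketch of fixing all three resource settings; on Dataproc, dynamic allocation is enabled by default, so a fixed executor count only sticks when it is switched off alongside the other properties. The values below are illustrative for a small cluster, and the same keys can be passed per job with --properties instead of in code.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fixed-resources")
         .config("spark.dynamicAllocation.enabled", "false")
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.executor.instances"))
```
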
0
votes
1 answer

Errors for block matrix multiplication in Spark

I have created a coordinate matrix cmat with 9 million rows and 85K columns. I would like to perform cmat.T * cmat operations. I first converted cmat to block matrix bmat: bmat = cmat.toBlockMatrix(1000, 1000) However, I got errors when performing…
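
A small sketch of the cmat.T * cmat pattern with pyspark.mllib's distributed matrices, on a tiny example; for a 9M x 85K matrix the block sizes and executor memory usually need tuning, but the call sequence is the same.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.appName("block-matrix-multiply").getOrCreate()
sc = spark.sparkContext

entries = sc.parallelize([
    MatrixEntry(0, 0, 1.0),
    MatrixEntry(1, 1, 2.0),
    MatrixEntry(2, 0, 3.0),
])
cmat = CoordinateMatrix(entries)

# multiply() is only defined on BlockMatrix, and the block sizes control the
# memory footprint of each task.
bmat = cmat.toBlockMatrix(rowsPerBlock=2, colsPerBlock=2)
gram = bmat.transpose().multiply(bmat)   # computes cmat.T * cmat
print(gram.numRows(), gram.numCols())
```
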
0
votes
1 answer

Cannot create a cluster with properties using the Dataproc API

I'm trying to create a cluster programmatically in python: import googleapiclient.discovery dataproc = googleapiclient.discovery.build('dataproc', 'v1') zone_uri ='https://www.googleapis.com/compute/v1/projects/{project_id}/zone/{zone}'.format( …
Ajr • 186 • 5
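
A hedged sketch of where cluster properties go in the clusters().create() body: under config.softwareConfig.properties, keyed with a file prefix such as spark: or yarn:. The project, zone and example property values are placeholders.

```python
import googleapiclient.discovery

project_id = "my-project"
region = "global"
zone = "us-central1-a"
zone_uri = ("https://www.googleapis.com/compute/v1/projects/{}/zones/{}"
            .format(project_id, zone))

dataproc = googleapiclient.discovery.build("dataproc", "v1")

cluster_data = {
    "projectId": project_id,
    "clusterName": "my-cluster",
    "config": {
        "gceClusterConfig": {"zoneUri": zone_uri},
        "softwareConfig": {
            "properties": {
                "spark:spark.executor.memory": "4g",
                "yarn:yarn.scheduler.maximum-allocation-mb": "8192",
            },
        },
    },
}

dataproc.projects().regions().clusters().create(
    projectId=project_id, region=region, body=cluster_data).execute()
```
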