Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
0 votes, 1 answer

YARN cluster mode reduces number of executor instances

I'm provisioning a Google Cloud Dataproc cluster in the following way: gcloud dataproc clusters create spark --async --image-version 1.2 \ --master-machine-type n1-standard-1 --master-boot-disk-size 10 \ --worker-machine-type n1-highmem-8…
Martin Studer · 2,213 · 1 · 18 · 23
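One way to probe the question above is to pin the executor count explicitly at submit time: in YARN cluster mode the Spark driver itself runs in a YARN container on the workers, which can leave room for fewer executors than client mode does. A minimal sketch, assuming the cluster from the question is named `spark`; the main class and jar path are hypothetical placeholders:

```shell
# Pin executor count explicitly so YARN's container math is visible
# (class and jar below are hypothetical; cluster name from the question).
SUBMIT=(gcloud dataproc jobs submit spark
  --cluster=spark
  --properties=spark.submit.deployMode=cluster,spark.executor.instances=4
  --class=com.example.Main
  --jars=gs://my-bucket/app.jar)
printf '%s\n' "${SUBMIT[*]}"
```

If the pinned count still is not honored, comparing `spark.executor.memory` against the per-node YARN memory (minus the driver's container in cluster mode) is the next thing to check.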
0 votes, 2 answers

passing properties argument for gcloud dataproc jobs submit pyspark

I am trying to submit a pyspark job to Google Cloud Dataproc via the command line. These are my arguments: gcloud dataproc jobs submit pyspark --cluster mongo-load --properties org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 mongo_load.py I am…
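A likely issue with the command quoted above: `--properties` expects comma-separated `key=value` pairs, and the Maven coordinate is passed with no key. A hedged sketch of the corrected form, using the `spark.jars.packages` Spark property (cluster and script names taken from the question):

```shell
# --properties takes key=value pairs; the connector coordinate needs the
# spark.jars.packages key to be picked up by Spark.
CMD=(gcloud dataproc jobs submit pyspark
  --cluster=mongo-load
  --properties=spark.jars.packages=org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
  mongo_load.py)
printf '%s\n' "${CMD[*]}"
```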
0 votes, 1 answer

Google Cloud Dataproc - Encryption in transit?

Does anyone know what the following from the FAQs(https://cloud.google.com/dataproc/docs/resources/faq) actually means? "Data can be user encrypted in transit to and from a cluster, upon cluster creation or job submission." I can find no…
K2J · 2,573 · 6 · 27 · 34
0 votes, 1 answer

Calling GCP Translate API within Dataproc pyspark map

I am trying to call the language detection method of the Translate client API from pyspark for each row in a file. I created a map method as follows, but the job seems to just freeze with no error. If I remove the call to the Translate API it…
Adam Taub · 69 · 4
0 votes, 1 answer

Connect tableau to Google Dataproc

I am wondering how to connect Tableau to Google Dataproc via Spark SQL. I am trying to connect using the external IP address of the master node and a port, but it does not work.
Grzegorzg · 659 · 1 · 4 · 17
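For the Tableau question above, connecting to the master's external IP usually fails because nothing on the cluster listens for JDBC/ODBC clients by default. A hedged sketch: assuming the Spark Thrift Server has been started on the master (HiveServer2-compatible, conventionally port 10000) and that the master instance is named `spark-m` in zone `us-central1-a` (both assumptions), an SSH tunnel gives Tableau a local endpoint:

```shell
# Forward the Thrift Server port over SSH instead of exposing it publicly
# (instance name, zone, and the running Thrift Server are all assumptions).
MASTER=spark-m
ZONE=us-central1-a
TUNNEL=(gcloud compute ssh "$MASTER" --zone="$ZONE" -- -L 10000:localhost:10000 -N)
printf '%s\n' "${TUNNEL[*]}"
```

Tableau's Spark SQL connector can then be pointed at `localhost:10000` while the tunnel is up.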
0 votes, 1 answer

Cloud Dataproc initialization actions - port assignments

We want to deploy a number of applications into our cluster (Tez, Hue, Presto, Zeppelin and Oozie). A quick scan of the repo suggests that some of the ports will conflict by default (Zeppelin and Presto). Is this a bug? How do I go about…
K2J · 2,573 · 6 · 27 · 34
0 votes, 1 answer

Any way to recover a deleted VM instance created in a Dataproc cluster

If a VM instance gets deleted accidentally, is there any way to recover it in a Dataproc cluster? If there is no way to recover a deleted VM instance, can we create a new VM instance and connect it to an existing Dataproc…
Balajee Venkatesh · 1,041 · 2 · 18 · 39
0 votes, 1 answer

Pass packages to pyspark running on dataproc from airflow?

We have an Airflow DAG that involves running a pyspark job on Dataproc. We need a jdbc driver during the job, which I'd normally pass to the dataproc submit command: gcloud dataproc jobs submit pyspark \ --cluster my-cluster \ --properties…
Nira · 469 · 1 · 6 · 16
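The excerpt above truncates right at `--properties`, which is where this usually gets tricky: a `spark.jars.packages` value listing several Maven coordinates contains commas, so the default comma-separated `key=value` parsing breaks. gcloud's alternate-delimiter escaping (documented under `gcloud topic escaping`) works around this. A sketch, with hypothetical cluster, coordinates, and script names:

```shell
# ^#^ switches the pair delimiter to '#', so commas may appear inside the
# spark.jars.packages value (all names below are hypothetical examples).
CMD=(gcloud dataproc jobs submit pyspark
  --cluster=my-cluster
  '--properties=^#^spark.jars.packages=org.postgresql:postgresql:42.2.5,com.example:other:1.0#spark.executor.memory=4g'
  my_job.py)
printf '%s\n' "${CMD[*]}"
```

Whatever form works on the CLI is then the value to mirror in the Airflow operator's properties argument.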
0 votes, 2 answers

Python - ImportError: No module Cloud Dataproc

I am having an issue with Google Cloud Dataproc and the structure of my python project. I have a number of files, all in the same folder, which call one another through import. The overall program runs fine locally. However, when I…
Kl1 · 3 · 2
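A common cause of the ImportError above is that only the main script is shipped to the cluster, so sibling modules are missing on the executors. One hedged fix is the `--py-files` flag on `gcloud dataproc jobs submit pyspark`, which distributes the extra files with the job (file and cluster names below are hypothetical):

```shell
# Ship the sibling modules alongside the entry point so imports resolve
# on the cluster, not just locally (names are hypothetical examples).
CMD=(gcloud dataproc jobs submit pyspark
  --cluster=my-cluster
  --py-files=helpers.py,models.py
  main.py)
printf '%s\n' "${CMD[*]}"
```

A zip of the whole package directory can be passed the same way for larger projects.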
0 votes, 1 answer

set executable PATH in Jupyter Notebook on google cloud cluster Python3

I opened jupyter notebook on my google cloud cluster with these steps: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook Now I get an error on this piece of code: import selenium from contextlib import closing from…
0 votes, 1 answer

Connection timeout between dataproc and sql (CPB100 - Lab3b)

I am trying to run a script from Google CPB100 - Lab3b (train_and_apply.py) with Dataproc against SQL (a MySQL database), but I get a timeout: Caused by: java.net.ConnectException: Connection timed out (Connection timed out) From the Dataproc master I can…
Seguy · 56 · 5
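A connection timeout from Dataproc to Cloud SQL often means the cluster's IPs were never authorized on the SQL instance. One hedged approach, assuming a Cloud SQL instance reached over its public IP (the instance name and node IP below are hypothetical; each cluster node that connects would need its IP authorized):

```shell
# Authorize a cluster node's external IP on the Cloud SQL instance
# (instance name and IP are hypothetical placeholders).
SQL_INSTANCE=rentals
NODE_IP=203.0.113.7
CMD=(gcloud sql instances patch "$SQL_INSTANCE"
  --authorized-networks="$NODE_IP/32")
printf '%s\n' "${CMD[*]}"
```

Running the Cloud SQL Proxy on the cluster nodes is an alternative that avoids managing authorized networks per node.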
0 votes, 1 answer

Google cloud storage connector within sparkR on dataproc

I see that the gs:// interface is available within spark and pyspark on the dataproc cluster but doesn’t work in the SparkR shell. Is there a way to make it work? The path is simply not found if you run it. I am aware of the cloudyR project.
Alex · 19,533 · 37 · 126 · 195
0 votes, 1 answer

How to set region and zone (Google Cloud Platform) in DataProcPigOperator with Airflow

I have a problem. My DataProcPigOperator code ran fine while the default global region was in use, but after I changed my cluster's region to asia-east1 the code no longer runs, because DataProcPigOperator submits jobs to the default…
RJK · 239 · 2 · 3 · 14
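The symptom above matches a region mismatch: a job submitted without a region goes to the `global` endpoint and cannot find a regional cluster. On the CLI the fix is the `--region` flag; the equivalent setting then needs to reach the Airflow operator. A sketch with a hypothetical cluster name and a trivial Pig statement:

```shell
# A cluster created in asia-east1 must be addressed with the matching
# --region at submit time (cluster name and query are hypothetical).
CMD=(gcloud dataproc jobs submit pig
  --cluster=my-cluster
  --region=asia-east1
  --execute='fs -ls /')
printf '%s\n' "${CMD[*]}"
```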
0 votes, 2 answers

How to run cluster initialization script on GCP after creation of cluster

I have created a Google Dataproc cluster, but now need to install Presto as a new requirement. Presto is provided as an initialization action on Dataproc; how can I run this initialization action after creation of the cluster?
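Initialization actions only run when a cluster is created, so for an existing cluster one hedged workaround is to run the same script by hand on each node over SSH. A sketch for the master node; the instance name, zone, and script path are assumptions (the public `gs://dataproc-initialization-actions` bucket layout may differ from what is shown):

```shell
# Fetch and run the init-action script manually on one node
# (instance name, zone, and GCS path are assumptions, not verified values).
MASTER=my-cluster-m
ZONE=us-central1-a
CMD=(gcloud compute ssh "$MASTER" --zone="$ZONE" --command
  'gsutil cp gs://dataproc-initialization-actions/presto/presto.sh . && sudo bash presto.sh')
printf '%s\n' "${CMD[*]}"
```

The same command would be repeated for each worker node; recreating the cluster with the initialization action attached is the cleaner long-term option.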
0 votes, 1 answer

Read partitioned table or a view in BigQuery with Apache Spark

I'm using the Dataproc BigQuery connector to read a partitioned table. It contains over 300 GB of data and is partitioned by date, but all I need is today's data to read with the Spark connector. I tried reading it with a view from BigQuery…