Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
0 votes • 1 answer

Failed when following the instructions to set up an SSH tunnel for Datalab in Dataproc

I created a Google Dataproc cluster with Datalab installed. Then I followed the instructions to set up the SSH tunnel, but I got an error. I also tried other ports and got the same error. Not sure why. I was wondering if anything is wrong with…
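The documented route to Datalab on Dataproc is a SOCKS proxy over SSH to the master node; a minimal sketch of that step, assuming a hypothetical cluster name, zone, and the usual local port 1080:

    import subprocess

    # Open a SOCKS proxy on localhost:1080 through the master node
    # ("mycluster-m" and the zone are placeholders, not values from the question).
    # A browser configured to use the proxy can then reach Datalab on the master.
    subprocess.run([
        "gcloud", "compute", "ssh", "mycluster-m",
        "--zone=us-central1-a",
        "--", "-D", "1080", "-N",
    ], check=True)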
0 votes • 1 answer

Dataproc Spark job not able to scan records from Bigtable

We are using newAPIHadoopRDD to scan a Bigtable table and add the records to an RDD. The RDD gets populated via newAPIHadoopRDD for a smaller Bigtable (say, fewer than 100K records). However, it fails to load records into the RDD from a larger (say, 6M records)…
0 votes • 1 answer

How can I set the number of partitions when using the Bigquery Connector in Apache Spark?

I am reading the documentation both for Google Cloud Dataproc and for Apache Spark generally, and I am unable to figure out how to manually set the number of partitions when using the BigQuery connector. The RDD is created using newAPIHadoopRDD and my…
Justin • 2,322 • 1 • 16 • 22
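The Hadoop BigQuery connector chooses its own input splits, so the usual workaround is to repartition after loading; a hedged PySpark sketch, with placeholder project, bucket, and table names:

    from pyspark import SparkContext

    sc = SparkContext()
    conf = {
        "mapred.bq.project.id": "my-project",        # placeholder
        "mapred.bq.gcs.bucket": "my-temp-bucket",    # placeholder
        "mapred.bq.input.project.id": "my-project",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)
    # The connector decides the initial split count; downstream parallelism
    # can still be controlled explicitly with repartition().
    table_rdd = table_rdd.repartition(200)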
0 votes • 1 answer

Read Large Data Set to Jupyter Notebook and Manipulate

I am trying to load data from BigQuery into a Jupyter Notebook, where I will do some manipulation and plotting. The dataset is 25 million rows with 10 columns, which definitely exceeds my machine's memory capacity (16 GB). I have read this post about …
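One common way around the memory limit is to push the reduction into BigQuery and only pull the small result into pandas; a hedged sketch with the google-cloud-bigquery client, where the table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Aggregate (or sample) server-side so only the reduced result has to fit
    # into the notebook's 16 GB of RAM.
    query = """
        SELECT col1, AVG(col2) AS avg_col2
        FROM `my_project.my_dataset.my_table`
        GROUP BY col1
    """
    df = client.query(query).to_dataframe()
    df.plot(x="col1", y="avg_col2", kind="bar")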
0 votes • 0 answers

gcloud.dataproc.jobs.submit.hive The property [proxy.port] must have an integer value

I have a Hive table created on Google Cloud Dataproc. While executing the SQL query below, I get an exception like this: gcloud dataproc jobs submit hive --cluster mycluster \ -e "select * from table limit 10;" ERROR:…
0 votes • 0 answers

Workers not utilized when Spark reads many Parquet files

I have a GCS bucket in which the data is partitioned like this: year/month/day, plus a Dataproc cluster that has 89 executors across 30 workers with 24g of memory per executor. The question is, when I want to read the Parquet files under 2016/5/*, somehow the…
ByanJati • 83 • 1 • 11
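For reference, reading one month of a year/month/day layout and forcing the parallelism up afterwards might look like this hedged PySpark sketch (bucket name and target partition count are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parquet").getOrCreate()

    # The glob matches every day under 2016/5.
    df = spark.read.parquet("gs://my-bucket/2016/5/*")

    # With many small files the initial partition count can be low or skewed;
    # repartitioning spreads the work across the 89 executors.
    df = df.repartition(89 * 3)
    print(df.rdd.getNumPartitions())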
0 votes • 1 answer

How can I change the "Cloud storage Staging Bucket" of an existing Dataproc cluster?

I have one Dataproc cluster, and its Cloud Storage staging bucket is set to a bucket that no longer exists (was made just for testing purposes). There is another bucket that we wish to use instead. How would I connect this cluster to that bucket? I…
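As far as I know the staging bucket is fixed when a cluster is created, so the usual route is to recreate the cluster pointing at the new bucket; a hedged sketch with placeholder names:

    import subprocess

    # --bucket sets the Cloud Storage staging bucket at creation time; an
    # existing cluster cannot simply be repointed at a different bucket.
    subprocess.run([
        "gcloud", "dataproc", "clusters", "create", "mycluster-2",
        "--bucket", "my-new-staging-bucket",
        "--region", "us-central1",
    ], check=True)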
0 votes • 1 answer

Google Dataproc jobs tab not listing the jobs

I have created a Dataproc cluster and run Dataproc jobs. When I select the Jobs tab, it doesn't list the jobs I created, even when I select all regions.
Beu • 1,370 • 10 • 23
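A common explanation is that only jobs submitted through the Dataproc Jobs API show up in that tab, while jobs launched with spark-submit on the master do not; a hedged example of submitting through the API, with placeholder paths and names:

    import subprocess

    # Jobs submitted this way are tracked by the Dataproc service and appear
    # in the console's Jobs tab.
    subprocess.run([
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "gs://my-bucket/jobs/my_job.py",
        "--cluster", "mycluster",
        "--region", "us-central1",
    ], check=True)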
0 votes • 0 answers

Dataproc Hive count mismatch for partitioned tables

I have tables on a Dataproc Hadoop cluster which already contain data and are stable. But when I add additional partitions and repair the table, it still gives me the row count of the older state. So new partitions are added to the metastore, but still the new row…
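A frequent cause is count(*) being answered from stale table statistics rather than by scanning the new partitions; a hedged sketch that repairs the partitions and disables stats-based answers for the count, with placeholder database, table, and cluster names:

    import subprocess

    query = (
        "MSCK REPAIR TABLE my_db.my_table; "
        "SET hive.compute.query.using.stats=false; "
        "SELECT COUNT(*) FROM my_db.my_table;"
    )
    # Submitted as a Dataproc Hive job.
    subprocess.run([
        "gcloud", "dataproc", "jobs", "submit", "hive",
        "--cluster", "mycluster",
        "--region", "us-central1",
        "-e", query,
    ], check=True)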
0 votes • 1 answer

Sync files on HDFS that have the same size but differ in content

I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither of them works as expected. If the file size is the same on source and destination, both of them fail to update it, even if the file content…
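On the DistCp side, -update rewrites a target file when size, block size, or checksum differ, so same-size files are only skipped when checksum comparison is unavailable or disabled (e.g. with -skipcrccheck); a hedged sketch of the checksum-comparing form, with placeholder cluster paths:

    import subprocess

    # Leaving -skipcrccheck out keeps checksum comparison on, so files with the
    # same size but different contents are still recopied by -update.
    subprocess.run([
        "hadoop", "distcp", "-update",
        "hdfs://source-cluster/data",
        "hdfs://dest-cluster/data",
    ], check=True)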
0 votes • 2 answers

Move data from google cloud storage to S3 using dataproc hadoop cluster and airflow

I am trying to transfer a large quantity of data from GCS to an S3 bucket. I have spun up a Hadoop cluster using Google Dataproc. I am able to run the job via the Hadoop CLI using the following: hadoop distcp -update gs://GCS-bucket/folder…
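For the Airflow side, one hedged option is to trigger the same DistCp as a Dataproc Hadoop job from a task; the DistCp jar path is an assumption, and the sketch presumes the cluster already has S3 credentials (fs.s3a.*) configured:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("gcs_to_s3", start_date=datetime(2019, 1, 1), schedule_interval=None)

    # Submit DistCp as a Dataproc Hadoop job so it runs on the cluster rather
    # than on the Airflow worker. The jar path and bucket names are placeholders.
    distcp_task = BashOperator(
        task_id="gcs_to_s3_distcp",
        bash_command=(
            "gcloud dataproc jobs submit hadoop "
            "--cluster mycluster --region us-central1 "
            "--jar file:///usr/lib/hadoop-mapreduce/hadoop-distcp.jar "
            "-- -update gs://GCS-bucket/folder s3a://s3-bucket/folder"
        ),
        dag=dag,
    )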
0 votes • 2 answers

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like…
Mike Malloy • 1,520 • 1 • 15 • 19
0 votes • 1 answer

GCP Dataproc JDBC driver for pyspark job

I am trying to load a Postgres DB in Dataproc via PySpark jobs. My code works in local Spark, but I have trouble making things work in Dataproc because of a driver problem. I tried to load the driver by specifying it in jarFileUris (tried both Google…
Yong Hyun Kwon • 359 • 1 • 3 • 15
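Assuming the PostgreSQL JDBC jar is shipped at submit time (for example gcloud dataproc jobs submit pyspark job.py --jars gs://my-bucket/postgresql-42.2.5.jar, where the jar path is a placeholder), a hedged sketch of the read itself:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgres-read").getOrCreate()

    # Connection details are hypothetical; the driver class must match the jar
    # supplied via --jars / jarFileUris.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://10.0.0.5:5432/mydb")
          .option("dbtable", "public.my_table")
          .option("user", "myuser")
          .option("password", "mypassword")
          .option("driver", "org.postgresql.Driver")
          .load())
    df.show(5)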
0 votes • 1 answer

PySpark join failing on Dataproc

I am trying to run a Python PySpark script on a Dataproc cluster, but it fails with the error below: File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 815, in join if isinstance(on[0], basestring): IndexError: list…
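The traceback (on[0] raising IndexError) means join() received an empty key list; a minimal runnable sketch with an explicit, non-empty key:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-demo").getOrCreate()
    left = spark.createDataFrame([(1, "a")], ["customer_id", "x"])
    right = spark.createDataFrame([(1, "b")], ["customer_id", "y"])

    # Passing a non-empty key list (or a column condition) avoids the
    # "IndexError: list index out of range" seen when `on` is an empty list.
    joined = left.join(right, on=["customer_id"], how="inner")
    joined.show()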
0 votes • 1 answer

Google Dataproc Insufficient number of DataNodes reporting

I use the default network configuration and try to run a standard cluster with 1 master and 2 workers, but it always fails. Worker nodes fail to make an RPC to the master, or vice versa. I also get an info message on the cluster page notifying me that…
Karim Tarabishy • 1,223 • 1 • 13 • 25
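A commonly reported cause is a VPC firewall that blocks node-to-node traffic on the cluster's network; a hedged sketch of re-adding an allow-internal rule, where the network name and source range assume the default auto-mode VPC:

    import subprocess

    # Allow internal TCP/UDP/ICMP traffic between nodes on the "default" network;
    # 10.128.0.0/9 is the default auto-mode VPC range.
    subprocess.run([
        "gcloud", "compute", "firewall-rules", "create", "allow-dataproc-internal",
        "--network", "default",
        "--source-ranges", "10.128.0.0/9",
        "--allow", "tcp,udp,icmp",
    ], check=True)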