Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
0 votes • 1 answer

Failed when following the instructions to set up an SSH tunnel for Datalab in Dataproc

I created a Google Dataproc cluster with Datalab installed. Then I followed the instructions to set up the SSH tunnel, but I got an error. I also tried other ports and got the same error. Not sure why. I was wondering if anything is wrong with…
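The documented route to Datalab on Dataproc is a SOCKS proxy over SSH to the master node; a minimal sketch of that step, assuming a hypothetical cluster name, zone, and the usual local port 1080:

    import subprocess

    # Open a SOCKS proxy on localhost:1080 through the master node
    # ("mycluster-m" and the zone are placeholders, not values from the question).
    # A browser configured to use the proxy can then reach Datalab on the master.
    subprocess.run([
        "gcloud", "compute", "ssh", "mycluster-m",
        "--zone=us-central1-a",
        "--", "-D", "1080", "-N",
    ], check=True)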
0 votes • 1 answer

Dataproc Spark job not able to scan records from Bigtable

We are using newAPIHadoopRDD to scan a Bigtable table and add the records to an RDD. The RDD gets populated via newAPIHadoopRDD for a smaller Bigtable (say, fewer than 100K records). However, it fails to load records into the RDD from a larger (say, 6M records)…
0 votes • 1 answer

How can I set the number of partitions when using the Bigquery Connector in Apache Spark?

I am reading the documentation both for Google Cloud Dataproc and for Apache Spark generally, and I am unable to figure out how to manually set the number of partitions when using the BigQuery connector. The RDD is created using newAPIHadoopRDD and my…
Justin • 2,322 • 1 • 16 • 22
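The Hadoop BigQuery connector chooses its own input splits, so the usual workaround is to repartition after loading; a hedged PySpark sketch, with placeholder project, bucket, and table names:

    from pyspark import SparkContext

    sc = SparkContext()
    conf = {
        "mapred.bq.project.id": "my-project",        # placeholder
        "mapred.bq.gcs.bucket": "my-temp-bucket",    # placeholder
        "mapred.bq.input.project.id": "my-project",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)
    # The connector decides the initial split count; downstream parallelism
    # can still be controlled explicitly with repartition().
    table_rdd = table_rdd.repartition(200)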
0 votes • 1 answer

Read Large Data Set to Jupyter Notebook and Manipulate

I am trying to load data from BigQuery into a Jupyter Notebook, where I will do some manipulation and plotting. The dataset is 25 million rows with 10 columns, which definitely exceeds my machine's memory capacity (16 GB). I have read this post about …
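One common way around the memory limit is to push the reduction into BigQuery and only pull the small result into pandas; a hedged sketch with the google-cloud-bigquery client, where the table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Aggregate (or sample) server-side so only the reduced result has to fit
    # into the notebook's 16 GB of RAM.
    query = """
        SELECT col1, AVG(col2) AS avg_col2
        FROM `my_project.my_dataset.my_table`
        GROUP BY col1
    """
    df = client.query(query).to_dataframe()
    df.plot(x="col1", y="avg_col2", kind="bar")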
0 votes • 0 answers

gcloud.dataproc.jobs.submit.hive The property [proxy.port] must have an integer value

I have a Hive table created on Google Cloud Dataproc. While executing the SQL query below, I get an exception like this: gcloud dataproc jobs submit hive --cluster mycluster \ -e "select * from table limit 10;" ERROR:…
0 votes • 0 answers

Workers not utilized when Spark reads many Parquet files

I have a GCS bucket in which the data is partitioned like this: year/month/day, plus a Dataproc cluster that has 89 executors across 30 workers with 24g of memory per executor. The question is, when I want to read the Parquet files under 2016/5/*, somehow the…
ByanJati • 83 • 1 • 11
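For reference, reading one month of a year/month/day layout and forcing the parallelism up afterwards might look like this hedged PySpark sketch (bucket name and target partition count are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parquet").getOrCreate()

    # The glob matches every day under 2016/5.
    df = spark.read.parquet("gs://my-bucket/2016/5/*")

    # With many small files the initial partition count can be low or skewed;
    # repartitioning spreads the work across the 89 executors.
    df = df.repartition(89 * 3)
    print(df.rdd.getNumPartitions())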
0 votes • 1 answer

How can I change the "Cloud storage Staging Bucket" of an existing Dataproc cluster?

I have one Dataproc cluster, and its Cloud Storage staging bucket is set to a bucket that no longer exists (was made just for testing purposes). There is another bucket that we wish to use instead. How would I connect this cluster to that bucket? I…
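As far as I know the staging bucket is fixed when a cluster is created, so the usual route is to recreate the cluster pointing at the new bucket; a hedged sketch with placeholder names:

    import subprocess

    # --bucket sets the Cloud Storage staging bucket at creation time; an
    # existing cluster cannot simply be repointed at a different bucket.
    subprocess.run([
        "gcloud", "dataproc", "clusters", "create", "mycluster-2",
        "--bucket", "my-new-staging-bucket",
        "--region", "us-central1",
    ], check=True)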
0 votes • 1 answer

Google Dataproc jobs tab not listing the jobs

I have created a Dataproc cluster and run Dataproc jobs. When I select the Jobs tab, it doesn't list the jobs I created, even when I select all regions.
Beu • 1,370 • 10 • 23
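A common explanation is that only jobs submitted through the Dataproc Jobs API show up in that tab, while jobs launched with spark-submit on the master do not; a hedged example of submitting through the API, with placeholder paths and names:

    import subprocess

    # Jobs submitted this way are tracked by the Dataproc service and appear
    # in the console's Jobs tab.
    subprocess.run([
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "gs://my-bucket/jobs/my_job.py",
        "--cluster", "mycluster",
        "--region", "us-central1",
    ], check=True)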
0 votes • 0 answers

Dataproc Hive count mismatch for partitioned tables

I have tables on a Dataproc Hadoop cluster which already contain data and are stable. But when I add additional partitions and repair the table, it still gives me the row count of the older state. So new partitions are added to the metastore, but still the new row…
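A frequent cause is count(*) being answered from stale table statistics rather than by scanning the new partitions; a hedged sketch that repairs the partitions and disables stats-based answers for the count, with placeholder database, table, and cluster names:

    import subprocess

    query = (
        "MSCK REPAIR TABLE my_db.my_table; "
        "SET hive.compute.query.using.stats=false; "
        "SELECT COUNT(*) FROM my_db.my_table;"
    )
    # Submitted as a Dataproc Hive job.
    subprocess.run([
        "gcloud", "dataproc", "jobs", "submit", "hive",
        "--cluster", "mycluster",
        "--region", "us-central1",
        "-e", query,
    ], check=True)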
0 votes • 1 answer

Sync files on HDFS that have the same size but differ in content

I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither of them works as expected. If the file size is the same on source and destination, both of them fail to update it, even if the file content…
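On the DistCp side, -update rewrites a target file when size, block size, or checksum differ, so same-size files are only skipped when checksum comparison is unavailable or disabled (e.g. with -skipcrccheck); a hedged sketch of the checksum-comparing form, with placeholder cluster paths:

    import subprocess

    # Leaving -skipcrccheck out keeps checksum comparison on, so files with the
    # same size but different contents are still recopied by -update.
    subprocess.run([
        "hadoop", "distcp", "-update",
        "hdfs://source-cluster/data",
        "hdfs://dest-cluster/data",
    ], check=True)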
0 votes • 2 answers

Move data from google cloud storage to S3 using dataproc hadoop cluster and airflow

I am trying to transfer a large quantity of data from GCS to an S3 bucket. I have spun up a Hadoop cluster using Google Dataproc. I am able to run the job via the Hadoop CLI using the following: hadoop distcp -update gs://GCS-bucket/folder…
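For the Airflow side, one hedged option is to trigger the same DistCp as a Dataproc Hadoop job from a task; the DistCp jar path is an assumption, and the sketch presumes the cluster already has S3 credentials (fs.s3a.*) configured:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("gcs_to_s3", start_date=datetime(2019, 1, 1), schedule_interval=None)

    # Submit DistCp as a Dataproc Hadoop job so it runs on the cluster rather
    # than on the Airflow worker. The jar path and bucket names are placeholders.
    distcp_task = BashOperator(
        task_id="gcs_to_s3_distcp",
        bash_command=(
            "gcloud dataproc jobs submit hadoop "
            "--cluster mycluster --region us-central1 "
            "--jar file:///usr/lib/hadoop-mapreduce/hadoop-distcp.jar "
            "-- -update gs://GCS-bucket/folder s3a://s3-bucket/folder"
        ),
        dag=dag,
    )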
0 votes • 2 answers

Is it better to create many small Spark clusters or a smaller number of very large clusters

I am currently developing an application to wrangle a huge amount of data using Spark. The data is a mixture of Apache (and other) log files as well as csv and json files. The directory structure of my Google bucket will look something like…
Mike Malloy • 1,520 • 1 • 15 • 19
0 votes • 1 answer

GCP Dataproc JDBC driver for pyspark job

I am trying to load a Postgres DB in Dataproc via PySpark jobs. My code works in local Spark, but I have trouble making things work in Dataproc because of a driver problem. I tried to load the driver by specifying it in jarFileUris (tried both Google…
Yong Hyun Kwon • 359 • 1 • 3 • 15
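Assuming the PostgreSQL JDBC jar is shipped at submit time (for example gcloud dataproc jobs submit pyspark job.py --jars gs://my-bucket/postgresql-42.2.5.jar, where the jar path is a placeholder), a hedged sketch of the read itself:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgres-read").getOrCreate()

    # Connection details are hypothetical; the driver class must match the jar
    # supplied via --jars / jarFileUris.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://10.0.0.5:5432/mydb")
          .option("dbtable", "public.my_table")
          .option("user", "myuser")
          .option("password", "mypassword")
          .option("driver", "org.postgresql.Driver")
          .load())
    df.show(5)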
0 votes • 1 answer

PySpark join failing on Dataproc

I am trying to run a Python PySpark script on a Dataproc cluster, but it fails with the error below: File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 815, in join if isinstance(on[0], basestring): IndexError: list…
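The traceback (on[0] raising IndexError) means join() received an empty key list; a minimal runnable sketch with an explicit, non-empty key:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-demo").getOrCreate()
    left = spark.createDataFrame([(1, "a")], ["customer_id", "x"])
    right = spark.createDataFrame([(1, "b")], ["customer_id", "y"])

    # Passing a non-empty key list (or a column condition) avoids the
    # "IndexError: list index out of range" seen when `on` is an empty list.
    joined = left.join(right, on=["customer_id"], how="inner")
    joined.show()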
0 votes • 1 answer

Google Dataproc Insufficient number of DataNodes reporting

I use the default network configuration and try to run a standard cluster with 1 master and 2 workers, but it always fails. Worker nodes fail to make an RPC to the master, or vice versa. I also get an info message on the cluster page notifying me that…
Karim Tarabishy • 1,223 • 1 • 13 • 25
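A commonly reported cause is a VPC firewall that blocks node-to-node traffic on the cluster's network; a hedged sketch of re-adding an allow-internal rule, where the network name and source range assume the default auto-mode VPC:

    import subprocess

    # Allow internal TCP/UDP/ICMP traffic between nodes on the "default" network;
    # 10.128.0.0/9 is the default auto-mode VPC range.
    subprocess.run([
        "gcloud", "compute", "firewall-rules", "create", "allow-dataproc-internal",
        "--network", "default",
        "--source-ranges", "10.128.0.0/9",
        "--allow", "tcp,udp,icmp",
    ], check=True)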