Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying and managing clusters and for submitting jobs to clusters. This tag can be added to any question about using or troubleshooting Google Cloud Dataproc.

1563 questions
0 votes · 1 answer

Spark HBase to Google Dataproc and Bigtable migration

I have an HBase Spark job running on an AWS EMR cluster. Recently we moved to GCP. I transferred all HBase data to Bigtable. Now I am running the same Spark (Java/Scala) job on Dataproc. The Spark job is failing because it is looking for spark.hbase.zookeeper.quorum…
0 votes · 1 answer

GCP Dataproc: Poor network bandwidth using storage connector

Update: when loading the files using a DataFrame I achieved far superior performance. I haven't had a chance to look into why this is, but reading like this and then converting to an RDD is the best solution I've found so…
0 votes · 2 answers

Iterate Spark Dataframe running slow

I would like to verify an existing column's data and create a new column based on certain conditions. Problem: I have a dataset with around 500 columns and 9K (9,000) rows. Per my logic, if one of the columns has any null value, then create a new column with…
0 votes · 1 answer

How to cache data on Google Dataproc worker nodes

I want to cache some data (ndarrays) locally on worker nodes to do some comparison with the ndarrays arriving in incoming RDDs from Spark Streaming. What is the best way to do it? Since I want to compare the ndarrays stored in my files with each single…
0 votes · 2 answers

Error while using BigQuery connector

I am getting this error when running the Spotify Spark BigQuery connector on the Qubole data platform. I do see the BigQueryUtils class in my jar, but it still throws this error: Exception in thread "main" …
0 votes · 2 answers

Installing pyspark on Google Cloud Dataproc causes "Could not find valid SPARK_HOME while searching ['/tmp', '/usr/local/bin']"

I created a cluster with Google Cloud Dataproc. I can submit jobs to the cluster just fine until I do pip3 install pyspark on the cluster. After that, each time I try to submit a job, I receive an error: Could not find valid SPARK_HOME while…
0 votes · 1 answer

Extra delimiters while writing a Spark dataframe to HDFS

One of the columns in my source data file contains double quotes ("), and when I try to write this data from a dataframe into HDFS using PySpark code, it adds extra delimiters in the file. I am not sure what is happening here. My source data has 51…
asked by vp1008 (75)
0 votes · 1 answer

Unable to create Dataproc cluster using custom image

I am able to create a Google Dataproc cluster from the command line using a custom image: gcloud beta dataproc clusters create cluster-name --image=custom-image-name, as specified in https://cloud.google.com/dataproc/docs/guides/dataproc-images,…
0 votes · 1 answer

Can I display column headings when querying via gcloud dataproc jobs submit spark-sql?

I'm issuing a spark-sql job to Dataproc that simply displays some data from a table: gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10". When the data is returned and output to…
asked by jamiet (10,501)
0 votes · 1 answer

Run xgboost on Google Cloud Dataproc

I'm a newbie to distributed learning on virtual machines. I now have a large dataset and want to run xgboost on Google Cloud Dataproc. I checked the tutorial in the xgboost repository about running on AWS, but I think this is different from Google Cloud. …
0 votes · 0 answers

Spark job failing on Dataproc (it works on Databricks), error messages not clear to me

Update: I needed to increase the memory on the Dataproc nodes, but I couldn't get to the Spark UI for various reasons to see why the executors were dying. Coming back to this project with a little more Spark and GCP experience allowed me to quickly…
asked by Adair (1,697)
0 votes · 1 answer

Libraries conflict after upgrade of google-cloud library in my Java software running on Dataproc

I have a problem after upgrading the google-cloud library from version 0.8.0 to 0.32.0-alpha in my Java software running on Google Dataproc. Here are my Maven dependencies: com.google.cloud
asked by Andrea Zonzin (1,124)
0 votes · 1 answer

Which Java google-cloud library for the BigQuery and Dataproc combo?

I'm a little confused about which Google Cloud Java libraries I have to use in my Java Spark application submitted to Google Dataproc. In my application I have to use different Google Cloud services. In the BigQuery documentation, for example, I…
0 votes · 1 answer

Dataproc conflict in Hadoop temporary tables

I have a flow that executes Spark jobs on Dataproc clusters in parallel for different zones. For each zone it creates a cluster, executes the Spark job, and deletes the cluster after it finishes. The Spark job uses the…
asked by Bruno (182)
0 votes · 1 answer

Dependency issues for Cloud Dataproc + Spark + Cloud Bigtable with Java

I need to create an application to run on Cloud Dataproc and process large Bigtable writes, scans, and deletes in a massively parallel fashion using Spark. This could be in Java (or Python if it's doable). I am trying to write the minimum code using…
asked by VS_FF (2,353)