Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying and managing clusters and for submitting jobs to clusters. This tag can be added to any question about using or troubleshooting Google Cloud Dataproc.

1563 questions
0 votes · 1 answer

Spark HBase to Google Dataproc and Bigtable migration

I have an HBase Spark job running on an AWS EMR cluster. Recently we moved to GCP. I transferred all HBase data to Bigtable. Now I am running the same Spark (Java/Scala) job on Dataproc. The Spark job is failing because it is looking for spark.hbase.zookeeper.quorum…
0 votes · 1 answer

GCP Dataproc: Poor network bandwidth using storage connector

Update: when loading the files using a DataFrame I achieved far superior performance. I haven't had a chance to look into why this is, but reading like this and then converting to an RDD is the best solution I've found so…
0 votes · 2 answers

Iterate Spark Dataframe running slow

I would like to verify an existing column's data and create a new column based on certain conditions. Problem: I have a dataset with around 500 columns and 9K (9,000) rows. Per my logic, if one of the columns has any null value, then create a new column with…
0 votes · 1 answer

How to cache data on Google Dataproc worker nodes

I want to cache some data (ndarrays) locally on worker nodes to do some comparison with the ndarrays arriving in incoming RDDs from Spark Streaming. What is the best way to do it? Since I want to compare the ndarrays stored in my files with each single…
0 votes · 2 answers

Error while using BigQuery connector

I am getting this error when running the Spotify Spark BigQuery connector on the Qubole data platform. I do see the BigQueryUtils class in my jar, but it still throws this error: Exception in thread "main" …
0 votes · 2 answers

Installing pyspark on Google Cloud Dataproc causes "Could not find valid SPARK_HOME while searching ['/tmp', '/usr/local/bin']"

I created a cluster with Google Cloud Dataproc. I can submit jobs to the cluster just fine until I do pip3 install pyspark on the cluster. After that, each time I try to submit a job, I receive an error: Could not find valid SPARK_HOME while…
0 votes · 1 answer

Extra delimiters while writing a Spark dataframe to HDFS

One of the columns in my source data file contains double quotes ("), and when I try to write this data from a dataframe into HDFS using PySpark code, it adds extra delimiters in the file. I am not sure what is happening here. My source data has 51…
asked by vp1008 (75)
0 votes · 1 answer

Unable to create Dataproc cluster using custom image

I am able to create a Google Dataproc cluster from the command line using a custom image: gcloud beta dataproc clusters create cluster-name --image=custom-image-name, as specified in https://cloud.google.com/dataproc/docs/guides/dataproc-images,…
0 votes · 1 answer

Can I display column headings when querying via gcloud dataproc jobs submit spark-sql?

I'm issuing a spark-sql job to Dataproc that simply displays some data from a table: gcloud dataproc jobs submit spark-sql --cluster mycluster --region europe-west1 -e "select * from mydb.mytable limit 10". When the data is returned and output to…
asked by jamiet (10,501)
0 votes · 1 answer

Run xgboost on Google Cloud Dataproc

I'm a newbie to distributed learning on virtual machines. I now have a large dataset and want to run xgboost on Google Cloud Dataproc. I checked the tutorial in the xgboost repository about running on AWS, but I think this is different from Google Cloud. …
0 votes · 0 answers

Spark job failing on Dataproc (it works on Databricks), error messages not clear to me

Update: I needed to increase the memory on the Dataproc nodes, but I couldn't get to the Spark UI for various reasons to see why the executors were dying. Coming back to this project with a little more Spark and GCP experience allowed me to quickly…
asked by Adair (1,697)
0 votes · 1 answer

Libraries conflict after upgrade of google-cloud library in my Java software running on Dataproc

I have a problem after upgrading the google-cloud library from version 0.8.0 to 0.32.0-alpha in my Java software running on Google Dataproc. Here are my Maven dependencies: com.google.cloud
asked by Andrea Zonzin (1,124)
0 votes · 1 answer

Which Java google-cloud library for the BigQuery and Dataproc combo?

I'm a little confused about which Google Cloud Java libraries I have to use in my Java Spark application submitted to Google Dataproc. In my application I have to use different Google Cloud services. In the BigQuery documentation, for example, I…
0 votes · 1 answer

Dataproc conflict in Hadoop temporary tables

I have a flow that executes Spark jobs on Dataproc clusters in parallel for different zones. For each zone it creates a cluster, executes the Spark job, and deletes the cluster after it finishes. The Spark job uses the…
asked by Bruno (182)
0 votes · 1 answer

Dependency issues for Cloud Dataproc + Spark + Cloud Bigtable with Java

I need to create an application to run on Cloud Dataproc and process large Bigtable writes, scans, and deletes in a massively parallel fashion using Spark. This could be in Java (or Python if it's doable). I am trying to write the minimum code using…
asked by VS_FF (2,353)