Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
6 votes, 3 answers

How to use Google Cloud Storage for checkpoint location in streaming query?

I'm trying to run a Spark Structured Streaming job and save its checkpoint to Google Cloud Storage. I have a couple of jobs: one without aggregation works perfectly, but the second, with aggregations, throws an exception. I found that someone has had similar issues with…
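
A minimal sketch of a streaming aggregation that checkpoints to GCS (the bucket name and paths are hypothetical; the GCS connector preinstalled on Dataproc handles gs:// URIs):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("gcs-checkpoint-demo").getOrCreate()

    # The rate source generates rows locally, keeping the example self-contained.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # A simple aggregation, since the asker reports only aggregating queries fail.
    counts = events.groupBy(expr("value % 10 AS bucket")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .option("checkpointLocation", "gs://my-bucket/checkpoints/demo")
             .start())
    query.awaitTermination()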
6 votes, 2 answers

How can I perform data lineage in GCP?

When we build a data lake on GCP Cloud Storage and do data processing with Cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP?
6 votes, 1 answer

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I uploaded a data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
DivyaMishra (73)
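
One common approach is Hadoop's distcp, which copies between any filesystems the installed connectors expose. A sketch (the paths are hypothetical; run it on a cluster node, e.g. over SSH, where the hadoop CLI and the GCS connector are present, as they are on Dataproc images):

    import subprocess

    src = "gs://my-bucket/data/file.csv"     # hypothetical GCS object
    dst = "hdfs:///user/myuser/file.csv"     # hypothetical HDFS destination

    # Equivalent to running `hadoop distcp <src> <dst>` in a shell.
    subprocess.run(["hadoop", "distcp", src, dst], check=True)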
6 votes, 1 answer

Container killed by YARN for exceeding memory limits

I am creating a cluster in Google Dataproc with the following characteristics: Master: Standard (1 master, N workers), machine n1-highmem-2 (2 vCPU, 13.0 GB memory), primary disk 250 GB; worker nodes: 2, machine type n1-highmem-2 (2…
Mpizos Dimitris (4,819)
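
This error is commonly mitigated by raising the executor's off-heap allowance so the YARN container limit is not breached. A sketch, not the asker's actual fix; the values are illustrative for an n1-highmem-2 worker:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("memory-overhead-demo")
             .config("spark.executor.memory", "4g")
             # On Spark 2.2 and earlier the property is named
             # spark.yarn.executor.memoryOverhead instead.
             .config("spark.executor.memoryOverhead", "1g")
             .getOrCreate())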
6 votes, 0 answers

How can I make Spark Thrift Server clean up its cache?

We're using Spark Thrift Server as a long-running service for ad-hoc SQL queries, instead of Hive/Tez. This is working out fairly well, except that every few days it starts filling up the disk on worker nodes. The files are all in…
6 votes, 2 answers

How do I restart Hadoop services on a Dataproc cluster?

I may be searching with the wrong terms, but Google is not telling me how to do this. The question is: how can I restart Hadoop services on Dataproc after changing some configuration files (YARN properties, etc.)? Services have to be restarted on a…
EduBoom (147)
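
On Dataproc images the Hadoop daemons run under systemd, so restarting them is a systemctl call on the node hosting each service. A sketch; the unit names below are assumptions to verify with systemctl list-units on your image version:

    import subprocess

    # Hypothetical unit selection; ResourceManager/NameNode live on the master.
    units = ["hadoop-yarn-resourcemanager.service",
             "hadoop-hdfs-namenode.service"]
    for unit in units:
        subprocess.run(["sudo", "systemctl", "restart", unit], check=True)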
6 votes, 1 answer

How can I run two parallel jobs on Google Dataproc

I have one job that will take a long time to run on Dataproc. In the meantime I need to be able to run other, smaller jobs. From what I could gather from the Google Dataproc documentation, the platform is supposed to support multiple jobs, since it…
fbexiga (75)
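
One way to keep a long job from monopolizing YARN is to cap each job's executors and submit without waiting for completion. A sketch driving the gcloud CLI from Python; the cluster name, region, and script paths are hypothetical:

    import subprocess

    def submit(main_file):
        # Popen returns immediately; gcloud streams the job's driver output.
        return subprocess.Popen([
            "gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
            "--cluster", "my-cluster", "--region", "us-central1",
            "--properties",
            "spark.dynamicAllocation.enabled=false,spark.executor.instances=2",
        ])

    procs = [submit("gs://my-bucket/jobs/big_job.py"),
             submit("gs://my-bucket/jobs/small_job.py")]
    for p in procs:
        p.wait()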
6 votes, 1 answer

How to import csv files with massive column count into Apache Spark 2.0

I'm running into a problem importing multiple small CSV files with over 250,000 columns of float64 into Apache Spark 2.0 running as a Google Dataproc cluster. There are a handful of string columns, but I'm only really interested in one as the class…
mobcdi (1,532)
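
With hundreds of thousands of columns, schema inference alone is expensive. One workaround (a sketch; the path and the class column name are hypothetical) is to read everything as strings and cast only the columns actually used:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("wide-csv-demo").getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "false")    # skip the costly inference pass
          .csv("gs://my-bucket/wide/*.csv"))

    # Pull out just the label; leave the float columns untouched until needed.
    labels = df.select(col("class_col").alias("label"))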
6 votes, 1 answer

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and 2 worker nodes were of type n1-standard-4 (4 vCPU, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPU, 52 GB memory) via the web console. I noticed that the two workers…
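
Dataproc computes Spark defaults (executor memory, cores) from the machine type at cluster-creation time, so they go stale after a resize. A sketch of overriding them per session; the values are illustrative for n1-highmem-8, not prescriptive:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("resized-cluster-demo")
             # The stale defaults live in /etc/spark/conf/spark-defaults.conf;
             # overriding here avoids editing files on the cluster.
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "18g")
             .getOrCreate())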
6 votes, 1 answer

Spark loses all executors one minute after starting

I run PySpark on an 8-node Google Dataproc cluster with default settings. A few seconds after starting, I see 30 executor cores running (as expected): >>> sc.defaultParallelism returns 30. One minute later: >>> sc.defaultParallelism returns 2. From that…
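
Dataproc enables Spark dynamic allocation by default, which releases idle executors after about a minute, so the drop in defaultParallelism is expected behavior rather than a failure. A sketch of pinning a fixed executor count instead (the instance count is illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("stable-executors-demo")
             .config("spark.dynamicAllocation.enabled", "false")
             .config("spark.executor.instances", "15")
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)  # stays constant now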
6 votes, 2 answers

Use an external library in a PySpark job on a Spark cluster from google-dataproc

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this: I started an SSH session with the master node of my cluster,…
sweeeeeet (1,769)
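
Rather than installing the JAR by hand over SSH, Spark can resolve Maven coordinates at startup via spark.jars.packages. A sketch; the coordinates match the published Scala 2.11 build of spark-csv, and the data path is hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-csv-demo")
             # Must be set before the session/context is created.
             .config("spark.jars.packages",
                     "com.databricks:spark-csv_2.11:1.5.0")
             .getOrCreate())

    df = (spark.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("gs://my-bucket/data.csv"))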
6 votes, 3 answers

Spark - Adding JDBC Driver JAR to Google Dataproc

I am trying to write via JDBC: df.write.jdbc("jdbc:postgresql://123.123.123.123:5432/myDatabase", "myTable", props). The Spark docs explain that the configuration option spark.driver.extraClassPath cannot be used to add JDBC driver JARs if running…
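
Since extraClassPath cannot take effect once the driver JVM is running, the usual route is to ship the driver JAR with spark.jars (or --jars at submit time). A sketch; the JAR location, database host, and table are hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("jdbc-demo")
             .config("spark.jars",
                     "gs://my-bucket/jars/postgresql-42.2.5.jar")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a")], ["id", "val"])
    df.write.jdbc("jdbc:postgresql://10.0.0.5:5432/myDatabase", "myTable",
                  properties={"driver": "org.postgresql.Driver"})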
6 votes, 1 answer

How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to manually install the libraries I need. It would be great to…
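
The Dataproc mechanism for this is an initialization action: an executable staged in GCS that runs on every node at cluster creation. A sketch of one written in Python; the package list is hypothetical:

    #!/usr/bin/env python
    # Save as e.g. install_libs.py, upload to GCS, then create the cluster with:
    #   gcloud dataproc clusters create my-cluster \
    #     --initialization-actions gs://my-bucket/install_libs.py
    import subprocess

    for pkg in ["pandas", "scikit-learn"]:
        subprocess.run(["pip", "install", pkg], check=True)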
5 votes, 0 answers

Why is UDF slower than pandas UDF on PySpark?

I am taking my first steps in PySpark, and currently, I am studying UDFs and pandas UDFs. I have read several forums, and they more or less agree that "pandas UDFs allow vectorized operations that can increase performance up to 100x compared to…
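
A side-by-side sketch of the two flavors computing the same thing: the plain UDF crosses the Python boundary once per row, while the pandas UDF does so once per Arrow batch (requires PyArrow on the cluster; the type-hint style shown is Spark 3.x):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()
    df = spark.range(1_000_000).withColumnRenamed("id", "x")

    plain = udf(lambda x: float(x) * 2.0, DoubleType())   # one call per row

    @pandas_udf(DoubleType())
    def vectorized(x: pd.Series) -> pd.Series:            # one call per batch
        return x * 2.0

    df.select(plain("x").alias("a"), vectorized("x").alias("b")).show(3)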
5 votes, 1 answer

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but am getting the error below: Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html. I am able to read the data from Hudi…
radhika sharma (499)
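
The "hudi" short name only resolves when a Hudi bundle is on the classpath. A sketch of attaching one at session start; the bundle coordinates are one published build, so adjust them to your Spark/Scala versions, and the table path is hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hudi-demo")
             .config("spark.jars.packages",
                     "org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.0")
             # Hudi's docs call for Kryo serialization.
             .config("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())

    df = spark.read.format("hudi").load("gs://my-bucket/hudi/my_table")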