Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig and Hive service on Google Cloud Platform. The service provides GUI, CLI and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any questions related to using/troubleshooting Google Cloud Dataproc.

1563 questions
6 votes, 3 answers

How to use Google Cloud Storage for checkpoint location in streaming query?

I'm trying to run a Spark Structured Streaming job and save its checkpoint to Google Cloud Storage. I have a couple of jobs: one without aggregation works perfectly, but the second, with aggregations, throws an exception. I found that someone has had similar issues with…
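
A minimal sketch of a streaming aggregation that checkpoints to GCS (the bucket name and paths are hypothetical; the GCS connector preinstalled on Dataproc handles gs:// URIs):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("gcs-checkpoint-demo").getOrCreate()

    # The rate source generates rows locally, keeping the example self-contained.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # A simple aggregation, since the asker reports only aggregating queries fail.
    counts = events.groupBy(expr("value % 10 AS bucket")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .option("checkpointLocation", "gs://my-bucket/checkpoints/demo")
             .start())
    query.awaitTermination()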
6 votes, 2 answers

How can I perform data lineage in GCP?

When we build a data lake on GCP Cloud Storage and do data processing with Cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP?
6 votes, 1 answer

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I uploaded a data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
DivyaMishra (73)
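
One common approach is Hadoop's distcp, which copies between any filesystems the installed connectors expose. A sketch (the paths are hypothetical; run it on a cluster node, e.g. over SSH, where the hadoop CLI and the GCS connector are present, as they are on Dataproc images):

    import subprocess

    src = "gs://my-bucket/data/file.csv"     # hypothetical GCS object
    dst = "hdfs:///user/myuser/file.csv"     # hypothetical HDFS destination

    # Equivalent to running `hadoop distcp <src> <dst>` in a shell.
    subprocess.run(["hadoop", "distcp", src, dst], check=True)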
6 votes, 1 answer

Container killed by YARN for exceeding memory limits

I am creating a cluster in Google Dataproc with the following characteristics: Master: Standard (1 master, N workers), machine n1-highmem-2 (2 vCPU, 13.0 GB memory), primary disk 250 GB; worker nodes: 2, machine type n1-highmem-2 (2…
Mpizos Dimitris (4,819)
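
This error is commonly mitigated by raising the executor's off-heap allowance so the YARN container limit is not breached. A sketch, not the asker's actual fix; the values are illustrative for an n1-highmem-2 worker:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("memory-overhead-demo")
             .config("spark.executor.memory", "4g")
             # On Spark 2.2 and earlier the property is named
             # spark.yarn.executor.memoryOverhead instead.
             .config("spark.executor.memoryOverhead", "1g")
             .getOrCreate())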
6 votes, 0 answers

How can I make Spark Thrift Server clean up its cache?

We're using Spark Thrift Server as a long-running service for ad-hoc SQL queries, instead of Hive/Tez. This is working out fairly well, except that every few days it starts filling up the disk on worker nodes. The files are all in…
6 votes, 2 answers

How do I restart Hadoop services on a Dataproc cluster?

I may be searching with the wrong terms, but Google is not telling me how to do this. The question is: how can I restart Hadoop services on Dataproc after changing some configuration files (YARN properties, etc.)? Services have to be restarted on a…
EduBoom (147)
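
On Dataproc images the Hadoop daemons run under systemd, so restarting them is a systemctl call on the node hosting each service. A sketch; the unit names below are assumptions to verify with systemctl list-units on your image version:

    import subprocess

    # Hypothetical unit selection; ResourceManager/NameNode live on the master.
    units = ["hadoop-yarn-resourcemanager.service",
             "hadoop-hdfs-namenode.service"]
    for unit in units:
        subprocess.run(["sudo", "systemctl", "restart", unit], check=True)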
6 votes, 1 answer

How can I run two parallel jobs on Google Dataproc

I have one job that will take a long time to run on Dataproc. In the meantime I need to be able to run other, smaller jobs. From what I could gather from the Google Dataproc documentation, the platform is supposed to support multiple jobs, since it…
fbexiga (75)
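
One way to keep a long job from monopolizing YARN is to cap each job's executors and submit without waiting for completion. A sketch driving the gcloud CLI from Python; the cluster name, region, and script paths are hypothetical:

    import subprocess

    def submit(main_file):
        # Popen returns immediately; gcloud streams the job's driver output.
        return subprocess.Popen([
            "gcloud", "dataproc", "jobs", "submit", "pyspark", main_file,
            "--cluster", "my-cluster", "--region", "us-central1",
            "--properties",
            "spark.dynamicAllocation.enabled=false,spark.executor.instances=2",
        ])

    procs = [submit("gs://my-bucket/jobs/big_job.py"),
             submit("gs://my-bucket/jobs/small_job.py")]
    for p in procs:
        p.wait()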
6 votes, 1 answer

How to import csv files with massive column count into Apache Spark 2.0

I'm running into a problem importing multiple small CSV files with over 250,000 columns of float64 into Apache Spark 2.0 running as a Google Dataproc cluster. There are a handful of string columns, but I'm only really interested in one as the class…
mobcdi (1,532)
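
With hundreds of thousands of columns, schema inference alone is expensive. One workaround (a sketch; the path and the class column name are hypothetical) is to read everything as strings and cast only the columns actually used:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("wide-csv-demo").getOrCreate()

    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "false")    # skip the costly inference pass
          .csv("gs://my-bucket/wide/*.csv"))

    # Pull out just the label; leave the float columns untouched until needed.
    labels = df.select(col("class_col").alias("label"))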
6 votes, 1 answer

How to update spark configuration after resizing worker nodes in Cloud Dataproc

I have a Dataproc Spark cluster. Initially, the master and 2 worker nodes were of type n1-standard-4 (4 vCPU, 15.0 GB memory); then I resized all of them to n1-highmem-8 (8 vCPU, 52 GB memory) via the web console. I noticed that the two workers…
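
Dataproc computes Spark defaults (executor memory, cores) from the machine type at cluster-creation time, so they go stale after a resize. A sketch of overriding them per session; the values are illustrative for n1-highmem-8, not prescriptive:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("resized-cluster-demo")
             # The stale defaults live in /etc/spark/conf/spark-defaults.conf;
             # overriding here avoids editing files on the cluster.
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "18g")
             .getOrCreate())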
6 votes, 1 answer

Spark loses all executors one minute after starting

I run PySpark on an 8-node Google Dataproc cluster with default settings. A few seconds after starting, I see 30 executor cores running (as expected): >>> sc.defaultParallelism returns 30. One minute later: >>> sc.defaultParallelism returns 2. From that…
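
Dataproc enables Spark dynamic allocation by default, which releases idle executors after about a minute, so the drop in defaultParallelism is expected behavior rather than a failure. A sketch of pinning a fixed executor count instead (the instance count is illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("stable-executors-demo")
             .config("spark.dynamicAllocation.enabled", "false")
             .config("spark.executor.instances", "15")
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)  # stays constant now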
6 votes, 2 answers

Use an external library in a PySpark job on a Spark cluster from google-dataproc

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this: I started an SSH session with the master node of my cluster,…
sweeeeeet (1,769)
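
Rather than installing the JAR by hand over SSH, Spark can resolve Maven coordinates at startup via spark.jars.packages. A sketch; the coordinates match the published Scala 2.11 build of spark-csv, and the data path is hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-csv-demo")
             # Must be set before the session/context is created.
             .config("spark.jars.packages",
                     "com.databricks:spark-csv_2.11:1.5.0")
             .getOrCreate())

    df = (spark.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load("gs://my-bucket/data.csv"))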
6 votes, 3 answers

Spark - Adding JDBC Driver JAR to Google Dataproc

I am trying to write via JDBC: df.write.jdbc("jdbc:postgresql://123.123.123.123:5432/myDatabase", "myTable", props). The Spark docs explain that the configuration option spark.driver.extraClassPath cannot be used to add JDBC driver JARs if running…
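
Since extraClassPath cannot take effect once the driver JVM is running, the usual route is to ship the driver JAR with spark.jars (or --jars at submit time). A sketch; the JAR location, database host, and table are hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("jdbc-demo")
             .config("spark.jars",
                     "gs://my-bucket/jars/postgresql-42.2.5.jar")
             .getOrCreate())

    df = spark.createDataFrame([(1, "a")], ["id", "val"])
    df.write.jdbc("jdbc:postgresql://10.0.0.5:5432/myDatabase", "myTable",
                  properties={"driver": "org.postgresql.Driver"})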
6 votes, 1 answer

How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to manually install the libraries I need. It would be great to…
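
The Dataproc mechanism for this is an initialization action: an executable staged in GCS that runs on every node at cluster creation. A sketch of one written in Python; the package list is hypothetical:

    #!/usr/bin/env python
    # Save as e.g. install_libs.py, upload to GCS, then create the cluster with:
    #   gcloud dataproc clusters create my-cluster \
    #     --initialization-actions gs://my-bucket/install_libs.py
    import subprocess

    for pkg in ["pandas", "scikit-learn"]:
        subprocess.run(["pip", "install", pkg], check=True)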
5 votes, 0 answers

Why is UDF slower than pandas UDF on PySpark?

I am taking my first steps in PySpark, and currently, I am studying UDFs and pandas UDFs. I have read several forums, and they more or less agree that "pandas UDFs allow vectorized operations that can increase performance up to 100x compared to…
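
A side-by-side sketch of the two flavors computing the same thing: the plain UDF crosses the Python boundary once per row, while the pandas UDF does so once per Arrow batch (requires PyArrow on the cluster; the type-hint style shown is Spark 3.x):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.appName("udf-demo").getOrCreate()
    df = spark.range(1_000_000).withColumnRenamed("id", "x")

    plain = udf(lambda x: float(x) * 2.0, DoubleType())   # one call per row

    @pandas_udf(DoubleType())
    def vectorized(x: pd.Series) -> pd.Series:            # one call per batch
        return x * 2.0

    df.select(plain("x").alias("a"), vectorized("x").alias("b")).show(3)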
5 votes, 1 answer

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but am getting the error below: Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html. I am able to read the data from Hudi…
radhika sharma (499)
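
The "hudi" short name only resolves when a Hudi bundle is on the classpath. A sketch of attaching one at session start; the bundle coordinates are one published build, so adjust them to your Spark/Scala versions, and the table path is hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hudi-demo")
             .config("spark.jars.packages",
                     "org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.0")
             # Hudi's docs call for Kryo serialization.
             .config("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
             .getOrCreate())

    df = spark.read.format("hudi").load("gs://my-bucket/hudi/my_table")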