Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying/managing clusters and submitting jobs onto clusters. This tag can be added to any question related to using/troubleshooting Google Cloud Dataproc.

1563 questions
11 votes · 1 answer

BigQuery connector for pyspark via Hadoop Input Format example

I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing. I realized that BigQuery supports the Hadoop Input / Output…
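
A minimal sketch of the pattern this question is after, using the BigQuery Hadoop connector shipped on Dataproc clusters. The project, dataset, table, and bucket names are placeholders:

```python
import json
from pyspark import SparkContext

sc = SparkContext()

# Placeholder names; the conf keys and input format class are those of
# Dataproc's BigQuery Hadoop connector.
conf = {
    "mapred.bq.project.id": "my-project",           # project running the job
    "mapred.bq.gcs.bucket": "my-staging-bucket",    # bucket for the temp export
    "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq-tmp",
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "my_table",
}

# Each record arrives as (row key, JSON string of the row); parse the value.
rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
).map(lambda kv: json.loads(kv[1]))
```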
10 votes · 1 answer

PySpark print to console

When running a PySpark job on the dataproc server like this: gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster> …, my print statements don't show up in my terminal. Is there any way to output data…
Roman · 8,826
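
The short version of what answers to this question usually point out: `print()` goes to the job's driver output (visible in the Cloud Console or via `gcloud dataproc jobs wait <job-id>`), not to the terminal that submitted the job. A hedged sketch of the common logging workaround:

```python
import logging
import sys

# Write through an explicit stderr handler so messages land in the
# Dataproc driver output alongside Spark's own logs.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)
log.info("this line shows up in the Dataproc job driver output")
```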
10 votes · 3 answers

Guava version while using spark-shell

I'm trying to use the spark-cassandra-connector via spark-shell on Dataproc; however, I am unable to connect to my cluster. It appears that there is a version mismatch, since the classpath includes a much older Guava version from somewhere else,…
10 votes · 1 answer

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the GCS URI of the zip of the Python files to use (via the --py-files argument) and the Python file name as the PY_FILE argument value. This did not seem to work. Do I need to provide some relative path for the…
bjorndv · 523
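
The usual CLI shape here is `gcloud dataproc jobs submit pyspark gs://bucket/main.py --cluster=<cluster> --py-files=gs://bucket/libs.zip`. An alternative that sidesteps the flag entirely is to distribute the zip from inside the driver; a minimal sketch, where the bucket path and module name are hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext()

# Ship the zip to driver and executors; the zip root must contain the
# importable packages. gs://my-bucket/libs.zip is a placeholder path.
sc.addPyFile("gs://my-bucket/libs.zip")

import mylib  # hypothetical module, importable after addPyFile
```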
9 votes · 1 answer

How does a Dataproc Spark operator return a value and how to capture and return it

How does a Dataproc Spark operator in Airflow return a value, and how can I capture it? I have a downstream job which captures this result; based on the returned value, I have to trigger another job via a branch operator.
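
A sketch of the branching half of this setup, assuming the Dataproc submit task pushes something useful to XCom (what it pushes depends on the operator and provider version; all task ids here are hypothetical):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import BranchPythonOperator

def choose_branch(**context):
    # Pull whatever the upstream Dataproc task returned via XCom.
    result = context["ti"].xcom_pull(task_ids="submit_spark_job")
    return "trigger_next_job" if result else "skip_job"

with DAG(
    "dataproc_branching",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    branch = BranchPythonOperator(
        task_id="branch_on_result",
        python_callable=choose_branch,
    )
```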
9 votes · 3 answers

ModuleNotFoundError because PySpark serializer is not able to locate library folder

I have the following folder structure: a libfolder/ directory containing lib1.py and lib2.py, plus main.py. main.py calls libfolder.lib1, which then calls libfolder.lib2 and others. It all works perfectly fine on my local machine, but after I deploy it to Dataproc…
Golak Sarangi · 809
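
The root cause is typically that the package only exists on the driver's machine, so the serializer can't resolve it on executors. A minimal sketch of the common fix: make libfolder a proper package (with an __init__.py) and ship it as a zip:

```python
import shutil
from pyspark import SparkContext

# Package the local "libfolder" directory into libfolder.zip so that the
# zip's root contains libfolder/ itself.
shutil.make_archive("libfolder", "zip", ".", "libfolder")

sc = SparkContext()
sc.addPyFile("libfolder.zip")  # now resolvable on driver and executors

from libfolder import lib1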
9 votes · 4 answers

How to run python3 on google's dataproc pyspark

I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of the default 2.7. The best I've been able to find is adding these initialization commands. However, when I ssh…
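
The interpreter is chosen cluster-side (for example via the `spark.pyspark.python` property or a PYSPARK_PYTHON export in an initialization action), so a useful first step is verifying which interpreter the driver and executors actually run. A small diagnostic sketch:

```python
import sys
from pyspark import SparkContext

sc = SparkContext()

# Interpreter on the driver.
print("driver:", sys.version)

# Interpreters on the executors: run a trivial task per partition.
versions = (sc.parallelize(range(4), 4)
              .map(lambda _: sys.version)
              .distinct()
              .collect())
print("executors:", versions)
```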
9 votes · 2 answers

Request insufficient authentication scopes when running Spark-Job on dataproc

I am trying to run the Spark job on the Google Dataproc cluster as: gcloud dataproc jobs submit hadoop --cluster <cluster> \ --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \ --class org.apache.hadoop.examples.WordCount…
Vishal · 1,442
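
On Dataproc VMs the usable OAuth scopes are fixed at cluster creation time (the `--scopes` flag), which is usually where this error comes from. A hedged snippet for inspecting what the application default credentials on a node look like:

```python
import google.auth

# On a Dataproc VM these credentials come from the cluster's service
# account; the scopes attribute may be None depending on credential type.
credentials, project = google.auth.default()
print("project:", project)
print("scopes:", getattr(credentials, "scopes", None))
```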
9 votes · 1 answer

spark "basePath" option setting

When I do: allf = spark.read.parquet("gs://bucket/folder/*") I get: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths: ... And the following message after the list of paths: If provided…
jldupont · 93,734
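
The `basePath` option tells Spark where partition discovery should be rooted, which is the standard way out of the "Conflicting directory structures" assertion. A minimal sketch, with placeholder bucket and folder names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Anchor partition discovery at the folder root so Spark treats the
# subdirectories under it as partitions of one table.
allf = (spark.read
        .option("basePath", "gs://bucket/folder/")
        .parquet("gs://bucket/folder/*"))
```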
8 votes · 4 answers

How to run spark 3.2.0 on google dataproc?

Currently, Google Dataproc does not have Spark 3.2.0 as an image. The latest available is 3.1.2. I want to use the pandas-on-PySpark functionality that Spark released with 3.2.0. I am doing the following steps to use Spark 3.2.0: created an…
figs_and_nuts · 4,870
8 votes · 3 answers

Feature Selection in PySpark

I am working on a machine learning model over a dataset of shape 1,456,354 × 53. I wanted to do feature selection for my data set. I know how to do feature selection in Python using the following code: from sklearn.feature_selection import RFECV, RFE logreg =…
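
The scikit-learn recursive-elimination approach doesn't translate directly, but pyspark.ml ships its own selectors. A hedged sketch using ChiSqSelector, with a hypothetical DataFrame of 53 numeric columns plus a binary "label" column:

```python
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("gs://bucket/training_data")  # placeholder path
feature_cols = [c for c in df.columns if c != "label"]

# pyspark.ml selectors operate on a single vector column.
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(df)

# Keep the 20 features most associated with the label (chi-squared test).
selector = ChiSqSelector(numTopFeatures=20, featuresCol="features",
                         labelCol="label", outputCol="selected")
selected = selector.fit(assembled).transform(assembled)
```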
8 votes · 3 answers

Dataprep vs Dataflow vs Dataproc

To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?
8 votes · 2 answers

Why does Spark (on Google Dataproc) not use all vcores?

I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use…
borarak · 1,130
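
Part of the answer is that YARN on Dataproc schedules containers by memory by default, so the vcore count shown in the UI can be misleading; the other part is sizing executors explicitly. A sketch with placeholder numbers:

```python
from pyspark.sql import SparkSession

# Explicit executor sizing so Spark requests the core count you expect,
# rather than whatever the defaults work out to. Values are placeholders.
spark = (SparkSession.builder
         .config("spark.executor.cores", "4")
         .config("spark.executor.instances", "8")
         .getOrCreate())
```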
8 votes · 3 answers

How to read simple text file from Google Cloud Storage using Spark-Scala local Program

As described in the blog post https://cloud.google.com/blog/big-data/2016/06/google-cloud-dataproc-the-fast-easy-and-safe-way-to-try-spark-20-preview, I was trying to read a file from Google Cloud Storage using Spark-Scala. For that I have imported…
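
The blog's example targets Scala; a hedged PySpark equivalent for a *local* (non-Dataproc) run needs the GCS connector jar on the classpath and Hadoop configuration pointing at a service-account key. The jar and key paths and the bucket name are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # GCS connector jar must be on the classpath when running locally.
         .config("spark.jars", "/path/to/gcs-connector-hadoop3-latest.jar")
         .getOrCreate())

# Wire the gs:// filesystem and service-account auth into the Hadoop conf.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.gs.impl",
          "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile",
          "/path/to/key.json")

lines = spark.read.text("gs://my-bucket/simple.txt")  # placeholder object
lines.show()
```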
8 votes · 1 answer

How to get path to the uploaded file

I am running a Spark cluster on Google Cloud and I upload a configuration file with each job. What is the path to a file that is uploaded with a submit command? In the example below, how can I read the file Configuration.properties before the…
orestis · 932
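
Files passed with `--files` (or added via SparkContext.addFile) are staged in each task's working directory, and SparkFiles resolves the local path. A minimal sketch:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext()

# Resolve the staged copy of a file that was shipped with the job
# (e.g. via `--files Configuration.properties` at submit time).
path = SparkFiles.get("Configuration.properties")
with open(path) as f:
    config_text = f.read()
```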