Questions tagged [google-cloud-dataproc]

Google Cloud Dataproc is a managed Hadoop MapReduce, Spark, Pig, and Hive service on Google Cloud Platform. The service provides GUI, CLI, and HTTP API access modes for deploying/managing clusters and submitting jobs to clusters. This tag can be added to any question related to using or troubleshooting Google Cloud Dataproc.

1563 questions
0 votes, 1 answer

Dataproc node setup

I understand Google Dataproc clusters are equipped to handle initialization actions, which are executed on creation of every node. However, this is only reasonable for small actions, and would not do well with creating nodes with tons of…
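For reference, an initialization action is attached to a cluster as a Cloud Storage script URI in the create-cluster request. A minimal sketch of that request body, assuming the documented v1 REST `Cluster` resource field names (the project, cluster, and script names are hypothetical):

```python
# Sketch of a Dataproc cluster request body that attaches an
# initialization action. Each action is a script in GCS that every
# node runs once at creation time.
def cluster_with_init_actions(project, name, scripts):
    return {
        "projectId": project,
        "clusterName": name,
        "config": {
            "initializationActions": [
                {"executableFile": uri, "executionTimeout": "600s"}
                for uri in scripts
            ],
        },
    }

body = cluster_with_init_actions(
    "my-project", "demo-cluster", ["gs://my-bucket/install-deps.sh"]
)
```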
0 votes, 1 answer

How to do GroupByKey on custom logic in Cloud Dataflow

I am trying to achieve GroupByKey based on a custom object in a Cloud Dataflow pipeline. public static void main(String[] args) { Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create()); List>…
Pavan Tiwari
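Stripped of the Beam plumbing, grouping by a custom key is just bucketing records by a key function; a minimal plain-Python sketch of the logic GroupByKey applies (the records and key function are illustrative):

```python
from collections import defaultdict

# Group records into buckets keyed by an arbitrary key function --
# the same idea a GroupByKey transform applies per window.
def group_by(records, key_fn):
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec)
    return dict(groups)

orders = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]
by_user = group_by(orders, lambda r: r["user"])
```

In Beam the equivalent is mapping each element to a `KV` with the custom key (which must have a deterministic coder) and then applying `GroupByKey`.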
0 votes, 0 answers

How to effectively process binary files served from ftp and store results on GCS

I need to download about 2 million gzipped files from an FTP server (not SFTP), process them, and store the results (JPEG images) on Google Cloud Storage. I have considered spinning up a Dataproc cluster, then getting the files from FTP and processing them with Spark. But…
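Whatever fetches the files, the per-file step is a straight gzip decompression; a minimal sketch with Python's standard gzip module, simulating the FTP payload with in-memory bytes:

```python
import gzip
import io

# Decompress one gzipped payload fetched from FTP (simulated here
# with in-memory bytes) before handing it to the image-processing
# stage. The payload content is illustrative.
def decompress(payload: bytes) -> bytes:
    with gzip.open(io.BytesIO(payload), "rb") as fh:
        return fh.read()

raw = b"example image bytes"
compressed = gzip.compress(raw)
restored = decompress(compressed)
```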
0 votes, 0 answers

How to access meta information of a job inside Spark application

I'd like to get notified when a Cloud Dataproc job finishes. Unfortunately, Cloud Dataproc does not seem to provide hooks or any way to observe a job's lifecycle, so I want to implement the mechanism on my own. I'm planning to push to Pub/Sub when a…
yanana
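The poll-then-publish mechanism the question sketches reduces to a small state machine: poll the job state and fire the notification exactly once on a terminal state. A minimal sketch with stand-in `poll_fn` and `publish` callables (a real version would poll the Dataproc jobs API and publish to Pub/Sub):

```python
# Terminal job states; a real poller would take these from the
# Dataproc job state enum.
TERMINAL = {"DONE", "ERROR", "CANCELLED"}

def watch_job(poll_fn, publish):
    """Consume successive states from poll_fn() and notify once."""
    seen = None
    for state in poll_fn():
        if state in TERMINAL and seen is None:
            seen = state
            publish(state)  # e.g. a Pub/Sub publish in production
    return seen

published = []
final = watch_job(lambda: iter(["PENDING", "RUNNING", "DONE"]),
                  published.append)
```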
0 votes, 1 answer

How to read data from Bigtable in Google Cloud Dataproc

I am trying to read data from Bigtable in Google Cloud Dataproc. Below is the code I am using to read data from Bigtable. PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create(); …
0 votes, 2 answers

Google Cloud Spark: one cluster worker remains idle during the whole processing

I am running a job where I combine Wikidata and Wikipedia pageviews, and I am using a small Google cluster of two to three nodes. My problem is that most of the time one node is totally idle, although I have tried to increase the parallelism by…
orestis
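A common cause of one worker sitting idle is key skew or too few partitions: if most records share a key, they all hash to the same partition regardless of the parallelism setting. Counting records per hash partition makes the skew visible; a minimal plain-Python sketch (the hot-key value is illustrative):

```python
# Count how many records land in each hash partition -- the same
# assignment rule a hash partitioner uses.
def partition_sizes(keys, num_partitions):
    sizes = [0] * num_partitions
    for k in keys:
        sizes[hash(k) % num_partitions] += 1
    return sizes

# A single hot key sends every record to one partition, leaving the
# workers that own the other partitions idle.
skewed = partition_sizes(["en"] * 1000, 4)
balanced = partition_sizes(range(1000), 4)
```

If the counts look like `skewed`, salting the hot key or repartitioning on a finer-grained key helps more than raising the parallelism.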
0 votes, 0 answers

Performing operations on RDD [(LongWritable),(JsonObject)]

My task is basically: read data from Google Cloud BigQuery using Spark/Scala, perform some operation (like an update) on the data, and write the data back to BigQuery. So far, I am able to read data from BigQuery using newAPIHadoopRDD(), which returns…
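The middle "update" step amounts to deserializing each JSON row, mutating it, and re-serializing before the write-back; a minimal sketch in plain Python (the field being updated is hypothetical):

```python
import json

# BigQuery rows arrive as JSON strings (the JsonObject half of the
# RDD pair); transform each one and re-serialize for the write-back.
def update_row(row_json: str) -> str:
    row = json.loads(row_json)
    row["processed"] = True  # the hypothetical per-row update
    return json.dumps(row)

rows = ['{"id": 1}', '{"id": 2}']
updated = [update_row(r) for r in rows]
```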
0 votes, 1 answer

Recommender API with dataproc in production

I am currently trying to build a recommender engine for an e-commerce site. I have come across this, which outlines the usage of Dataproc. I also got Prediction.io running, which seems to be a neat project for building such services ... although it is a…
wirtsi
0 votes, 1 answer

How to submit a Google Cloud Dataproc job from an Android app

I need to build an Android app which will be used to trigger the Google Cloud Dataproc API. Thanks in advance
0 votes, 1 answer

Google Hadoop Filesystem Encryption

In normal operation one can provide encryption keys to the Google Cloud Storage API to encrypt a given bucket/blob: https://cloud.google.com/compute/docs/disks/customer-supplied-encryption Is this possible for the output of Spark/Hadoop jobs "on the…
0 votes, 1 answer

Running custom spark build on Dataproc?

Is it possible to compile and build a custom Apache Spark on Google Cloud Dataproc? Let's say we want to tweak Apache Spark and then build and run that custom Spark on Dataproc.
0 votes, 1 answer

spark-shell and sparkR in Google DataProc

I am very new to Google Dataproc. We want to run a set of code via spark-shell or sparkR for testing purposes. Is it possible to connect to the Spark cluster and execute commands in spark-shell or sparkR on Google Dataproc? I checked the docs and it…
sag
0 votes, 1 answer

YARN status from the Dataproc client - why is it always a list object?

We have Spark jobs running on a Dataproc cluster with YARN. We also have a wrapper program in Python that constantly polls the job's status, and we monitor the job state from YARN, as shown below: dataproc =…
Howard Xie
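A defensive way to handle a status field that may arrive wrapped in a one-element list (as the question observes) is to normalize before comparing; a minimal sketch, with the wrapped response shape as an assumption:

```python
# Normalize a status value that an API may return either as a bare
# string or wrapped in a one-element list.
def job_state(status):
    if isinstance(status, list):
        return status[0] if status else None
    return status

assert job_state(["RUNNING"]) == "RUNNING"
assert job_state("RUNNING") == "RUNNING"
```

Normalizing at the edge keeps the polling loop's comparisons (`state == "DONE"` and the like) free of isinstance checks.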
0 votes, 1 answer

google-fluentd : change severity in Cloud Logging log_level

We are running Spark jobs (a lot of Spark Streaming) on Google Cloud Dataproc clusters. We are using Cloud Logging to collect all the logs generated by the Spark jobs. Currently it generates a lot of "INFO" messages, which causes the whole log volume to…
Remis Haroon - رامز
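Alongside any google-fluentd configuration change, log volume can also be cut in-process by raising the logger threshold so INFO records are dropped before they ever reach the collector. A minimal sketch using Python's standard logging (the logger name and handler are illustrative, and this shows only the in-process half, not the fluentd side):

```python
import logging

# Capture emitted level names so the filtering effect is observable.
captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.levelname)

log = logging.getLogger("spark-job")
log.addHandler(ListHandler())
log.setLevel(logging.WARNING)  # INFO and below are discarded

log.info("noisy progress message")     # dropped by the level check
log.warning("something worth keeping") # passes through
```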
0 votes, 0 answers

Is it possible to load data generated by .py script hosted on Google Dataproc to local database?

I'm currently working on a recommender system and trying to find the optimal design solution for this problem. I want to deploy my Python script with the recommender engine to a Spark cluster provided by Google Dataproc. Is it possible to load…
Maria
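Once the job's output has been fetched locally (for example, downloaded from a GCS bucket the job wrote to), loading it into a local database is a plain bulk insert; a minimal sketch with sqlite3 (the schema and rows are illustrative):

```python
import sqlite3

# Bulk-insert recommendation rows fetched from the cluster's output
# into a local database.
def load_rows(conn, rows):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recs (user TEXT, item TEXT, score REAL)"
    )
    conn.executemany("INSERT INTO recs VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real local DB
load_rows(conn, [("u1", "book-42", 0.93), ("u2", "book-7", 0.81)])
count = conn.execute("SELECT COUNT(*) FROM recs").fetchone()[0]
```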