Questions tagged [google-hadoop]

The open-source Apache Hadoop framework can be run on Google Cloud Platform for large-scale data processing, using Google Compute Engine VMs and Persistent Disks and optionally incorporating Google's tools and libraries for integrating Hadoop with other cloud services like Google Cloud Storage and BigQuery.

71 questions
0
votes
1 answer

JobTracker - High memory and native thread usage

We are running Hadoop on GCE with HDFS as the default file system, and data input/output from/to GCS. Hadoop version: 1.2.1. Connector version: com.google.cloud.bigdataoss:gcs-connector:1.3.0-hadoop1. Observed behavior: JT will accumulate threads in waiting…
0
votes
1 answer

Hadoop on Google Compute Engine: how to add external software

I need to set up a Hadoop cluster on Google Compute Engine. While it seems straightforward, either using the web console Click&Deploy or via the command-line tool bdutil, my concern is that my jobs require additional dependencies present on the…
0
votes
1 answer

What causes Flume with a GCS sink to throw an OutOfMemoryException?

I am using Flume to write to Google Cloud Storage. Flume listens on HTTP:9000. It took me some time to make it work (add GCS libraries, use a credentials file...) but now it seems to communicate over the network. I am sending very small HTTP request…
0
votes
1 answer

Failed to copy Hadoop and Java packages to Google Cloud Storage

I am trying to set up a Hadoop cluster on Google Compute Engine, and I have been following these instructions. Everything seemed to work just fine until I ran: ./compute_cluster_for_hadoop.py setup with my project ID…
0
votes
1 answer

What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I observed some very strange behavior. Let me explain. Introduction: I've launched a Hadoop cluster on Google Cloud (one click) set up to use a…
0
votes
1 answer

Google cloud click to deploy hadoop

Why does the Google Cloud click-to-deploy Hadoop workflow require picking a size for the local persistent disk even if you plan to use the Hadoop connector for Cloud Storage? The default size is 500 GB. I was thinking if it does need some disk it should be…
0
votes
1 answer

Spark job seems not to parallelize well

Using Spark 1.1 I have a job that does as follows: (1) Reads a list of folders under a given root and parallelizes the list. (2) For each folder, reads the files under it; these are gzipped files. (3) For each file, extracts the content; these are lines, each line…
Yaniv Donenfeld
0
votes
1 answer

Memory issues when running Spark job on relatively large input

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 GB seems to be available to Spark). I am running on several input folders; I estimate the size of the input to be ~250 GB gz compressed. Although it seems…
Yaniv Donenfeld
0
votes
0 answers

GCS Connector Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

We are trying to run Hive queries on HDP 2.1 using the GCS Connector. It was working fine until yesterday, but since this morning our jobs have randomly started failing. When we restart them manually they work fine. I suspect it's something to do…
0
votes
1 answer

Array in output schema caused exception

I am following this WordCount example using the Google BigQuery-Hadoop connector: https://developers.google.com/hadoop/writing-with-bigquery-connector#completecode The example works fine as it is. To test an array in the output schema, I have altered…
user3709284
0
votes
1 answer

Hadoop cluster on Google cloud platform doesn't start

I'm trying to create a Hadoop cluster in the Google Cloud Platform using the following resources: https://cloud.google.com/solutions/hadoop/ https://github.com/GoogleCloudPlatform/solutions-google-compute-engine-cluster-for-hadoop After setting up…