Questions tagged [google-hadoop]

The open-source Apache Hadoop framework can be run on Google Cloud Platform for large-scale data processing, using Google Compute Engine VMs and Persistent Disks and optionally incorporating Google's tools and libraries for integrating Hadoop with other cloud services like Google Cloud Storage and BigQuery.

71 questions
1 vote · 1 answer

Adding or removing nodes from an existing GCE hadoop/spark cluster with bdutil

I'm getting started with running a Spark cluster on Google Compute Engine, backed by Google Cloud Storage, deployed with bdutil (from the GoogleCloudPlatform GitHub). I am doing this as follows: ./bdutil -e…
Gavin
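For the bdutil workflow in the question above, here is a minimal sketch of a deploy/teardown cycle, not the asker's exact setup: the spark_env.sh extension path comes from the bdutil distribution, while the project, bucket, zone, and worker-count values are placeholders. bdutil does not advertise an in-place resize command, so adding or removing workers generally means redeploying with a different -n.

    # Deploy a Spark-on-GCE cluster with 4 workers (placeholder values).
    ./bdutil -p my-project \
             -b my-config-bucket \
             -z us-central1-a \
             -n 4 \
             -e extensions/spark/spark_env.sh \
             deploy

    # Tear the cluster down; re-run deploy with a different -n to change the size.
    ./bdutil -p my-project -b my-config-bucket -e extensions/spark/spark_env.sh delete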
1 vote · 1 answer

Using ignoreUnknownValues from Hadoop BigQuery Connector

I'm piping unstructured event data through Hadoop and want to land it in BigQuery. I have a schema that includes most of the fields, but there are some fields I want to ignore or don't know about. BigQuery has a configuration field called…
tmandry
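The option the question refers to exists on the BigQuery load API as ignoreUnknownValues. As a point of comparison only (the bq CLI, not the Hadoop connector itself), the same switch is exposed as --ignore_unknown_values; dataset, table, bucket, and schema names here are placeholders.

    # Load newline-delimited JSON from GCS, dropping fields that are not in the schema.
    bq load \
      --source_format=NEWLINE_DELIMITED_JSON \
      --ignore_unknown_values \
      mydataset.events \
      'gs://my-bucket/events/*.json' \
      schema.json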
1 vote · 1 answer

Google Cloud Hadoop Nodes not yet sshable error

I ran the following commands on Cygwin, referring to https://cloud.google.com/hadoop/setting-up-a-hadoop-cluster: gsutil.cmd mb -p [projectname] gs://[bucketname] ./bdutil -p [projectname] -n 2 -b [bucketname] -e hadoop2_env.sh …
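When bdutil reports that nodes are "not yet sshable", one basic check, added here as an assumption rather than something from the question, is whether the VMs it created accept SSH at all. The instance name below follows bdutil's default hadoop prefix and, like the project and zone, is a placeholder.

    # List the instances bdutil created and try to reach the master directly.
    gcloud compute instances list --project my-project
    gcloud compute ssh hadoop-m --project my-project --zone us-central1-a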
1 vote · 1 answer

Hadoop on Google Compute Engine

I am trying to set up a Hadoop cluster in Google Compute Engine through the "Launch click-to-deploy software" feature. I have created 1 master and 1 slave node and tried to start the cluster using the start-all.sh script from the master node, and I got an error…
1 vote · 1 answer

Strange errors when running a Spark job

I am running a Spark cluster with 80 machines. Each machine is a VM with 8 cores and 50 GB of memory (41 seems to be available to Spark). I am running on several input folders, and I estimate the size of the input to be ~250 GB gz-compressed. I get errors in the…
Yaniv Donenfeld
1 vote · 1 answer

Fail to run Spark job when using globStatus and Google Cloud Storage bucket as input

I am using Spark 1.1. I have a Spark job that looks for a certain pattern of folders only under a bucket (i.e. folders that start with...) and should process only those. I achieve this by doing the following: FileSystem fs = FileSystem.get(new…
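A quick way to check the same glob outside the job, added here as a sketch with a placeholder bucket and prefix, is to run it through the connector from the command line; if this listing works but globStatus in the job does not, the problem is more likely in how the Path is built than in the connector itself.

    # Glob over a GCS bucket through the Hadoop GCS connector.
    hadoop fs -ls 'gs://my-bucket/folders-that-start-with*'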
1 vote · 1 answer

Issues with the Google Cloud Storage connector on Spark

I am trying to install the Google Cloud Storage connector for Spark on Mac OS to do local testing of my Spark app. I have read the following document (https://cloud.google.com/hadoop/google-cloud-storage-connector). I have added…
poiuytrez
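For local testing as described above, a hedged sketch of wiring the connector into spark-shell: the fs.gs.* and google.cloud.auth.* properties are the ones documented for recent connector releases (older releases used a p12 keyfile property instead), and the jar path, project id, and key file locations are placeholders.

    # Start a local spark-shell with the GCS connector on the classpath (placeholder paths).
    spark-shell \
      --jars /path/to/gcs-connector-hadoop2-latest.jar \
      --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
      --conf spark.hadoop.fs.gs.project.id=my-project \
      --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
      --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json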
1 vote · 1 answer

Maintaining persistent HDFS in Google Cloud

I am having my students use bdutil to create a Google Compute Engine cluster with persistent disks and HDFS as the default filesystem. We want to have persistent disks so that the students can work on projects over a period of weeks. However, HDFS…
1 vote · 3 answers

Unable to SSH into VM causing problems with Hadoop install using bdutil

I have been through most of the questions surrounding this issue on this site, but nothing seems to have helped me. Basically, what I am trying to do is instantiate a Hadoop instance on my VM via the bdutil script supplied by Google, however the…
0 votes · 1 answer

Hive external table location in google cloud storage is ignoring subdirectories

I have a bunch of large csv.gz files in Google Cloud Storage that we got from an external source. We need to bring these into BigQuery so we can start querying, but BigQuery cannot directly ingest gzipped CSV files larger than 4 GB. So, I decided to…
jatinw21
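Two settings commonly used to make Hive read files in subdirectories of an external table's LOCATION, offered as a sketch under the assumption that this matches the layout in the question; the table name is a placeholder.

    # Enable recursive input listing, then query the external table.
    hive -e "
      SET mapreduce.input.fileinputformat.input.dir.recursive=true;
      SET hive.mapred.supports.subdirectories=true;
      SELECT COUNT(*) FROM my_external_table;
    "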
0 votes · 1 answer

Google BigQuery Spark Connector: How to ignore unknown values on append

We use the Google BigQuery Spark Connector to import data stored in Parquet files into BigQuery. Using custom tooling we generated a schema file needed by BigQuery and reference that in our import code (Scala). However, our data doesn't really…
0 votes · 1 answer

Google Hadoop Filesystem Encryption

In normal operation, one can provide encryption keys to the Google Storage API to encrypt a given bucket/blob: https://cloud.google.com/compute/docs/disks/customer-supplied-encryption Is this possible for the output of Spark/Hadoop jobs "on the…
0 votes · 1 answer

(bdutil) Unable to get hadoop/spark cluster working with a fresh install

I'm setting up a tiny cluster in GCE to play around with, but although the instances are created, some failures prevent it from working. I'm following the steps in https://cloud.google.com/hadoop/downloads So far I'm using (as of now) the latest…
0 votes · 1 answer

Google Cloud connector for Hadoop doesn't work with Pig

I'm using Hadoop with HDFS 2.7.1.2.4 and Pig 0.15.0.2.4 (Hortonworks HDP 2.4) and trying to use the Google Cloud Storage Connector for Spark and Hadoop (bigdata-interop on GitHub). It works correctly when I try, say, hadoop fs -ls gs://bucket-name, but…
sckol
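One documented difference that can affect tools layered on Hadoop, offered here as an assumption about the Pig case rather than a confirmed fix: applications that go through the FileContext API need the AbstractFileSystem binding for gs:// in addition to fs.gs.impl. Both properties go inside the <configuration> element of core-site.xml.

    <!-- GCS connector bindings for the gs:// scheme (core-site.xml fragment) -->
    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    </property>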
0 votes · 1 answer

Never successfully built a large hadoop&spark cluster

I was wondering if anybody could help me with this issue in deploying a Spark cluster using the bdutil tool. When the total number of cores increases (>= 1024), it fails every time for the following reasons: some machines are never SSHable, like…
Parthus