Questions tagged [google-hadoop]

The open-source Apache Hadoop framework can be run on Google Cloud Platform for large-scale data processing, using Google Compute Engine VMs and Persistent Disks and optionally incorporating Google's tools and libraries for integrating Hadoop with other cloud services like Google Cloud Storage and BigQuery.

71 questions
2 votes, 2 answers

Where is the source of datastore-connector-latest.jar? Can I add it as a Maven dependency?

I got the connectors from https://cloud.google.com/hadoop/datastore-connector but I'm trying to add the datastore-connector (and the bigquery-connector too) as a dependency in the pom... I don't know if this is possible. I could not find the right…
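
A hedged sketch of what such a dependency could look like: the GCS and BigQuery connectors were published to Maven Central under the com.google.cloud.bigdataoss group, so the BigQuery side can be declared roughly as below (the version is a placeholder; whether the Datastore connector was ever published there is not confirmed here):

    <!-- pom.xml sketch; version is a placeholder, check Maven Central -->
    <dependency>
      <groupId>com.google.cloud.bigdataoss</groupId>
      <artifactId>bigquery-connector</artifactId>
      <version>hadoop2-0.13.4</version>
    </dependency>
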
2 votes, 1 answer

NullPointerException running a Spark job

I am running a job on Spark in standalone mode, version 1.2.0. The first operation I am doing is taking an RDD of folder paths and generating an RDD of file names, composed of the files residing in each folder: JavaRDD<String> filePaths =…
Yaniv Donenfeld
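
A minimal sketch of that expansion in the Spark 1.x Java API, assuming the folders live on a Hadoop-visible file system. A common source of a NullPointerException here is capturing a driver-side FileSystem or Configuration in the closure; creating them inside the function avoids that:

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // folderPaths is an existing JavaRDD<String> of folder URIs.
    JavaRDD<String> filePaths = folderPaths.flatMap(
        new FlatMapFunction<String, String>() {
          @Override
          public Iterable<String> call(String folder) throws Exception {
            // Build the FileSystem handle on the worker, not the driver.
            FileSystem fs = FileSystem.get(URI.create(folder), new Configuration());
            List<String> files = new ArrayList<String>();
            for (FileStatus status : fs.listStatus(new Path(folder))) {
              if (!status.isDirectory()) {
                files.add(status.getPath().toString());
              }
            }
            return files;
          }
        });
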
2 votes, 1 answer

Spark - "too many open files" in shuffle

Using Spark 1.1, I have two datasets. One is very large and the other was reduced (using roughly 1:100 filtering) to a much smaller scale. I need to reduce the large dataset to the same scale by joining only those items from the smaller list with their…
Yaniv Donenfeld
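
Two standard Spark 1.x mitigations for this error, offered as assumptions rather than the accepted fix: raise the worker OS open-file limit (ulimit -n), and cut the number of intermediate shuffle files via configuration:

    import org.apache.spark.SparkConf;

    // Hedged sketch: fewer shuffle files under the Spark 1.x hash shuffle,
    // or switch to the sort-based shuffle manager introduced in Spark 1.1.
    SparkConf conf = new SparkConf()
        .set("spark.shuffle.consolidateFiles", "true")
        .set("spark.shuffle.manager", "sort");
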
2 votes, 1 answer

Getting 'sudo: unknown user: hadoop' and 'sudo: unable to initialize policy plugin' errors on Google Cloud Platform while running a Hadoop cluster

I am trying to deploy the sample Hadoop app provided by Google at https://github.com/GoogleCloudPlatform/solutions-google-compute-engine-cluster-for-hadoop on Google Cloud Platform. I followed all the setup instructions given there step-by-step. I…
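
A hedged guess at the usual cause: the 'hadoop' user was never created on the VM, so any 'sudo -u hadoop ...' step in the setup scripts fails. Creating the user by hand (the home directory path is an assumption) is one way to test that theory:

    # Hypothetical fix: create the missing user, then re-run the setup step.
    sudo useradd -m -d /home/hadoop hadoop
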
2 votes, 1 answer

Hadoop 2.4.1 and Google Cloud Storage connector for Hadoop

I am trying to run Oryx on top of Hadoop using Google's Cloud Storage connector for Hadoop (https://cloud.google.com/hadoop/google-cloud-storage-connector). I prefer to use Hadoop 2.4.1 with Oryx, so I use the hadoop2_env.sh setup for the hadoop…
Rich
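
For reference, wiring the connector into a plain Hadoop 2.x install usually comes down to putting the hadoop2 flavor of the connector jar on the classpath and declaring the gs:// scheme in core-site.xml. A sketch with the standard connector property names (the project id is a placeholder):

    <!-- core-site.xml sketch; values are placeholders -->
    <property>
      <name>fs.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.gs.impl</name>
      <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    </property>
    <property>
      <name>fs.gs.project.id</name>
      <value>my-project-id</value>
    </property>
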
2 votes, 1 answer

How to enable the Snappy codec on a Hadoop cluster on Google Compute Engine

I am trying to run a Hadoop job on Google Compute Engine against our compressed data, which is sitting in Google Cloud Storage. While trying to read the data through SequenceFileInputFormat, I get the following…
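
Enabling Snappy generally requires the native libraries (libsnappy and Hadoop's libhadoop.so) on every node plus the codec registered in core-site.xml; a sketch of the registration step using the stock Hadoop codec class (nothing here is GCE-specific):

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
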
1 vote, 2 answers

GCS - Global Consistency with delete + rename

My issue may be a result of my misunderstanding of global consistency in Google Cloud Storage, but since I did not experience this issue until just recently (mid November), and it now seems easily reproducible, I wanted some clarification. The issue…
lukeforehand
1 vote, 1 answer

GoogleHadoopFileSystemBase.setTimes() not working

I have a reference to the GoogleHadoopFileSystemBase in my Java code, and I’m trying to call setTimes(Path p, long mtime, long atime) to modify the timestamp of a file. It doesn’t seem to be working though, even though other FileSystem APIs work…
Alvin C
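
A minimal way to observe the behavior, hedged: in GCS connector releases of that era setTimes() was effectively a no-op, so an unchanged mtime may be working-as-implemented rather than a bug. The bucket and object names below are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), new Configuration());
    Path p = new Path("gs://my-bucket/some/file");   // hypothetical object
    long before = fs.getFileStatus(p).getModificationTime();
    fs.setTimes(p, System.currentTimeMillis(), -1);  // -1 leaves atime unchanged
    long after = fs.getFileStatus(p).getModificationTime();
    System.out.println("mtime before=" + before + ", after=" + after);
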
1 vote, 1 answer

Spark - Can't read files from Google Cloud Storage when configuring gcs connector manually

I have a Spark cluster deployed using bdutil on Google Cloud. I installed a GUI on my driver instance to be able to run IntelliJ from it, so that I can try to run my Spark processes in interactive mode. The first issue I faced was that the…
Gouffe
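
A hedged sketch of registering the connector on a manually built context. The fs.gs.* names are the standard connector properties; the google.cloud.auth.* keys are from later connector releases (older ones used fs.gs.auth.* keys instead), and the project id and keyfile path are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf sparkConf = new SparkConf().setAppName("gcs-test");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hconf = sc.hadoopConfiguration();
    hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
    hconf.set("fs.gs.project.id", "my-project-id");  // placeholder
    hconf.set("google.cloud.auth.service.account.enable", "true");
    hconf.set("google.cloud.auth.service.account.json.keyfile",
              "/path/to/key.json");                  // placeholder
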
1 vote, 0 answers

Loading data into a Google Datastore kind from local HDFS (local machine) using the google-datastore-connector for Hadoop?

I have used the google-cloud-storage-connector for Hadoop and was able to run a MapReduce job that takes input from my local HDFS (Hadoop running on my local machine) and places the result in a Google Cloud Storage bucket. Now I want to run a MapReduce job…
1 vote, 0 answers

Want help running MapReduce programs on Google Cloud Storage

I am using Google Cloud Storage with Hadoop 2.3.0 via the GCS connector. I have added GCS.jar to the lib directory of my Hadoop installation and added the path to the GCS connector in the hadoop-env.sh file as: export…
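
The usual wiring, sketched with a placeholder path rather than the asker's truncated export line: the connector jar goes on Hadoop's classpath in hadoop-env.sh, and the gs:// scheme is declared in core-site.xml as in the Hadoop 2.4.1 entry above:

    # hadoop-env.sh sketch; the jar path is a placeholder
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/path/to/gcs-connector-hadoop2.jar
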
1 vote, 1 answer

Connect a Hadoop cluster to multiple Google Cloud Storage buckets in multiple Google projects

Is it possible to connect my Hadoop cluster to multiple Google Cloud projects at once? I can easily use any Google Cloud Storage bucket in a single Google project via the Google Cloud Storage connector, as explained in this thread: Migrating 50TB data from local…
1 vote, 2 answers

Google Compute Engine: libsnappy not installed error during command-line installation of Hadoop

I'm trying to install a custom Hadoop implementation (>2.0) on Google Compute Engine using the command line option. The modified parameters of my bdutil_env.sh file are as…
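
A hedged guess: this error usually means the native Snappy library is absent from the VM images, so installing it on each node before the Hadoop build step may resolve it (Debian-style package names assumed):

    # Debian/Ubuntu package names assumed; adjust for the actual image
    sudo apt-get update
    sudo apt-get install -y libsnappy1 libsnappy-dev
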
1 vote, 1 answer

What is the number of reducer slots on GCE Hadoop worker nodes?

I am testing the scaling of some MapReduce jobs on Google Compute Engine's Hadoop cluster, and I am finding some unexpected results. In short, I've been told this behavior may be explained by having multiple reducer slots per worker…
Rich
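
On classic MR1 clusters the per-node slot counts are TaskTracker settings, so the answer is a function of mapred-site.xml; a sketch of the relevant properties (the values shown are illustrative, not bdutil's actual defaults):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
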
1 vote, 1 answer

Hive queries of external tables stored on Google Cloud Storage extremely slow

I have begun testing the Google Cloud Storage connector for Hadoop. I am finding it incredibly slow for Hive queries run against it. It seems a single client must scan the entire file system before starting the job; with tens of thousands of files this takes…
Sean