I am trying to get Oryx up and running on Google Compute Engine. I created a new instance and installed Oryx via:
git clone https://github.com/cloudera/oryx.git
cd oryx
mvn -DskipTests install
and saved this install as an image on Google Compute Engine ("oryx-image").
Because I ran into issues using Oryx with Google Cloud Storage (Hadoop 2.4.1 with the Google Cloud Storage connector for Hadoop), I have been using hdfs:// as the default file system.
I also ran into issues with the default Hadoop package launched on Google Compute Engine (e.g., no Snappy libraries, which the default Oryx configuration needs), so I built my own Hadoop 2.4.1 tarball with Snappy included, following these instructions: How to enable Snappy/Snappy Codec over hadoop cluster for Google Compute Engine. (Side note: is the JDK version described there sufficient for Oryx?) I then used my saved image with Oryx installed ("oryx-image"):
./bdutil --bucket <some-bucket> --image oryx-image -n $number \
--env_var_files hadoop2_env.sh --default_fs hdfs
and my saved Hadoop tarball:
# File: hadoop2_env.sh
HADOOP_TARBALL_URI="gs://<some-bucket>/hadoop-2.4.1.tar.gz"
to deploy a Hadoop 2.4.1 (with Snappy) cluster (with default file system = hdfs://) on Google Compute Engine. Still no luck.
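For anyone trying to reproduce this, here is a minimal sketch of the checks that could be run on the master node to record what the deployed cluster actually provides. The `report` helper is a hypothetical name introduced only for this example. The commands it wraps (`java -version`, `hadoop version`, `hadoop checknative -a`, `hdfs getconf -confKey fs.defaultFS`) are standard JDK and Hadoop 2.x commands, but which output Oryx needs is not something this post establishes.

```shell
#!/bin/sh
# Sketch: collect environment facts on the master node before running Oryx.
# A check is skipped (not failed) when its tool is not on PATH.

report() {
  # report <label> <command> [args...]: run the command if its binary exists.
  label="$1"
  shift
  if command -v "$1" >/dev/null 2>&1; then
    echo "== ${label}"
    "$@" 2>&1 || echo "(command exited non-zero)"
  else
    echo "== ${label}: '$1' not found on PATH"
  fi
}

report "Java version" java -version
report "Hadoop version" hadoop version
# With -a, checknative exits non-zero if any native library is missing.
report "Native libraries (look for the snappy line)" hadoop checknative -a
report "Default file system" hdfs getconf -confKey fs.defaultFS
```

Posting this output alongside the failing Oryx log would make it easier to tell a Snappy or file system problem from an Oryx configuration problem.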
I can successfully run test Hadoop jobs on GCE, test Snappy implementations on GCE (see second link), and test Oryx jobs on GCE locally from the master node:
# File: oryx.conf
model.local-data = true
model.local-computation = true
The only issue is getting Oryx to successfully run on Google Compute Engine with data in either hdfs:// or gs://.
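To make the failing case concrete, here is a rough sketch of the distributed-mode settings being attempted. Everything in it beyond the two `model.local-*` keys shown above is an assumption: the `model.instance-dir` key is my reading of the Oryx README, the instance directory and `HADOOP_CONF_DIR` paths are placeholders, and `write_oryx_conf` is a hypothetical helper. Only the keys that differ from the local run are written, so the remaining settings from the working local oryx.conf would still need to be carried over.

```shell
#!/bin/sh
# Sketch: write the distributed-mode part of oryx.conf.
# ASSUMPTIONS: the model.instance-dir key and all paths below are not taken
# from the post; verify them against the Oryx README before relying on them.

write_oryx_conf() {
  # write_oryx_conf <instance-dir> <output-file>
  instance_dir="$1"
  out_file="$2"
  # The URI is quoted because ':' and '//' are special in the config syntax.
  cat > "$out_file" <<EOF
# File: oryx.conf (distributed mode, fragment)
model.instance-dir = "$instance_dir"
model.local-data = false
model.local-computation = false
EOF
}

# Hypothetical instance directory on HDFS; substitute your own path.
write_oryx_conf "hdfs:///user/oryx/example" "./oryx-distributed.conf"
cat ./oryx-distributed.conf

# Launch step, commented out because the paths are placeholders:
# export HADOOP_CONF_DIR=/path/to/hadoop/conf
# java -Dconfig.file=./oryx-distributed.conf -jar computation/target/oryx-computation-<version>.jar
```

Swapping the `hdfs://` URI for a `gs://` one would be the equivalent test against Google Cloud Storage.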
I have found many conflicting instructions for environment variable changes and similar tweaks, and I don't know which ones are necessary and which ones may be causing more problems. Is there documentation on installing and running Oryx on GCE? Perhaps someone has already gone through the same process and can offer instructions, or at least confirm a successful install?
The instructions (found in the second link) for installing Hadoop 2.4.1 with Snappy on GCE were superb. I was hoping to find something with that level of detail covering all the steps necessary to make Oryx work on GCE from scratch.
Thanks!