
I am trying to get Oryx up and running on Google Compute Engine. I created a new instance and installed Oryx via:

git clone https://github.com/cloudera/oryx.git
cd oryx
mvn -DskipTests install

and saved this install as an image on Google Compute Engine ("oryx-image").

Finding issues with Oryx and the Google file system (Hadoop 2.4.1 with the Google Cloud Storage connector for Hadoop), I have been using hdfs:// as the default file system.
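For reference, forcing HDFS as the default file system boils down to a core-site.xml entry roughly like the one below (bdutil's `--default_fs hdfs` flag generates the equivalent; the hostname here is just a placeholder):

```xml
<!-- core-site.xml: make HDFS the default file system instead of gs://.
     "hadoop-m" is a placeholder for the master node's hostname. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop-m:8020</value>
</property>
```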

Finding issues with the default Hadoop package launched on Google Compute Engine (e.g., no Snappy libraries, which the default Oryx configuration needs), I have also tried creating my own Hadoop 2.4.1 tarball with Snappy included, following these instructions: How to enable Snappy/Snappy Codec over hadoop cluster for Google Compute Engine (side note: is the JDK version described there sufficient for Oryx?). I have then used my saved image with Oryx installed ("oryx-image"):

./bdutil --bucket <some-bucket> --image oryx-image -n $number \
    --env_var_files hadoop2_env.sh --default_fs hdfs

and my saved Hadoop tarball:

# File: hadoop2_env.sh
HADOOP_TARBALL_URI="gs://<some-bucket>/hadoop-2.4.1.tar.gz"

to deploy a Hadoop 2.4.1 (with Snappy) cluster (with default file system = hdfs://) on Google Compute Engine. Still no luck.

I can successfully run test Hadoop jobs on GCE, test Snappy implementations on GCE (see second link), and test Oryx jobs on GCE locally from the master node:

# File: oryx.conf
model.local-data = true
model.local-computation = true  

The only issue is getting Oryx to successfully run on Google Compute Engine with data in either hdfs:// or gs://.
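For the distributed runs, the relevant oryx.conf settings look roughly like this (a sketch; the instance directory path is a placeholder taken from my setup, and `model.instance-dir` is the key Oryx uses to locate data on the cluster):

```
# File: oryx.conf -- distributed attempt (sketch; path is a placeholder)
model.instance-dir = "hdfs:///user/rich/00000"
model.local-data = false
model.local-computation = false
```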

I have found many varying instructions for environment variable changes, etc., and I don't know which ones are necessary and which ones may be leading to more problems. I was wondering if there is documentation on installing/running Oryx on GCE. Perhaps someone has gone through the same process already and can offer instructions and/or at least confirm a successful install?

The instructions (found in the second link) for installing Hadoop 2.4.1 with Snappy on GCE were superb. I was hoping to find something with that level of detail covering all the steps necessary to make Oryx work on GCE from scratch.

Thanks!

Rich

2 Answers


I don't know if this is a direct answer, but I can comment on a few points. I think a lot of the issues here come down to getting a standard Hadoop installation up and running on GCE.

I have never run it on GCE, but it shouldn't matter whether it runs on bare metal, GCE, or EC2; it just uses Hadoop. Yes, it does assume Hadoop though, and HDFS. (I think the hard-coded hdfs:// could be removed, sure; I don't know if that would make it work with non-HDFS file systems.) So if GCE has a different file system by default, yes, your best bet is to use HDFS.

I suppose I think of Snappy as a required part of a Hadoop installation. If you're installing Hadoop by hand, yes I think you have to take a few more steps. This is why I'd recommend a (free, open source) distro that takes care of this for you.

It should also set up things like HADOOP_CONF_DIR for you, which I also tend to think of as a required part of a Hadoop setup in general, at least on the client side.
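For example, pointing a client machine at the cluster is typically just an environment variable (a sketch; `/etc/hadoop/conf` is an assumption and varies by distro, e.g. CDH uses that path):

```shell
# Point Hadoop clients (including the Oryx binary) at the cluster's
# config directory. /etc/hadoop/conf is an assumption -- use wherever
# your installation keeps core-site.xml, hdfs-site.xml, etc.
export HADOOP_CONF_DIR=/etc/hadoop/conf
echo "$HADOOP_CONF_DIR"
```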

Any version of Java 6 or later is fine.

Is it possible to try a distro? It may be much less pain. I'm sorry I don't have further instructions here, but it seems like a GCE<->Hadoop issue more than a Hadoop<->Oryx one. If the app can change in ways to accommodate GCE better, I can do that.

Sean Owen
  • Thanks for the input! I installed CDH5 in pseudo-distributed mode on a single GCE VM and had success. So the GCE<->Hadoop diagnosis is probably correct. The problem I am facing is getting GCE to spin up a cluster with an adequate version of Hadoop to work with Oryx. Are you aware of anyone in the oryx-user community having success with this? Of course modification to Oryx to work with the default version of Hadoop 2.4.1 provided by GCE would solve my problem as well! – Rich Oct 20 '14 at 21:12
  • What problem do you see? It works with about any reasonable Hadoop version, although, as with any Hadoop app, a different build is needed for Hadoop 1.x-ish versions vs Hadoop 2.x-ish versions. You may have the wrong build. I use it with Hadoop 2.5.1 at the moment. – Sean Owen Oct 20 '14 at 22:27
  • Current errors begin with: Oryx-/user/rich-0-BuildTreesStep: Text(hdfs://total-cdh-m:8020/user/rich/00000/inbound)+dis... ID=1 (1/1)(1): Job failed!... com.cloudera.oryx.computation.common.JobException: Oryx-/user/rich-0-BuildTreesStep failed in state FAILED. **A fundamental question:** Google deploys a cluster with a master node and worker nodes, with Hadoop installed. If I run an oryx job from master node, must oryx be installed on **all** nodes or just master? i.e., I know oryx runs on top of distributed hadoop system, but does oryx installation need to be distributed across cluster? – Rich Oct 22 '14 at 16:09
  • Oryx is just a Java binary. It doesn't install anywhere. You just run it. It can be on any machine that has Hadoop config that points to the cluster's resources. It doesn't even need to be part of the cluster. The error just says "something failed" and isn't the underlying cause. – Sean Owen Oct 22 '14 at 21:40

I found a not-so-elegant "solution" to this problem. The standard-issue Hadoop 2.4.1 provided by Google Compute Engine did actually have the Snappy libraries; they just weren't in the "right" place. So I copied all of the Snappy library files from their default location (/usr/lib/) to the Java library directory. Obviously only one of these lines is needed, but I haven't taken the time to discover which one is the right one:

sudo cp /usr/lib/lib* /usr/local/lib
sudo cp /usr/lib/lib* /usr/java/jdk1.7.0_55/lib/amd64/jli
sudo cp /usr/lib/lib* /usr/java/jdk1.7.0_55/lib/amd64
sudo cp /usr/lib/lib* /usr/java/jdk1.7.0_55/lib

And of course this isn't so much a solution as a workaround. I suppose adding the Snappy library directory to the correct path would work too.
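That path-based alternative would look roughly like this (a sketch, e.g. added to hadoop-env.sh; the /usr/lib location is taken from where the libraries sat on my instances):

```shell
# Alternative to copying files: let the JVM find libsnappy where it
# already lives. /usr/lib is where GCE's Hadoop image kept the Snappy
# libraries in my case -- adjust if yours differ.
export LD_LIBRARY_PATH="/usr/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export JAVA_LIBRARY_PATH="/usr/lib${JAVA_LIBRARY_PATH:+:$JAVA_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```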

Rich