
I want to get a Samza job running on a remote system, with the Samza job itself stored on HDFS. The example (https://samza.apache.org/startup/hello-samza/0.7.0/) for running a Samza job on a local machine involves building a tar file, then unzipping the tar file, then running a shell script that's located within the tar file.
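
For reference, the local workflow in that tutorial boils down to roughly the following (the artifact and config names are the ones from the hello-samza 0.7.0 example, so treat them as illustrative and adjust for your own job):

mvn clean package
mkdir -p deploy/samza
tar -xvf ./target/hello-samza-0.7.0-dist.tar.gz -C deploy/samza
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties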

The HDFS example (https://samza.apache.org/learn/tutorials/0.7.0/deploy-samza-job-from-hdfs.html), on the other hand, is not really documented at all. It just says to copy the tar file to HDFS, then to follow the other steps from the non-HDFS example.
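
Copying the tar file up to HDFS is the easy part; presumably it is just something along the lines of the following (the paths here are only examples):

hadoop fs -mkdir -p /deploy/samza
hadoop fs -put ./target/hello-samza-0.7.0-dist.tar.gz /deploy/samza/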

That would imply that the tar file now residing on HDFS needs to be untarred within HDFS, and then a shell script run on the unpacked contents. But you can't untar a tar file on HDFS with the hadoop fs shell...

Without untarring the tar file, you don't have access to run-job.sh to initiate the Samza job.

Has anyone managed to get this to work please?

John

1 Answer


We deploy our Samza jobs this way: we have the Hadoop libraries in /opt/hadoop, the Samza shell scripts in /opt/samza/bin, and the Samza config files in /opt/samza/config. In each config file there is a line like this:

yarn.package.path=hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz

When we want to deploy a new version of our Samza job, we just create the tgz archive, copy it (without untarring) to /deploy/samza/ on HDFS, and run:

/opt/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///opt/samza/config/$CONFIG_NAME.properties
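
Pulled together, a deploy looks roughly like this (my-job stands in for the real config name; -f simply overwrites the previous archive on HDFS):

hadoop fs -put -f samzajobs-dist.tgz hdfs://hadoop1:8020/deploy/samza/samzajobs-dist.tgz
/opt/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file:///opt/samza/config/my-job.properties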

The only downside is that the config files inside the archive are ignored; if you change the config in the archive, it does not take effect. You have to change the config files in /opt/samza/config. On the other hand, this lets us change the config of our Samza job without deploying a new tgz archive. The shell scripts under /opt/samza/bin remain the same across builds, so you never need to untar the archive just to get at the scripts.

Good luck with Samzing! :-)

Lukáš Havrlant
  • Perfect, thank you very much. Does the machine you run the job from (by calling `run-job.sh`) need to have resource manager running? – John Nov 01 '15 at 15:00
  • We run the Resource Manager on the same machine, but I am not sure if it is necessary. It's more of a Hadoop question and I don't know Hadoop that well. Sorry :-(. It should be easy to test, though :). – Lukáš Havrlant Nov 02 '15 at 20:17
  • OK, I seem to be getting errors when I run it from a non-RM machine. Are you using ResourceManager in HA? If so, what are you setting `yarn.resourcemanager.hostname` in `yarn-site.xml` to be? – John Nov 03 '15 at 11:26
  • Yes, we are. We usually run everything on private IPs, i.e. 10.*.*.* or 192.*.*.*. More specifically, we use hostnames: our YARN uses `hadoop1` as the hostname, and the hostname-to-IP translation is in the `/etc/hosts` files on all servers. – Lukáš Havrlant Nov 03 '15 at 20:05