I am using Pentaho 5.4 and EMR 3.4

When I execute a transformation in Pentaho to copy data from an Oracle DB to HDFS on EMR, I don't see any MR jobs in the ResourceManager of the Hadoop (EMR) cluster.

Am I supposed to see them as MR jobs, or does Pentaho just copy the data without creating any MR jobs?

When will Pentaho use MapReduce to process data?

hadooper

1 Answer

Not sure if you've figured this out already, but you will need to use the Pentaho MapReduce job entry in your KJB: Pentaho MapReduce

You can then define Mapper, Combiner, and Reducer transformations, as well as a Named Cluster (XML) configuration in which you specify the JobTracker host, port, etc. Pentaho copies its engine onto each node in your cluster (by default under /opt/pentaho/) and submits the job as the user you specify in Spoon; you will then see the job in the job history.
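For reference, a Named Cluster definition is stored as XML in the Pentaho metastore. A minimal sketch of the fields involved might look like the following; the element names here mirror what you fill in on the Hadoop cluster dialog in Spoon rather than Pentaho's exact metastore schema, and the hosts are placeholders for a typical EMR (YARN) setup:

    <!-- Illustrative sketch only: element names mirror the Spoon dialog
         fields, not Pentaho's exact metastore schema. Hosts are placeholders. -->
    <namedCluster>
      <name>emr-cluster</name>
      <!-- NameNode host/port (8020 is the default HDFS RPC port) -->
      <hdfsHost>ip-10-0-0-1.ec2.internal</hdfsHost>
      <hdfsPort>8020</hdfsPort>
      <!-- On YARN the "JobTracker" fields point at the ResourceManager;
           8032 is its default RPC port -->
      <jobTrackerHost>ip-10-0-0-1.ec2.internal</jobTrackerHost>
      <jobTrackerPort>8032</jobTrackerPort>
    </namedCluster>

Once the Pentaho MapReduce entry is pointed at that cluster, the submitted job shows up in the ResourceManager web UI (port 8088 by default) like any other YARN application.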

In your scenario it sounds like you're using a DB connection (e.g., a Table Input step) plus a separate step such as Hadoop File Output to ingest into HDFS. That path runs entirely inside the Pentaho engine, so no MapReduce jobs are submitted to the cluster, which is why nothing appears in the ResourceManager.

jastang