
I have three coordinators A, B and C.

The coordinators of B and C depend on the output of A. That is, when the output of A is ready, the coordinators of B and C should run.

So I use an input-event to express this dependency.

The structure of coordinators B and C looks like this:

<coordinator-app name="B" frequency="1440" start=${start} end=${end} timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
   <datasets>
      <dataset name="input1" frequency="1440" initial-instance=${start} timezone="UTC">
         <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}</uri-template>
      </dataset>
   </datasets>
   <input-events>
      <data-in name="coordInput1" dataset="input1">
          <instance>${coord:current(0)}</instance>
      </data-in>
   </input-events>
   <action>
      <workflow>
         <app-path>hdfs://localhost:9000/B/workflows</app-path>
      </workflow>
   </action>     
</coordinator-app>

So, once hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/_SUCCESS is created, coordinators B and C should be triggered to run their workflows.

The coordinator of A looks like this:

<coordinator-app name="B" frequency="1440" start=${start} end=${end} timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
   <action>
      <workflow>
         <app-path>hdfs://localhost:9000/A/workflows</app-path>
      </workflow>
   </action>
</coordinator-app>

Its ${start} and ${end} are the same as those of B and C.

The workflow of A creates hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/_SUCCESS.
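
(One way to write that flag explicitly, in case the job does not emit it on its own, is an fs action at the end of A's workflow. This is only a sketch: the outputDir parameter name is hypothetical, and touchz requires workflow schema 0.4 or later.)

<action name="write-done-flag">
   <fs>
      <!-- touchz creates an empty file; outputDir must hold the resolved target directory -->
      <touchz path="${outputDir}/_SUCCESS"/>
   </fs>
   <ok to="end"/>
   <error to="fail"/>
</action>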

However, the coordinators of B and C keep waiting for hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}/_SUCCESS.

Even when I add an output-event to the coordinator of A, the workflows of B and C are still waiting for the input dataset:

<coordinator-app name="A" frequency="1440" start=${start} end=${end} timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <datasets>
        <dataset name="output1" frequency="1440" initial-instance=${start} timezone="UTC">
        <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>

    <output-events>
        <data-out name="coordOutput1" dataset="output1">
            <instance>${coord:current(0)}</instance>
        </data-out>
    </output-events>
   <action>
      <workflow>
         <app-path>hdfs://localhost:9000/A/workflows</app-path>
      </workflow>
   </action>
</coordinator-app>
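
(As far as I understand, an output-event only declares the dataset instance; it does not trigger anything by itself. To actually hand the resolved path to the workflow of A, something like the following would be needed inside the coordinator's action; the property name outputDir is hypothetical.)

   <action>
      <workflow>
         <app-path>hdfs://localhost:9000/A/workflows</app-path>
         <configuration>
            <property>
               <name>outputDir</name>
               <!-- coord:dataOut resolves the URI of the declared output-event instance -->
               <value>${coord:dataOut('coordOutput1')}</value>
            </property>
         </configuration>
      </workflow>
   </action>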

However, if I submit the workflow of A directly, without its coordinator, the workflows of B and C are triggered as expected.
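
(By "without its coordinator" I mean a direct submission like the one below; the Oozie server URL shown is just the default and an example.)

# job.properties: oozie.wf.application.path selects the workflow itself,
# whereas oozie.coord.application.path would select the coordinator
oozie.wf.application.path=hdfs://localhost:9000/A/workflows

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run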

I am not sure if something is missing in my coordinator of A.

Thank you!

alec.tu
  • **Q1.** In your code sample, Coordinator A is named "B". Is that a copy/paste typo? **Q2.** Why don't you just trigger two sub-workflows in parallel (*B* and *C*) when *A* has completed? Or is it because *A* may want to restart before *B* and/or *C* are done? – Samson Scharfrichter Sep 04 '15 at 16:51
  • Yes! It's a typo. Is it possible to trigger workflows directly from one workflow and run them in parallel (B and C run in parallel) with Oozie? I am not sure how to do that. – alec.tu Sep 05 '15 at 00:40
  • Have a look at "Fork/Join" actions and "Sub-Workflow" actions (a sketch follows these comments). For example, in that old but comprehensive tutorial: http://www.infoq.com/articles/oozieexample *-- then tag my comment as useful if it was :-)* – Samson Scharfrichter Sep 05 '15 at 00:53
  • Thanks, I will try it in the next two days. Anyway, I still do not understand why my approach did not work, yet it works when I submit the workflow of A without its coordinator. I think the coordinators of B and C should always monitor the directory ```hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}``` and be triggered once the directory is ready. – alec.tu Sep 05 '15 at 01:13
  • Disclaimer: I feel lucky never to have had to mess with these "datasets"... – Samson Scharfrichter Sep 05 '15 at 01:44
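
A sketch of the fork/join layout suggested in the comments, using sub-workflow actions. The parent workflow name and the app-path of C are assumptions, mirroring the A and B paths from the question:

<workflow-app name="pipeline" xmlns="uri:oozie:workflow:0.4">
   <start to="run-A"/>
   <action name="run-A">
      <sub-workflow>
         <app-path>hdfs://localhost:9000/A/workflows</app-path>
         <propagate-configuration/>
      </sub-workflow>
      <ok to="fork-BC"/>
      <error to="fail"/>
   </action>
   <!-- B and C start only after A has succeeded, and run in parallel -->
   <fork name="fork-BC">
      <path start="run-B"/>
      <path start="run-C"/>
   </fork>
   <action name="run-B">
      <sub-workflow>
         <app-path>hdfs://localhost:9000/B/workflows</app-path>
      </sub-workflow>
      <ok to="join-BC"/>
      <error to="fail"/>
   </action>
   <action name="run-C">
      <sub-workflow>
         <app-path>hdfs://localhost:9000/C/workflows</app-path>
      </sub-workflow>
      <ok to="join-BC"/>
      <error to="fail"/>
   </action>
   <join name="join-BC" to="end"/>
   <kill name="fail">
      <message>A sub-workflow failed</message>
   </kill>
   <end name="end"/>
</workflow-app>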

1 Answer


The reason is that you did not specify a done-flag, so Oozie uses the default: _SUCCESS.

done-flag: The done file for the data set. If done-flag is not specified, then Oozie configures Hadoop to create a _SUCCESS file in the output directory. If the done flag is set to empty, then Coordinator looks for the existence of the directory itself.

You should add an empty

<done-flag></done-flag>

to the dataset.
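
For example, applied to the input dataset of B and C from the question (a sketch; everything else in the dataset stays the same):

   <dataset name="input1" frequency="1440" initial-instance="${start}" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- empty done-flag: the coordinator waits for the directory itself,
           not for a _SUCCESS file inside it -->
      <done-flag></done-flag>
   </dataset>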

kecso
  • In fact, the workflow of A created the _SUCCESS file in its output directory, but the other dependent workflows did not run. – alec.tu Jan 30 '16 at 07:21
  • Ohh, I see, and you also mentioned it... Really strange. Let me replicate your issue. – kecso Feb 01 '16 at 04:10