I'm trying to take output from a single process (P1) and perform parallel tasks on it using other processes (P2 and P3). So far so simple.
To do this I'm connecting P2 and P3 to the single out-port of P1. In my mind, this should mean that P1 emits packets through its out port that are picked up by both P2 and P3 simultaneously, in parallel.
What I'm finding is that P2 and P3 aren't started in parallel and instead one of the processes will wait until the other has finished processing (or at least it seems that way to me).
For example, here is a simple graph that should take a JSON input then simultaneously grab a timestamp and parse the JSON. Another timestamp is taken after parsing the JSON and this is used as a basic method for calculating how long the JSON parsing took.
Notice the ordering of the connections going from the ajax/Get out port (the timestamp connection was added last).
In this case the difference in the timestamps is around 5ms, which roughly lines up with how long the JSON parse takes in a non-NoFlo environment (it's actually a little longer in NoFlo for some reason).
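For comparison, a standalone timing of the parse outside NoFlo can be sketched as follows. This is a minimal sketch in plain JavaScript, not the code that produced the figures above; `timeParse` and the generated sample payload are made up for illustration, and `Date.now()` only resolves to whole milliseconds.

```javascript
// Hypothetical helper (not part of NoFlo): time one JSON.parse call by
// taking a timestamp immediately before and immediately after it.
function timeParse(jsonText) {
  const before = Date.now();
  const parsed = JSON.parse(jsonText);
  const after = Date.now();
  return { parsed, elapsedMs: after - before };
}

// Made-up GeoJSON-like payload, standing in for the real input.
const sample = JSON.stringify({
  type: 'FeatureCollection',
  features: Array.from({ length: 1000 }, (_, i) => ({
    type: 'Feature',
    id: i,
    geometry: { type: 'Point', coordinates: [i, i] },
  })),
});

const result = timeParse(sample);
console.log(`JSON.parse took ${result.elapsedMs}ms`);
```

The graph is meant to measure the same thing: one timestamp taken as the data leaves ajax/Get, another taken after the parse.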
Now take the same graph but this time the connection order from the ajax/Get out port has changed (the parse connection was added last):
This time the difference between the timestamps is around 40–50ms, which is clearly a massive difference and far larger than what the parse takes outside of NoFlo.
I'd really appreciate it if someone can shed some light on the following:
- Why are the timings so different depending on the connection order?
- How can I ensure that the 2 connections coming from ajax/Get are triggered and run in parallel (i.e. they don't wait on each other)?
If it helps, here's a JSON export of the graph from FlowHub.
I've also put together a simple graph using the CLI and have managed to get a better insight into the flow of the graph and perhaps shed some light on what might be causing this:
# This executes in the correct order, though likely by
# coincidence and not due to true parallelisation.
#
# Time1 is run and outputted before Time2.
#
Read(filesystem/ReadFile) OUT -> IN Time1(objects/GetCurrentTimestamp)
Read OUT -> IN Parse(strings/ParseJson)
# This executes the entire Parse path before going back to grab
# and output Time1.
#
# Time1 is run and outputted *after* Time2.
#
# Read doesn't send a disconnect message to Parse until *after*
# Time1 is outputted.
#
# Read doesn't send a disconnect message to Time1 until *after*
# the Parse path has finished disconnecting.
#
# Read(filesystem/ReadFile) OUT -> IN Parse(strings/ParseJson)
# Read OUT -> IN Time1(objects/GetCurrentTimestamp)
Time1 OUT -> IN Display1(core/Output)
Parse OUT -> IN Time2(objects/GetCurrentTimestamp)
Time2 OUT -> IN Display2(core/Output)
'sample.geojson' -> IN Read
When run with the Read to Time1 connection defined before Read to Parse, the network runs in order, though I've noticed that Read waits until everything else has completed before firing a disconnect (is that right?):
DATA -> ENCODING Read() CONN
DATA -> ENCODING Read() DATA
DATA -> ENCODING Read() DISC
DATA -> IN Read() CONN
DATA -> IN Read() DATA
DATA -> IN Read() DISC
Read() OUT -> IN Time1() CONN
Read() OUT -> IN Time1() < sample.geojson
Read() OUT -> IN Parse() CONN
Read() OUT -> IN Parse() < sample.geojson
Parse() OUT -> IN Time2() CONN
Parse() OUT -> IN Time2() < sample.geojson
Read() OUT -> IN Time1() DATA
Time1() OUT -> IN Display1() CONN
Time1() OUT -> IN Display1() DATA
1422549101639
Read() OUT -> IN Parse() DATA
Parse() OUT -> IN Time2() DATA
Time2() OUT -> IN Display2() CONN
Time2() OUT -> IN Display2() DATA
1422549101647
Read() OUT -> IN Time1() > sample.geojson
Read() OUT -> IN Parse() > sample.geojson
Parse() OUT -> IN Time2() > sample.geojson
Read() OUT -> IN Time1() DISC
Time1() OUT -> IN Display1() DISC
Read() OUT -> IN Parse() DISC
Parse() OUT -> IN Time2() DISC
Time2() OUT -> IN Display2() DISC
If I switch the order so the Read to Parse connection is defined first, then everything goes wrong and Time1 isn't even sent a packet from Read until the entire Parse path has completed (so Time1 is actually after Time2 now):
DATA -> ENCODING Read() CONN
DATA -> ENCODING Read() DATA
DATA -> ENCODING Read() DISC
DATA -> IN Read() CONN
DATA -> IN Read() DATA
DATA -> IN Read() DISC
Read() OUT -> IN Parse() CONN
Read() OUT -> IN Parse() < sample.geojson
Parse() OUT -> IN Time2() CONN
Parse() OUT -> IN Time2() < sample.geojson
Read() OUT -> IN Time1() CONN
Read() OUT -> IN Time1() < sample.geojson
Read() OUT -> IN Parse() DATA
Parse() OUT -> IN Time2() DATA
Time2() OUT -> IN Display2() CONN
Time2() OUT -> IN Display2() DATA
1422549406952
Read() OUT -> IN Time1() DATA
Time1() OUT -> IN Display1() CONN
Time1() OUT -> IN Display1() DATA
1422549406954
Read() OUT -> IN Parse() > sample.geojson
Parse() OUT -> IN Time2() > sample.geojson
Read() OUT -> IN Time1() > sample.geojson
Read() OUT -> IN Parse() DISC
Parse() OUT -> IN Time2() DISC
Time2() OUT -> IN Display2() DISC
Read() OUT -> IN Time1() DISC
Time1() OUT -> IN Display1() DISC
If this is correct behaviour, then how do I run the 2 branches in parallel without one blocking the other?
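To make the pattern in the traces concrete, here is a toy model in plain JavaScript of what the output looks like to me. It is only an assumption about the behaviour, not NoFlo's actual implementation: `makeOutPort`, `attach`, `send` and `runGraph` are made-up names, and the model simply delivers each packet synchronously to the attached connections in the order they were added.

```javascript
// Toy model only: NOT NoFlo's code. An out port that delivers a packet
// synchronously to each attached connection, in attachment order.
function makeOutPort() {
  const listeners = [];
  return {
    attach(fn) {
      listeners.push(fn);
    },
    send(data) {
      // Each listener runs to completion before the next one starts.
      for (const fn of listeners) fn(data);
    },
  };
}

// Hypothetical stand-ins for the two branches of the graph above.
function runGraph(connectionOrder) {
  const events = [];
  const branches = {
    time1: () => events.push('Time1'),
    parse: (text) => {
      JSON.parse(text);     // the Parse step
      events.push('Time2'); // the timestamp taken after parsing
    },
  };
  const readOut = makeOutPort();
  for (const name of connectionOrder) readOut.attach(branches[name]);
  readOut.send('{"type": "FeatureCollection", "features": []}');
  return events;
}

console.log(runGraph(['time1', 'parse'])); // Time1, then Time2
console.log(runGraph(['parse', 'time1'])); // Time2, then Time1
```

In this model the second branch cannot start until the first has finished, which matches the ordering I see in both traces.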
I've tried making every component asynchronous, I've tried combining that with the WirePattern, and I've tried creating multiple out ports and sending the data through all of them at once. No joy – it always comes down to the order in which the first edges are connected. I'm pulling my hair out with this as it's completely blocking my use of NoFlo for ViziCities :(