I think this is a fairly common problem, but somehow it is hard to find suitable answers.
Background

I'm currently investigating how to speed up the Node variant of Patternlab. It is basically a system where you create HTML template files containing some fundamental elements (called atoms) like headings, paragraphs, etc., and compose them into something bigger (molecules), then compose multiple molecules until you have a full web page. The template files are processed by a template engine which is able to reference and include other template files. Patternlab then reads all files and builds a graph of each file's predecessors and successors.
The problem is that all files in the template directories are read and processed, and the compiled output from the template engine is written out; done. This takes about one minute for 10-20 MB of HTML templates (think e.g. of Mustache). It would be much nicer to have differential updates.
General problem
Given a directed acyclic graph of files (nodes) f, where an edge f1 → f2 means that f1 includes f2 in its contents: when f2 changes, f1 also needs to be recompiled (i.e. the compilation order should be f2, f1). How do I efficiently find the set of all files that need recompilation, and what are the initial and target nodes, when there might be a path containing multiple changed files, and nodes depending on multiple paths?
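One way to answer the first question is reverse reachability plus a topological sort: everything that transitively includes a changed file is dirty, and a post-order DFS over the include edges yields a dependencies-first compile order. A minimal sketch, assuming the include graph is available as a Map from file to the files it includes (the names are illustrative, not Patternlab's API):

```javascript
// Given `includes` (file -> files it includes) and an array of changed
// files, return all files needing recompilation in a valid compile order
// (included files before the files that include them).
function recompilationOrder(includes, changed) {
  // Build the reverse graph: file -> files that include it (its dependents).
  const dependents = new Map();
  for (const [file, incs] of includes) {
    for (const inc of incs) {
      if (!dependents.has(inc)) dependents.set(inc, []);
      dependents.get(inc).push(file);
    }
  }
  // 1. Reverse reachability: anything that (transitively) includes a
  //    changed file is dirty too.
  const dirty = new Set(changed);
  const queue = [...changed];
  while (queue.length) {
    const f = queue.shift();
    for (const dep of dependents.get(f) || []) {
      if (!dirty.has(dep)) { dirty.add(dep); queue.push(dep); }
    }
  }
  // 2. Topological order restricted to dirty nodes: post-order DFS over
  //    the include edges (safe because the graph is acyclic).
  const order = [], visited = new Set();
  function visit(f) {
    if (visited.has(f)) return;
    visited.add(f);
    for (const inc of includes.get(f) || []) if (dirty.has(inc)) visit(inc);
    order.push(f);
  }
  for (const f of dirty) visit(f);
  return order;
}
```

The "initial" nodes are then the changed files themselves, and the "target" nodes are the dirty files with no dirty dependents (they come last in the returned order).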
Given two parallel paths (i.e. f1 → f2 → f5 and f1 → f4 → f5), how do I find the best set of paths for parallel compilation (i.e. by total length of the path) with a minimum length l?
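Rather than selecting whole paths up front, one common alternative is Kahn's algorithm: track, for every dirty file, how many of its dirty includes are still uncompiled, and hand every file whose count reaches zero to a worker. Parallelism then falls out naturally, one "wave" at a time. A sketch, assuming `dirty` is a Set and `compile` is a hypothetical async compile function:

```javascript
// Kahn's algorithm over the dirty subgraph: compile included files first,
// but run every currently-ready file in parallel. Returns the waves for
// illustration; a real implementation would cap concurrency at the number
// of CPU cores (e.g. via a worker pool).
async function compileInWaves(includes, dirty, compile) {
  const remaining = new Map();   // dirty file -> count of unfinished dirty includes
  const dependents = new Map();  // dirty file -> dirty files that include it
  for (const f of dirty) {
    const incs = (includes.get(f) || []).filter(i => dirty.has(i));
    remaining.set(f, incs.length);
    for (const i of incs) {
      if (!dependents.has(i)) dependents.set(i, []);
      dependents.get(i).push(f);
    }
  }
  const waves = [];
  let ready = [...dirty].filter(f => remaining.get(f) === 0);
  while (ready.length) {
    waves.push(ready);
    // All files in one wave are independent, so compile them concurrently.
    await Promise.all(ready.map(compile));
    const next = [];
    for (const f of ready) {
      for (const d of dependents.get(f) || []) {
        remaining.set(d, remaining.get(d) - 1);
        if (remaining.get(d) === 0) next.push(d);
      }
    }
    ready = next;
  }
  return waves;
}
```

For the example above (f1 → f2 → f5 and f1 → f4 → f5, all dirty), this compiles f5 first, then f2 and f4 in parallel, then f1.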
Considerations
So how could this be done? We simply have to watch the template directories for the events create, modify and delete. This could be done by collecting file system events from the kernel. Remember that for any file we can also receive the "last modified" timestamp (mtime).
The first step, building an initial graph, is already done by Patternlab. Each node of the graph consists of:
- the source file path and mtime
- the target file path and mtime
- a set of parameters for the file (i.e. username, page title, etc.)
- a changed flag that indicates whether the file was modified, determined by inspecting the mtimes (see below). This always reflects the current file system state (much simpler).
I.e. there is a "shadow" graph holding the most current compile output. This leads to these cases:
- source.mtime <= target.mtime: the source was compiled and not modified afterwards. If the source file is changed after building the graph, a file system event will arrive and set the changed flag.
- source.mtime > target.mtime: the source was changed before building the graph and must be recompiled.
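The two cases boil down to a tiny initializer. A sketch, assuming a node shape of my own invention ({ sourceMtime, targetMtime, changed }):

```javascript
// Initialize the `changed` flag from the two mtimes. A missing target
// (never compiled yet) is treated like an outdated one.
function initChangedFlag(node) {
  node.changed = node.targetMtime == null || node.sourceMtime > node.targetMtime;
  return node;
}
```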
Now the idea is to recompile each node if the changed flag is set and return the contents of the target file otherwise.
- When a file is compiled, all files referencing this file (i.e. all nodes with an edge pointing to this node) must be recompiled as well.
- Consider that once the source has been compiled to the target, the flag changes to false so it does not get compiled twice. Returning the target content when the node is visited again is okay, though.
- There are multiple root nodes (only outgoing edges, i.e. pages), some nodes that are "leaves" (only incoming edges, i.e. molecules), and also nodes with no edges at all.
- Assume that Patternlab cannot run in parallel, i.e. there is a "busy" flag blocking concurrent overall runs.
- However, it might be beneficial to use all CPU cores for compiling templates faster, and to employ content caching for templates that were not modified.
- If a file f1 includes two other files f2, f3, and f2, f3 both have files that need recompilation as successors, f1 must wait for all those paths to be recompiled first.
For instance (c = changed):

```
f5 -> f6 (c)
^
/
f1 -> f2 -> f3 -> f4 (c)
```
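Reading the diagram's diagonal edge as f1 → f5, reverse reachability from the two changed files marks every file in this instance dirty, and f1 has to wait for both the f2-branch and the f5-branch. A standalone check of that claim (edges point from includer to included):

```javascript
// The example graph: f4 and f6 are the changed files.
const includes = {
  f1: ['f2', 'f5'], f2: ['f3'], f3: ['f4'], f4: [],
  f5: ['f6'], f6: [],
};
// Invert the edges: which files include f?
const includedBy = {};
for (const [f, incs] of Object.entries(includes))
  for (const i of incs) (includedBy[i] = includedBy[i] || []).push(f);
// Everything transitively including a changed file is dirty too.
// (Iterating a Set also visits elements added during iteration.)
const dirty = new Set(['f4', 'f6']);
for (const f of dirty)
  for (const d of includedBy[f] || []) dirty.add(d);
console.log([...dirty].sort()); // all six files need recompilation
```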