I think this is a fairly common problem, but somehow it is hard to find suitable answers.
Background

I'm currently investigating how to speed up the Node variant of Patternlab. It is basically a system where you create HTML template files containing some fundamental elements (called atoms) like headings, paragraphs, etc., and compose them into something bigger (molecules), then compose multiple molecules until you have a full web page. The template files are processed by a template engine which is able to reference and include other template files. Patternlab then reads all files and builds a graph of each file's predecessors and successors.
The problem is that all files in the template directories are read and processed, and the compiled output from the template engine is written out; done. This takes about one minute for 10-20 MB of HTML templates (think e.g. of Mustache). It would be much nicer to have differential updates.
General problem
Given a directed acyclic graph of files (nodes) f, where an edge f1 → f2 means that f1 includes f2 in its contents: when f2 changes, f1 also needs to be recompiled (i.e. the compilation order should be f2, f1). How do I efficiently find the set of all files that need recompilation, and what are the initial and target nodes, when there might be a path containing multiple changed files, and nodes depending on multiple paths?
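One way to answer the first question is reverse reachability plus a topological sort: everything that transitively includes a changed file is dirty, and a post-order DFS over the include edges yields a dependencies-first compile order. A minimal sketch, assuming the include graph is available as a Map from file to the files it includes (the names are illustrative, not Patternlab's API):

```javascript
// Given `includes` (file -> files it includes) and an array of changed
// files, return all files needing recompilation in a valid compile order
// (included files before the files that include them).
function recompilationOrder(includes, changed) {
  // Build the reverse graph: file -> files that include it (its dependents).
  const dependents = new Map();
  for (const [file, incs] of includes) {
    for (const inc of incs) {
      if (!dependents.has(inc)) dependents.set(inc, []);
      dependents.get(inc).push(file);
    }
  }
  // 1. Reverse reachability: anything that (transitively) includes a
  //    changed file is dirty too.
  const dirty = new Set(changed);
  const queue = [...changed];
  while (queue.length) {
    const f = queue.shift();
    for (const dep of dependents.get(f) || []) {
      if (!dirty.has(dep)) { dirty.add(dep); queue.push(dep); }
    }
  }
  // 2. Topological order restricted to dirty nodes: post-order DFS over
  //    the include edges (safe because the graph is acyclic).
  const order = [], visited = new Set();
  function visit(f) {
    if (visited.has(f)) return;
    visited.add(f);
    for (const inc of includes.get(f) || []) if (dirty.has(inc)) visit(inc);
    order.push(f);
  }
  for (const f of dirty) visit(f);
  return order;
}
```

The "initial" nodes are then the changed files themselves, and the "target" nodes are the dirty files with no dirty dependents (they come last in the returned order).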
Given two parallel paths (i.e. f1 → f2 → f5 and f1 → f4 → f5), how do I find the best set of paths for parallel compilation (i.e. by total length of the path) with a minimum length l?
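Rather than selecting whole paths up front, one common alternative is Kahn's algorithm: track, for every dirty file, how many of its dirty includes are still uncompiled, and hand every file whose count reaches zero to a worker. Parallelism then falls out naturally, one "wave" at a time. A sketch, assuming `dirty` is a Set and `compile` is a hypothetical async compile function:

```javascript
// Kahn's algorithm over the dirty subgraph: compile included files first,
// but run every currently-ready file in parallel. Returns the waves for
// illustration; a real implementation would cap concurrency at the number
// of CPU cores (e.g. via a worker pool).
async function compileInWaves(includes, dirty, compile) {
  const remaining = new Map();   // dirty file -> count of unfinished dirty includes
  const dependents = new Map();  // dirty file -> dirty files that include it
  for (const f of dirty) {
    const incs = (includes.get(f) || []).filter(i => dirty.has(i));
    remaining.set(f, incs.length);
    for (const i of incs) {
      if (!dependents.has(i)) dependents.set(i, []);
      dependents.get(i).push(f);
    }
  }
  const waves = [];
  let ready = [...dirty].filter(f => remaining.get(f) === 0);
  while (ready.length) {
    waves.push(ready);
    // All files in one wave are independent, so compile them concurrently.
    await Promise.all(ready.map(compile));
    const next = [];
    for (const f of ready) {
      for (const d of dependents.get(f) || []) {
        remaining.set(d, remaining.get(d) - 1);
        if (remaining.get(d) === 0) next.push(d);
      }
    }
    ready = next;
  }
  return waves;
}
```

For the example above (f1 → f2 → f5 and f1 → f4 → f5, all dirty), this compiles f5 first, then f2 and f4 in parallel, then f1.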
Considerations
So how could this be done? We simply have to watch the template directories for the events create, modify and delete. This could be done by collecting file system events from the kernel. Remember that for any file we can also receive the "last modified" timestamp (mtime).
The first step, building an initial graph, is already done by Patternlab. Each node of the graph consists of:
- the source file path and mtime
- the target file path and mtime
- a set of parameters for the file (i.e. username, page title, etc.)
- a changed flag that indicates whether the file was modified, determined by inspecting the mtimes (see below). This always reflects the current file system state (much simpler).
I.e. there is a "shadow" graph holding the most current compile output. This leads to these cases:
- source.mtime <= target.mtime: the source was compiled and not modified afterwards. If the source file is changed after building the graph, a file system event will arrive and set the changed flag.
- source.mtime > target.mtime: the source was changed before building the graph and must be recompiled.
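The two cases boil down to a tiny initializer. A sketch, assuming a node shape of my own invention ({ sourceMtime, targetMtime, changed }):

```javascript
// Initialize the `changed` flag from the two mtimes. A missing target
// (never compiled yet) is treated like an outdated one.
function initChangedFlag(node) {
  node.changed = node.targetMtime == null || node.sourceMtime > node.targetMtime;
  return node;
}
```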
Now the idea is to recompile each node if the changed flag is set and return the contents of the target file otherwise.
- When a file is compiled, all files referencing this file (i.e. all nodes with an edge pointing to this node) must be recompiled as well.
- Consider that once the source has been compiled to the target, the flag changes to false so it does not get compiled twice. Returning the target content when the node is visited again is okay, though.
- There are multiple root nodes (only outgoing edges, i.e. pages), some nodes that are "leaves" (only incoming edges, i.e. molecules), and also nodes with no edges at all.
- Assume that Patternlab cannot run in parallel, i.e. there is a "busy" flag blocking concurrent overall runs.
- However, it might be beneficial to use all CPU cores for compiling templates faster, and to employ content caching for templates that were not modified.
- If a file f1 includes two other files f2, f3, and f2, f3 both have files that need recompilation as successors, f1 must wait for all those paths to be recompiled first.
For instance (c = changed):

```
f5 -> f6 (c)
^
/
f1 -> f2 -> f3 -> f4 (c)
```
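Reading the diagram's diagonal edge as f1 → f5, reverse reachability from the two changed files marks every file in this instance dirty, and f1 has to wait for both the f2-branch and the f5-branch. A standalone check of that claim (edges point from includer to included):

```javascript
// The example graph: f4 and f6 are the changed files.
const includes = {
  f1: ['f2', 'f5'], f2: ['f3'], f3: ['f4'], f4: [],
  f5: ['f6'], f6: [],
};
// Invert the edges: which files include f?
const includedBy = {};
for (const [f, incs] of Object.entries(includes))
  for (const i of incs) (includedBy[i] = includedBy[i] || []).push(f);
// Everything transitively including a changed file is dirty too.
// (Iterating a Set also visits elements added during iteration.)
const dirty = new Set(['f4', 'f6']);
for (const f of dirty)
  for (const d of includedBy[f] || []) dirty.add(d);
console.log([...dirty].sort()); // all six files need recompilation
```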