
I'd like to use make to process a large number of inputs to outputs using a script (python, say.) The problem is that the script takes an incredibly short amount of time to run per input, but the initialization takes a while (python engine + library initialization.) So, a naive makefile that just has an input->output rule ends up being dominated by this initialization time. Parallelism doesn't help with that.

The python script can accept multiple inputs and outputs, as so:

python my_process -i in1 -o out1 -i in2 -o out2 ...

and this is the recommended way to use the script.
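To make that calling convention concrete, a script that accepts repeated `-i`/`-o` pairs could parse its arguments along these lines (a hypothetical sketch; the real internals of `my_process` aren't shown in the question):

```python
# Hypothetical sketch of argument parsing for a script with
# my_process's calling convention: repeated -i/-o pairs.
import argparse

def parse_pairs(argv):
    """Return [(input, output), ...] from repeated -i/-o flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", dest="inputs", action="append", default=[])
    parser.add_argument("-o", dest="outputs", action="append", default=[])
    args = parser.parse_args(argv)
    if len(args.inputs) != len(args.outputs):
        raise SystemExit("each -i needs a matching -o")
    return list(zip(args.inputs, args.outputs))

# parse_pairs(["-i", "in1", "-o", "out1", "-i", "in2", "-o", "out2"])
# → [("in1", "out1"), ("in2", "out2")]
```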

How can I make a Makefile rule that best uses my_process, by sending in out of date input-output pairs in batches? Something like parallel but aware of which outputs are out of date.

I would prefer to avoid recursive make, if at all possible.

Ben Braun
  • Sounds like a job for grouped targets. Can't try it myself because the distro is stuck on an older version of GNU Make. https://www.gnu.org/software/make/manual/html_node/Multiple-Targets.html – Andreas Jul 09 '20 at 14:30
  • @Andreas No, grouped targets won't help here. For one, grouped targets currently don't work with pattern rules (I think this is bcs. this would be a very unusual pattern where a N:N:1 triple relation would be needed), and secondly you don't get the list of targets which need updating out of a grouped target, at least not with the usual special variables. – Vroomfondel Jul 10 '20 at 10:32
  • @Vroomfondel Yeah, patterns won't work. Was thinking something very similar to your answer, with a patsubst on SOURCES to get the out files rather than the placeholder target "lastrun". – Andreas Jul 10 '20 at 17:41

2 Answers


I don't completely grasp your problem: do you really want make to operate in batches or do you want a kind of perpetual make process checking the file system on the fly and feeding to the Python process whenever it finds necessary? If the latter, this is quite the opposite of a batch mode and rather a pipeline.

For the batch mode there is a work-around which needs a dummy file recording the last running time. In this case we are abusing make somewhat, because this part of the makefile is a one-trick pony, which is unintuitive and against good practice:

SOURCES := $(wildcard in*)
lastrun : $(SOURCES)
        python my_process $(foreach src,$?,-i $(src) -o $(patsubst in%,out%,$(src)))
        touch lastrun

PS: please note that this solution has a substantial flaw: it doesn't detect updates of in-files that happen during the run of the makefile. All in all it is more advisable to simply collect the filenames of the in-files that were updated by the update process itself and avoid make altogether.
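For that no-make route, the up-to-date check itself is only a few lines. A minimal sketch (assuming explicit input/output pairs rather than the `in%`/`out%` pattern):

```python
# Minimal sketch of make's out-of-date check, done by hand:
# an input needs reprocessing when its output is missing or older.
import os

def stale_pairs(pairs):
    """Filter (input, output) pairs down to those needing a rebuild."""
    stale = []
    for src, dst in pairs:
        if not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src):
            stale.append((src, dst))
    return stale
```

The surviving pairs can then be handed to `my_process` in a single invocation, paying the initialization cost once.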

Vroomfondel
  • This appears to run my_process once for _all_ inputs, which doesn't take advantage of make's parallelism, and furthermore if any input changes this will rerun my_process for all inputs. – Ben Braun Jul 09 '20 at 22:58
  • Sorry, forgot to replace `$(SOURCES)` with `$?`. Now it only rebuilds the changed files. For balanced parallelism, e.g. with as many processes as you have cores, it is necessary to create `lastrunX` targets on the fly with their prerequisite lists divided up, and hope that `make` does a good enough job at parallelizing. All in all this starts to look like `make` is simply the wrong tool for this job, although I totally admit that a better build system SHOULD have no problem with your requirements. – Vroomfondel Jul 10 '20 at 09:07
  • I'll add your example to the feature list of the `make`-successor I'm secretly building, thanks! ;) – Vroomfondel Jul 10 '20 at 09:09
  • Including the documentation on `$?`: "The names of all the prerequisites that are newer than the target, with spaces between them. If the target does not exist, all prerequisites will be included. For prerequisites which are archive members, only the named member is used (see Archives)." This works, thanks! – Ben Braun Jul 10 '20 at 21:40

This is what I ended up going with, a makefile with one layer of recursion.

I tried using $? with both grouped and ungrouped targets, but couldn't get the exact behavior needed. If one of the output targets was deleted, the rule would be re-run, but $? would sometimes contain other input files while missing the one corresponding to the deleted output, which was very strange.

Makefile:

all:

INDIR=in
OUTDIR=out

INFILES=$(wildcard in/*)
OUTFILES=$(patsubst in/%, out/%, $(INFILES))

ifdef FIRST_PASS
#Discover which input-output pairs are out of date
$(shell mkdir -p $(OUTDIR); echo -n > $(OUTDIR)/.needs_rebuild)
$(OUTFILES) : out/% : in/%
    @echo $@ $^ >> $(OUTDIR)/.needs_rebuild

all: $(OUTFILES)
    @echo -n
else
#Recurse to run FIRST_PASS, builds .needs_rebuild:
$(shell $(MAKE) -f $(CURDIR)/$(firstword $(MAKEFILE_LIST)) FIRST_PASS=1)
#Convert .needs_rebuild into batches, creates all_batches phony target for convenience
$(shell cat $(OUTDIR)/.needs_rebuild | ./make_batches.sh 32 > $(OUTDIR)/.batches)
-include $(OUTDIR)/.batches

batch%:
    #In this rule, $^ is all inputs needing rebuild.
    #The corresponding outputs can be computed using a patsubst:
    targets="$(patsubst in/%, out/%, $^)"; touch $$targets

clean:
    rm -rf $(OUTDIR)

all: all_batches

endif

make_batches.sh:

#!/bin/bash
set -beEu -o pipefail

batch_size=$1

function _make_batches {
    batch_num=$1
    shift 1
    #echo ".PHONY: batch$batch_num"
    echo "all_batches: batch$batch_num"
    while (( $# >= 1 )); do
        read out in <<< $1
        shift 1
        echo "batch$batch_num: $in"
        echo "$out: batch$batch_num"
    done
}
export -f _make_batches

echo ".PHONY: all_batches"

parallel -N$batch_size -- _make_batches {#} {} \;
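GNU parallel is only being used here to chunk stdin into groups of batch_size; a rough Python stand-in for make_batches.sh (same "out in" line format, same emitted fragment) could look like:

```python
# Rough Python stand-in for make_batches.sh: read "out in" lines,
# chunk them into batches of batch_size, and emit the same makefile
# fragment (all_batches depends on each batchN, batchN depends on its
# inputs, and each output depends on its batchN).
import sys

def make_batches(lines, batch_size):
    rules = [".PHONY: all_batches"]
    for num, start in enumerate(range(0, len(lines), batch_size), start=1):
        rules.append("all_batches: batch%d" % num)
        for line in lines[start:start + batch_size]:
            out, inp = line.split()
            rules.append("batch%d: %s" % (num, inp))
            rules.append("%s: batch%d" % (out, num))
    return "\n".join(rules)

if __name__ == "__main__" and len(sys.argv) > 1:
    size = int(sys.argv[1])
    entries = [l for l in sys.stdin.read().splitlines() if l.strip()]
    print(make_batches(entries, size))
```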

Unfortunately, the makefile is a one trick pony and there's quite a bit of boilerplate to pull this recipe off.

Ben Braun
  • Off-topic: For your scenario, is this really simpler than making my_process parallelize the processing? For example using forks, having one fork for each input/output pair. – Andreas Jul 11 '20 at 08:55