
Make is handy for making research and data-analysis workflows with dependencies more reproducible, e.g.:

# make file
R = R CMD BATCH --no-save --no-restore 
datafiles = *.csv
outputfiles = *.pdf *.Rout    # brace expansion is not available in POSIX sh

.PHONY: all clean

all: fig_A.pdf fig_B.pdf 

clean: 
    rm -f $(datafiles) $(outputfiles)
    rm -rf output
    mkdir output

# produce outputs
fig_A.pdf fig_B.pdf: interim_data.csv plot_figs.R
    $(R) plot_figs.R
    mv plot_figs.Rout ./output

# derive interim data
interim_data.csv: source_data.csv source_to_interim.R 
    $(R) source_to_interim.R 
    mv source_to_interim.Rout ./output

# download source data
source_data.csv: download_source.R
    $(R) download_source.R 
    mv download_source.Rout ./output

This regenerates the figures from the source data, saving all outputs to ./output. But can we make things more compact, e.g. by:

  1. Avoiding repetitions, as in:

    $(R) script.R
    mv script.Rout ./output
    
  2. Reorganizing to more generically relate code (R scripts in this example), data (csv), and outputs (pdf, Rout)?

  3. Better handling the export of outputs to the ./output directory?

dzeltzer

1 Answer

1) and 2)

You should probably look at make's automatic variables:

$ cat Makefile
.NOTPARALLEL:

OUTPUT := output
R      = R CMD BATCH --no-save --no-restore
PDF    := fig_A.pdf fig_B.pdf
CSV    := interim_data.csv source_data.csv

all: $(PDF) $(CSV)

$(PDF): plot_figs.R interim_data.csv
interim_data.csv: source_to_interim.R source_data.csv
source_data.csv: download_source.R

$(CSV) $(PDF):
    $(R) $<
    mv $<out $(OUTPUT)

$ make
R CMD BATCH --no-save --no-restore download_source.R
mv download_source.Rout output
R CMD BATCH --no-save --no-restore source_to_interim.R
mv source_to_interim.Rout output
R CMD BATCH --no-save --no-restore plot_figs.R
mv plot_figs.Rout output

The $< automatic variable expands to the first prerequisite of the current target (this is why I reordered the prerequisites of fig_A.pdf, fig_B.pdf and interim_data.csv). Moreover, you can separate the rule that carries the recipe from the rules that only declare prerequisites (and have no recipe).
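
As a minimal sketch with hypothetical file names (assuming GNU make): prerequisites can be listed in one rule while the recipe lives in another; make merges the prerequisite lists, and $< picks the first one:

# hypothetical files: report.R and cleaned.csv are assumed to exist
report.pdf: report.R cleaned.csv

report.pdf:
    @echo "building $@ from $<"    # prints: building report.pdf from report.R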

Note the .NOTPARALLEL special target, which tells make not to run several recipes in parallel. In your case it is needed because you have two targets (fig_A.pdf and fig_B.pdf) producing the same plot_figs.Rout side product, which gets moved by the same recipe. If make were allowed to run in parallel mode, there would be a risk of a race condition.
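
If your version of GNU make is 4.3 or newer (an assumption about your setup, not something the question states), grouped targets (&:) are an alternative to .NOTPARALLEL for this particular pair: they declare that a single run of the recipe produces both PDFs, so the recipe is executed only once even under make -j. The two PDFs would then have to be taken out of the shared recipe rule above, roughly:

# grouped-target sketch, requires GNU make >= 4.3
fig_A.pdf fig_B.pdf &: plot_figs.R interim_data.csv
    $(R) $<
    mv $<out $(OUTPUT)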

3)

This is a bit more difficult because your recipes produce two different outputs: *.csv (or *.pdf) and *.Rout, and make has not been designed with this case in mind. It is more oriented towards one recipe = one file product. But we can try to hide these file moves using a macro (R):

$ cat Makefile
.NOTPARALLEL:

OUTPUT := output
R      = R CMD BATCH --no-save --no-restore $(1) && mv $(1)out $(OUTPUT)
PDF    := fig_A.pdf fig_B.pdf
CSV    := interim_data.csv source_data.csv

all: $(PDF) $(CSV)

$(PDF): plot_figs.R interim_data.csv
interim_data.csv: source_to_interim.R source_data.csv
source_data.csv: download_source.R

$(CSV) $(PDF):
    $(call R,$<)

$ make
R CMD BATCH --no-save --no-restore download_source.R && mv download_source.Rout output
R CMD BATCH --no-save --no-restore source_to_interim.R && mv source_to_interim.Rout output
R CMD BATCH --no-save --no-restore plot_figs.R && mv plot_figs.Rout output

The $(call ...) make function expands to the value of its first parameter (the variable R), where $(1) has been replaced by the second parameter ($<), $(2) by the third parameter (none in our case), and so on.
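
As a toy illustration with a hypothetical macro, $(call) simply substitutes the positional parameters into the variable's value:

greet = echo "$(1), $(2)!"

hello:
    $(call greet,Hello,world)    # the shell runs: echo "Hello, world!"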

Note the definition of R: it uses the recursive assignment operator (=), not the simple assignment operator (:=) because we want it to be expanded only when needed, just before make passes the recipe to the shell for execution.
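
A small sketch of the difference (hypothetical variable names): with := the positional parameters would be expanded immediately, while they are still empty, and $(call) would have nothing left to substitute:

EAGER := echo $(1)out    # $(1) is empty here, so EAGER is just 'echo out'
LAZY   = echo $(1)out    # kept unexpanded; $(1) is filled in at $(call) time

demo:
    $(call EAGER,plot_figs.R)    # prints: out
    $(call LAZY,plot_figs.R)     # prints: plot_figs.Rout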

Renaud Pacalet