You could lace demo directives into the standalone module, as per the IPython Demo Mode example.
Then, when actually executing it in the notebook, you call the demo wrapper object each time you want to step to the next important part. So your cells would mostly consist of calls to that demo wrapper object.
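As a minimal sketch of what that might look like (the file names and the analysis itself are invented placeholders here, though `IPython.lib.demo.Demo` and the `# <demo> stop` markers are the real mechanism), the standalone module could be laid out like this:

```python
# analysis_demo.py -- hypothetical standalone analysis script.
# Each '# <demo> stop' marker ends a block that the Demo wrapper
# executes as one step.
import numpy as np

data = np.loadtxt("measurements.csv", delimiter=",")  # hypothetical input file
print("Loaded array with shape", data.shape)

# <demo> stop

means = data.mean(axis=0)
print("Column means:", means)

# <demo> stop

print("First row, centered:", data[0] - means)
```

and the notebook would just step through it, one call per cell:

```python
# Notebook: each demo() call goes in its own cell and runs the script
# up to the next '# <demo> stop' marker.
from IPython.lib.demo import Demo

demo = Demo("analysis_demo.py")
demo()  # first marked block
demo()  # next block, and so on
```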
Option 2 is clearly the best for code re-use; it is arguably the de facto standard across all of software engineering.
I argue that the notebook concept itself doesn't scale well to 3, 4, 5, ... different data files. Notebook presentations are not meant to be batch-processing receptacles. If you find yourself needing to do parameter sweeps across different data sets, and wanting to re-run analyses on top of the different data loaded for each parameter group (even when the 'parameters' might be as simple as different file names), that is a bad code smell. It likely means the analysis is being performed 'interactively' at the wrong level: witnessing analysis 'interactively' and performing batch processing at the same time are pretty much incompatible goals.

A much better idea would be to batch process all of the parameter sets separately, 'offline' from the point of view of any presentation, and then build a set of stand-alone functions that can produce visual results from the computed and stored batch results. The notebook then becomes just a series of function calls, each of which produces summary data across all of the parameter sets at once (some of which could be examples drawn from a selection of parameter sets during batch processing), inviting the necessary comparisons and presenting the result data side by side in a meaningful way.
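A rough sketch of that shape, just to make it concrete (the file names, the `analyze` function, and the stored-result format are all hypothetical placeholders, not a prescription):

```python
# batch_run.py -- hypothetical offline batch step: compute and store results
# for every parameter set, completely outside of any notebook.
import json
from pathlib import Path

import numpy as np

PARAM_SETS = ["run_a.csv", "run_b.csv", "run_c.csv"]  # hypothetical inputs
RESULTS_DIR = Path("results")


def analyze(path):
    """Placeholder analysis: swap in the real computation."""
    data = np.loadtxt(path, delimiter=",")
    return {"mean": float(data.mean()), "std": float(data.std())}


def main():
    RESULTS_DIR.mkdir(exist_ok=True)
    for name in PARAM_SETS:
        result = analyze(name)
        (RESULTS_DIR / f"{Path(name).stem}.json").write_text(json.dumps(result))


if __name__ == "__main__":
    main()
```

```python
# summary.py -- hypothetical presentation helpers the notebook imports;
# each function reads the stored batch results and compares them side by side.
import json
from pathlib import Path

import matplotlib.pyplot as plt


def load_all(results_dir="results"):
    """Load every stored batch result into a dict keyed by parameter set."""
    return {p.stem: json.loads(p.read_text()) for p in Path(results_dir).glob("*.json")}


def compare_means(results):
    """One figure comparing all parameter sets at once."""
    names = sorted(results)
    plt.bar(names, [results[n]["mean"] for n in names])
    plt.ylabel("mean")
    plt.title("Mean across all parameter sets")
    plt.show()
```

The notebook then reduces to `results = load_all()` plus calls like `compare_means(results)`, so every figure is already a comparison across all parameter sets rather than a re-run of the analysis on one of them.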
'Witnessing' an entire interactive presentation that performs analysis on one parameter set, then changing some global variable / switching to a new notebook / running more cells in the same notebook in order to 'witness' the same presentation on a different parameter set, sounds borderline useless to me. I cannot imagine a situation where that mode of consuming the presentation is not strictly worse than consuming a targeted summary presentation that first computes results for all parameter sets of interest and then assembles the important results into a comparison.
Perhaps the only case I can think of would be toy pedagogical demos, like some toy frequency data and a series of notebooks that do some simple Fourier analysis on it. But that is exactly the kind of case that begs for the analysis functions to be pulled into a helper module, with the notebook itself just letting you selectively declare which toy input file to run on top of.
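For concreteness, a sketch of that split (the module name, file name, and sample rate are invented for illustration):

```python
# fourier_helpers.py -- hypothetical helper module for the toy Fourier demo.
import matplotlib.pyplot as plt
import numpy as np


def load_signal(path):
    """Load a toy one-column time series from a text file."""
    return np.loadtxt(path)


def plot_spectrum(signal, sample_rate=1.0):
    """Plot the magnitude spectrum of the signal."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    plt.plot(freqs, np.abs(np.fft.rfft(signal)))
    plt.xlabel("frequency")
    plt.ylabel("|FFT|")
    plt.show()
```

```python
# Notebook cell: declare which toy input to run on, then call the helpers.
from fourier_helpers import load_signal, plot_spectrum

signal = load_signal("toy_tone_440hz.txt")  # swap the file name to change input
plot_spectrum(signal, sample_rate=8000.0)
```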