
I write many R scripts whose output is later used as input to another R script. For example, I may write script a.R, save the resulting table as a-result.rda, write script b.R, load a-result.rda, modify the table, maybe plot some graphs, save the new table as table.csv, and so on.
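Schematically, the chain looks like this (the object and column names below are just placeholders):

# a.R
a_result <- some_processing(raw_data)        # placeholder for the real computation
save(a_result, file = "a-result.rda")

# b.R
load("a-result.rda")                         # restores a_result into the workspace
a_result$new_col <- a_result$old_col * 2     # modify the table
write.csv(a_result, file = "table.csv")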

I find it cumbersome to identify the R script (here a.R) from which a particular table.csv (indirectly) originated.

So far, my solution is a shortcut function in my .bashrc that allows me to pinpoint the immediate parent of an output object:

# search all *.R files under the current directory for a pattern
findinR(){ find . -iname '*.R' -exec grep -Hn -- "$1" {} + ; }

If I am looking for the origin of table.csv, I can run findinR table.csv and get back the path of the b.R script, together with the corresponding write.csv(...) line. In b.R I then notice that table.csv is a modified version of a-result.rda, so I run findinR a-result.rda and find script a.R. This works for 'shallow' I/O dependencies, but it quickly becomes painful when there are more layers.

Does anyone use or know of a text-based system or tool that would let me produce (automatically, or at least semi-automatically) I/O flowcharts or pipelines, so that I can record the history of the files generated during an analysis?

EDIT: Some additional, potentially crucial details:

  1. I am not particularly interested in reporting tools or in rerunning the entire analysis in one go (for smaller projects I use knitr, but for more complex, dependency-heavy workflows it is too cumbersome). I work with genomic data, which makes it impossible to rerun parts of the analysis.
  2. The input/output scripts and files do not have standardized names, and I almost never repeat the same steps, so templates would not help much.
  3. The final output can be anything, not only a .csv file, but most often it's .rds or .rdata / .rda.
  4. The only thing I need is a (semi-)automatic way to record the workflow, not necessarily to rerun it with different input.

EDIT2: I tried automatically generating txt files by grepping for lines that include save(), load(), write.csv(), read.csv(), pdf(), etc. However, I often use paste() to build my file names, so the grepped lines are not descriptive enough to identify the file they refer to.
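For illustration (the variable names here are made up), a typical offending line looks like this; grepping for the resulting file name, e.g. NA12878-result.rda, matches nothing because the name only exists at run time:

out_name <- paste0(sample_id, "-result.rda")   # e.g. "NA12878-result.rda"
save(filtered_calls, file = out_name)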

  • Some good ideas here: http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing. Also, you could write lines starting with # in your output csv files; these can be ignored when reading them back into R. See `?read.csv`, specifically the comment.char option. – infominer Apr 25 '14 at 21:54
  • Thanks for the link. Unfortunately, the thread is 4 years old and answers a slightly different question :( I am not so much interested in reporting tools per se (in a fresher thread `knitr` would replace makefiles and Sweave), nor in rerunning code for the entire analysis. I'll update my question with more details. Thanks! – antass Apr 26 '14 at 15:18
  • Interesting question. I would not try to grep the source code; it looks highly complicated, messy and error-prone. What about generating a new file `example.version` every time you create an `example.csv`, `example.rds` or `example.rdata`? This file would be a one-liner containing the script name and useful info, such as the date of creation, the git/cvs revision of the repo (if you use one), the time required to generate the csv file... It looks like this would solve your problem in a relatively easy way. You could write wrappers for it, like `my.save()`, `my.write.csv()` (sketched below). – Jealie Apr 26 '14 at 22:04
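A minimal sketch of that wrapper idea (the function and variable names below are made up, and each script would have to declare its own name):

# my.save(): save the object(s) and drop a .version sidecar recording provenance
my.save <- function(..., file) {
  save(..., file = file)
  writeLines(
    sprintf("%s | created by %s on %s", file, this_script, format(Sys.time())),
    con = paste0(file, ".version")
  )
}

# at the top of b.R
this_script <- "b.R"
my.save(new_table, file = "b-result.rda")

Tracing a file back would then just mean reading the chain of .version sidecar files.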

0 Answers