
I write many R scripts whose output is later used as input to another R script. For example, I may write script a.R, save the resulting table as a-result.rda, write script b.R, load a-result.rda, modify the table, maybe plot some graphs, save the new table as table.csv, and so on.
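Schematically, the chain looks like this (the object and column names below are just placeholders):

# a.R
a_result <- some_processing(raw_data)        # placeholder for the real computation
save(a_result, file = "a-result.rda")

# b.R
load("a-result.rda")                         # restores a_result into the workspace
a_result$new_col <- a_result$old_col * 2     # modify the table
write.csv(a_result, file = "table.csv")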

I find it cumbersome to identify the R script (here a.R) from which a particular table.csv (indirectly) originated.

So far, my solution is a shortcut function in my .bashrc that allows me to pinpoint the immediate parent of an output object:

# search all *.R files under the current directory for a pattern
findinR(){ find . -iname '*.R' -exec grep -Hn -- "$1" {} + ; }

If I am looking for the origin of table.csv, I can run findinR table.csv and get back the path of the b.R script, together with the corresponding write.csv(...) line. In b.R I then notice that table.csv is a modified version of a-result.rda, so I run findinR a-result.rda and find script a.R. This works for 'shallow' I/O dependencies, but it quickly becomes painful when there are more layers.

Does anyone use or know of a text-based system or tool that would let me produce (automatically, or at least semi-automatically) I/O flowcharts or pipelines, so that I can record the history of the files generated during an analysis?

EDIT: Some additional, potentially crucial details:

  1. I am not particularly interested in reporting tools or in rerunning the entire analysis in one go (for smaller projects I use knitr, but for more complex, dependency-heavy workflows it is too cumbersome). I work with genomic data, which makes it impossible to rerun parts of the analysis.
  2. The input/output scripts and files do not have standardized names, and I almost never repeat the same steps, so templates would not help much.
  3. The final output can be anything, not only a .csv file, but most often it's .rds or .rdata / .rda.
  4. The only thing I need is a (semi-)automatic way to record the workflow, not necessarily to rerun it with different input.

EDIT2: I tried automatically generating txt files by grepping for lines that include save(), load(), write.csv(), read.csv(), pdf(), etc. However, I often use paste() to build my file names, so the grepped lines are not descriptive enough to identify the file they refer to.
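For illustration (the variable names here are made up), a typical offending line looks like this; grepping for the resulting file name, e.g. NA12878-result.rda, matches nothing because the name only exists at run time:

out_name <- paste0(sample_id, "-result.rda")   # e.g. "NA12878-result.rda"
save(filtered_calls, file = out_name)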

  • Some good ideas here: http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing. Also, you could write lines starting with # in your output csv files; these can be ignored when reading them back into R. See `?read.csv`, specifically the comment.char option. – infominer Apr 25 '14 at 21:54
  • Thanks for the link. Unfortunately, the thread is 4 years old and answers a slightly different question :( I am not so much interested in reporting tools per se (in a fresher thread `knitr` would replace makefiles and Sweave), nor in rerunning code for the entire analysis. I'll update my question with more details. Thanks! – antass Apr 26 '14 at 15:18
  • Interesting question. I would not try to grep the source code; it looks highly complicated, messy and error-prone. What about generating a new file `example.version` every time you create an `example.csv`, `example.rds` or `example.rdata`? This file would be a one-liner containing the script name and useful info, such as the date of creation, the git/cvs revision of the repo (if you use one), the time required to generate the csv file... It looks like this would solve your problem in a relatively easy way. You could write wrappers for it, like `my.save()`, `my.write.csv()` (sketched below). – Jealie Apr 26 '14 at 22:04
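A minimal sketch of that wrapper idea (the function and variable names below are made up, and each script would have to declare its own name):

# my.save(): save the object(s) and drop a .version sidecar recording provenance
my.save <- function(..., file) {
  save(..., file = file)
  writeLines(
    sprintf("%s | created by %s on %s", file, this_script, format(Sys.time())),
    con = paste0(file, ".version")
  )
}

# at the top of b.R
this_script <- "b.R"
my.save(new_table, file = "b-result.rda")

Tracing a file back would then just mean reading the chain of .version sidecar files.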

0 Answers