
For my project, I sometimes need to restructure my project data directory, or simply change its mount point (e.g., after upgrading to macOS Catalina, which no longer allows non-standard subdirectories of /).

I've noticed that, even though the contents of the input directories don't change, changing the path prefix common to all the input paths invalidates all targets.

Is there a way to avoid this?

  • Related: you may be interested in dynamic files: https://github.com/ropensci/drake/pull/1178. Brand new in development `drake` (the GitHub version, `remotes::install_github("ropensci/drake")`). – landau Feb 22 '20 at 13:31

1 Answer

My main recommendation here is to use relative paths instead of absolute paths. If you have ever used the here package, it is the same idea. But instead of writing file.path(here::here(), "path/to/file.txt"), I recommend writing file_in("path/to/file.txt") in the plan, assuming you intend to call drake::make() from the working directory that contains path/.
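To make the relative-path idea concrete, here is a minimal sketch. The directory layout, the file contents, and the target name `raw` are hypothetical and only serve the illustration; the files are created in the current working directory.

```r
library(drake)

# Hypothetical project layout: a small input file under ./path/to/.
dir.create(file.path("path", "to"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("line one", "line two"), file.path("path", "to", "file.txt"))

# The path inside file_in() is relative, so it is resolved against the
# working directory in effect when make() runs.
plan <- drake_plan(
  raw = readLines(file_in("path/to/file.txt"))
)

make(plan)
readd(raw)
```

Because the plan never mentions where the project lives on disk, moving the whole project directory should leave the command text unchanged.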

That's for future reference. In your current situation, if you are absolutely sure all the files are up to date and you don't want to spend time rebuilding targets, then you can use make(plan, trigger = trigger(command = FALSE, file = FALSE)) to tell drake to stop worrying about whether commands or files change. (Why commands? Because that's where the file_in() calls will be, and I assume you are changing the paths inside.)
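Spelled out as a script, the trigger suggestion might look like the sketch below. The input file `input.csv` and the target name `data` are hypothetical stand-ins; only `make()`, `trigger()`, `drake_plan()`, `file_in()`, and `readd()` come from drake.

```r
library(drake)

# Hypothetical input file, created here only so the example runs.
writeLines(c("a,b", "1,2"), "input.csv")

plan <- drake_plan(
  data = read.csv(file_in("input.csv"))
)

# First build: everything runs as usual.
make(plan)

# Later builds: switch off the command and file triggers, so edits to the
# command text or changes to tracked files do not by themselves
# invalidate targets.
make(plan, trigger = trigger(command = FALSE, file = FALSE))

readd(data)
```

Note that this setting applies to every target in the plan, so it is only safe while you are certain nothing has really changed.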

Edit

I realize now that I did not fully understand your question the first time. But since I also work with data in a similar way as you, I think there is an answer. Say you have a plan like this:

plan <- drake_plan(
  data = get_data(file_in("DRIVE_NAME/file.db"))
)

And your mount point changes, making it look like this:

plan <- drake_plan(
  data = get_data(file_in("DIFFERENT_MOUNT_POINT/file.db"))
)

As you noted, the struggle comes from that changing path. There are two things you can do here. First, manually track the file using the "change" trigger. That way, we don't need file_in(). Second, use ignore() around the changing path so drake thinks the command stays the same. The result: no superfluous invalidation when you change mount points.

plan <- drake_plan(
  data = target(
    get_data(ignore("WHATEVER_MOUNT_POINT/file.db")),
    trigger = trigger(change = file.mtime("WHATEVER_MOUNT_POINT/file.db"))
  ) 
)

Now, whenever the modification time changes, the data gets invalidated. But you can change WHATEVER_MOUNT_POINT without incurring invalidation. I would ordinarily choose a file hash for the trigger (that's what file_in() tells drake to do as a last resort), but I chose the time stamp for you because file.mtime() is fast, your data is large, and it hardly ever changes.
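The trade-off between a time stamp and a hash can be seen with base R alone. In this sketch, `tools::md5sum()` is only a stand-in for a content hash (it is not claimed to be the hash drake uses), and the temporary file is a hypothetical substitute for the real data.

```r
# Hypothetical stand-in for a large input file.
path <- tempfile(fileext = ".db")
writeBin(as.raw(rep(0:255, times = 1000)), path)

mtime_before <- file.mtime(path)
hash_before <- unname(tools::md5sum(path))

# Change only the time stamp, leaving the contents untouched.
Sys.setFileTime(path, Sys.time() - 3600)

mtime_after <- file.mtime(path)
hash_after <- unname(tools::md5sum(path))

# The time stamp moved, so a trigger based on file.mtime() would fire ...
mtime_changed <- !isTRUE(all.equal(mtime_before, mtime_after))

# ... while a content hash still sees the same file.
hash_unchanged <- identical(hash_before, hash_after)
```

In other words, the time stamp is cheap to read but can change without the contents changing, whereas a hash reads the whole file and only changes with the contents.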

landau
  • Would that suggestion work with symlinks/hardlinks? I work with sensitive data with restrictive storage mandated by ethics, so it's supposed to be stored externally, meaning I'm not able to physically store it relative to the project working directory. Also, thank you for the `file` suggestion, generally I think that will probably speed up my workflow since my raw inputs rarely change and are in the order of 60GB. Would adding an optional `prefix` argument to `file_in` that's common to all input (Or a `file_prefix` to `make`/`config`) be a feature you'd consider? – Matthew Strasiotto Dec 13 '19 at 04:14
  • Yeah, I also work with sensitive/restricted data that is tightly locked down and rarely changes. `file_in()` on a local symlink is worth a shot, but I have not tried it myself. I am reluctant to implement prefixing because the payoff seems small even in situations like ours. I will edit the answer with another suggestion... – landau Dec 13 '19 at 05:13
  • Please see the edit above. I think it does what you need. – landau Dec 13 '19 at 05:22
  • Thank you for the edit that solves my use-case perfectly, and for the `file.mtime()` suggestion, as I think hashing the raw inputs (which hardly ever change) was one of the slowest parts of my workflow. I've suggested a minor edit to the answer to remove the `suffix` argument from the `get_data` dummy call, to make the "before" example match the "after" example. – Matthew Strasiotto Dec 16 '19 at 04:38