How do I read dynamic files in drake?

Question

I want to use drake's dynamic targets to read multiple files. I wrote the following plan based on my understanding of how dynamic files work. However, when an input file changes, drake does not update the targets that depend on it.

What is the correct way to use drake's dynamic files to read files?

In other words, what is the dynamic files equivalent of file_in() for the problem described in "How can I import from multiple files in r-drake?"

library(drake)
library(tidyverse)

content <- tibble(x1 = 1, x2 = 1)
walk(list("a", "b"), ~ write_csv(x = content, path = paste0(., ".csv")))
read_csv("b.csv", col_types = "dd")
#> # A tibble: 1 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1

plan <- drake::drake_plan(
  import_paths = target(c(
    a = "a.csv",
    b = "b.csv"
  ),
  format = "file"
  ),

  data = target(
    read_csv(import_paths, col_types = "dd"),
    dynamic = map(import_paths)
  )
)

drake::make(plan)
#> ▶ target import_paths
#> ▶ dynamic data
#> > subtarget data_44119303
#> > subtarget data_ecc6ebe6
#> ■ finalize data
readd(data)
#> # A tibble: 2 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     1

walk(list("b"), ~ write_csv(x = content + 1, path = paste0(., ".csv")))
read_csv("b.csv", col_types = "dd")
#> # A tibble: 1 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     2     2

drake::make(plan)
#> ▶ target import_paths
#> ■ finalize data
readd(data)
#> # A tibble: 2 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     1

Created on 2020-08-06 by the reprex package (v0.3.0)

landau · Accepted Answer · 2020-08-07T00:52:13.373

Perhaps this is not obvious, but dynamic file targets are irreducible. If c("a.csv", "b.csv") is your dynamic file, you cannot break it up into "a.csv" and "b.csv". drake stores a single global hash of all those files together and does not track hashes or timestamps file by file. This helps drake stay efficient even if you return a large number of dynamic files from a single target.
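To make the consequence concrete, here is a small base-R sketch of the difference between one hash per file and one composite hash for a whole collection. It is an illustration only and does not reproduce drake's internals: the helper composite_hash() and the choice of MD5 via tools::md5sum() are assumptions made for this example.

```r
# Illustration only: composite_hash() is a hypothetical helper,
# not a drake function, and drake's real hashing scheme may differ.
demo_dir <- tempfile("hash_demo_")
dir.create(demo_dir)
paths <- file.path(demo_dir, c("a.csv", "b.csv"))
writeLines(c("x1,x2", "1,1"), paths[1])
writeLines(c("x1,x2", "1,1"), paths[2])

# Collapse the per-file hashes into a single hash for the collection.
composite_hash <- function(paths) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(unname(tools::md5sum(paths)), tmp)
  unname(tools::md5sum(tmp))
}

per_file_before <- unname(tools::md5sum(paths))
composite_before <- composite_hash(paths)

# Change only b.csv.
writeLines(c("x1,x2", "2,2"), paths[2])

per_file_after <- unname(tools::md5sum(paths))
composite_after <- composite_hash(paths)

# Per-file hashes identify which file changed.
per_file_changed <- per_file_before != per_file_after

# The composite hash only says that something in the collection changed.
composite_changed <- composite_before != composite_after
```

With per-file hashes, per_file_changed flags only the second file. With a composite hash, the only available information is that the collection as a whole is different, which is why a single dynamic file target cannot invalidate one downstream sub-target and leave the other alone.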

The solution is to make "a.csv" and "b.csv" two different dynamic file targets using a dynamic map(). You need an extra target at the beginning just to contain the path names, but it gets the job done.

library(drake)
library(tidyverse)

content <- tibble(x1 = 1, x2 = 1)
walk(list("a", "b"), ~ write_csv(x = content, path = paste0(., ".csv")))
read_csv("b.csv", col_types = "dd")
#> # A tibble: 1 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1

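plan <- drake_plan(
  # Plain character vector of paths; not yet tracked as files.
  import_paths = c("a.csv", "b.csv"),
  # One dynamic file sub-target per path, so each file gets its own hash.
  import_files = target(
    import_paths,
    format = "file",
    dynamic = map(import_paths)
  ),
  # One sub-target per file; only sub-targets whose file changed rerun.
  data = target(
    read_csv(import_files, col_types = "dd"),
    dynamic = map(import_files)
  )
)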

make(plan)
#> ▶ target import_paths
#> ▶ dynamic import_files
#> > subtarget import_files_4209ea92
#> > subtarget import_files_b8419eb2
#> ■ finalize import_files
#> ▶ dynamic data
#> > subtarget data_b59aea49
#> > subtarget data_e6b8ef3e
#> ■ finalize data

readd(data)
#> # A tibble: 2 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     1

walk(list("b"), ~ write_csv(x = content + 1, path = paste0(., ".csv")))
read_csv("b.csv", col_types = "dd")
#> # A tibble: 1 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     2     2

make(plan)
#> ▶ dynamic import_files
#> > subtarget import_files_b8419eb2
#> ■ finalize import_files
#> ▶ dynamic data
#> > subtarget data_a0f1c4f0
#> ■ finalize data

readd(data)
#> # A tibble: 2 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2

Created on 2020-08-06 by the reprex package (v0.3.0)

Thank you for the help and explanation! Indeed, I did not know that dynamic file targets are irreducible. I do not understand your explanation yet: Are you saying that when drake returns a large number of dynamic file targets from a single target, then drake does not watch the timestamps/hashes of each of these files on disk? If so, then what is the benefit of using dynamic file targets over regular targets of character vectors of file paths in such a case? — robust, Aug 07 '20 at 00:57
In the case of lots of files, drake computes a composite hash and timestamp for the whole collection. So it watches everything at once and reruns the dynamic file target if any one of its files changes. — landau, Aug 07 '20 at 01:24
But then why does drake not reread both `a.csv` and `b.csv` in the code of my question? If `b.csv` changes, then the composite hash should change as well. This should trigger a rerun of the dynamic file target and the `data` target, which depends on the dynamic file target. — robust, Aug 07 '20 at 03:48
Good point. We should either (1) throw an error if `import_paths` does not already use dynamic branching, or (2) invalidate all of `data`. — landau, Aug 07 '20 at 14:37
I think (1) is best. What users actually want here is to invalidate some sub-targets but not others, and this is impossible in the OP because of the composite hash. — landau, Aug 07 '20 at 14:48
