Use case: id-10T proofing data removal with a zero-trust command.

I am looking through the documentation and I don't see clear-cut guidelines for what DVC accepts as a file name.

Right now, I know that DVC implements some name filtration. I cannot, for example, add a file with a newline:

$: touch 'foo
bar.txt'
$: dvc add foo$'\n'bar.txt
Adding...
ERROR: output 'foobar.txt' does not exist

Can someone point me to the documentation that explains exactly what is allowed to go into the yaml file as a path?

Chris
  • Can you expand on the use case though? How do file names relate to data removal? What do you mean by zero-trust command? Thanks – Jorge Orpinel Pérez Dec 29 '22 at 01:34
  • So, with "zero trust", I am trying to efficiently say that someone might try to check in a goofy filename. If I write a script that doesn't handle that goofy file name properly, `rm` runs off and does something unintentional. Ideally, I can handle or guarantee that only a certain restriction of all possible file names is going to hit my scripts. Or, I spend the extra effort and write something more robust. – Chris Jan 18 '23 at 20:55
  • On the use case, I have a local cache + checkout of files that I'd like to automatically destroy or "time out", and it makes sense. There are an arbitrary number of situations that make this use case (destroying everything local) important. The specifics are, imo, out of scope. – Chris Jan 18 '23 at 20:57

1 Answer

There is no documentation on allowed file names in DVC. The behavior you are seeing comes from how DVC currently normalizes path names: it uses urllib.parse.urlsplit and urllib.parse.urlunsplit, and urlsplit removes the newline because it is not a valid character in an RFC-compliant URL. DVC needs to support both local paths and remote URL paths like s3://bucket/object/path, so currently it treats everything as a URL.
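The stripping can be reproduced outside DVC with the standard library alone. This is a minimal sketch, not DVC's actual normalization code, and it assumes a recent CPython release in which urlsplit drops tab, carriage return, and newline characters from its input:

```python
from urllib.parse import urlsplit, urlunsplit

# The file name from the question, with an embedded newline.
name = "foo\nbar.txt"

# Split and re-join, as a URL-style normalization step might do.
parts = urlsplit(name)
roundtrip = urlunsplit(parts)

# repr() makes it visible whether the newline survived the round trip.
print(repr(parts.path))
print(repr(roundtrip))
```

On such a Python version the newline does not survive the round trip, which is consistent with the 'foobar.txt' shown in the error message.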

The intended behavior is that DVC should support any character that is valid for your local filesystem, so this is a bug: DVC should account for characters that are invalid in URLs but valid on local filesystems. I've opened a report which you can follow for further updates: https://github.com/iterative/dvc-objects/issues/177
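Until then, and given that the intended behavior is to allow anything the filesystem allows, a cleanup script is safest if it never passes file names through a shell. The following is a minimal sketch of that approach, assuming a POSIX filesystem; remove_tree_contents is a hypothetical helper name, not part of DVC, and it deletes everything under the directory it is given:

```python
import os
import tempfile
from pathlib import Path


def remove_tree_contents(root: Path) -> list[str]:
    """Delete everything under root (but not root itself).

    Names are passed straight to the OS, never to a shell, so newlines,
    spaces, or leading dashes in a name cannot change what gets deleted.
    Returns the names of the removed files.
    """
    root = root.resolve()
    removed = []
    # topdown=False visits files before the directories that contain them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            (Path(dirpath) / name).unlink()
            removed.append(name)
        for name in dirnames:
            target = Path(dirpath) / name
            if target.is_symlink():
                # Remove the link itself; never follow it out of root.
                target.unlink()
            else:
                target.rmdir()
    return removed


# Demonstration in a throwaway directory with awkward names.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "foo\nbar.txt").touch()
    (demo_root / "-rf").touch()
    print(sorted(remove_tree_contents(demo_root)))
```

Because each name goes to unlink or rmdir as a single argument, the script does not need any guarantee about which characters DVC lets through.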

pmrowla