Store json-like hierarchical data as nested directory tree?

Question

TLDR

I am looking for an existing convention to encode / serialize tree-like data in a directory structure, split into small files instead of one big file.

Background

There are different scenarios where we want to store tree-like data in a file, which can then be tracked in git. Json files can express dependencies for a package manager (e.g. composer for php, npm for node.js). Yml files can define routes, test cases, etc.

Typically a "tree structure" is a combination of key-value lists and "serial" lists, where each value can again be a tree structure.

Very often the order of associative keys is irrelevant, and should ideally be normalized to alphabetic order.

One problem when storing a big tree structure in a single file, be it json or yml, which is then tracked with git, is that you get plenty of merge conflicts if different branches add and remove entries in the same key-value list.

Especially for key-value lists where the order is irrelevant, it would be more git-friendly to store each sub-tree in a separate file or directory, instead of storing them all in one big file.

Technically it should be possible to create a directory structure that is as expressive as json or yml.

Performance concerns can be overcome with caching. If the files are going to be tracked in git, we can assume they are going to be unchanged most of the time.

The main challenges:
- How to deal with "special characters" that cause problems in some or most file systems when used in a file or directory name?
- If I need to encode or disambiguate special characters, how can I still keep the names pleasant to the eye?
- How to deal with limits on file name length in some file systems?
- How to deal with other file system quirks, e.g. case insensitivity? Is this even still a thing?
- How to express serial lists, which might contain key-value lists as children? Serial lists cannot be expressed as directories, so their children have to live within the same file.
- How can I avoid reinventing the wheel by creating my own made-up "convention" that nobody else uses?

Desired features:
- As expressive as json or yml.
- git-friendly.
- Machine-readable and -writable.
- Human-readable and -editable, perhaps with limitations.
- Ideally it should use known formats (json, yml) for structures and values that are expressed within a single file.

Naive approach

Of course the first idea would be to use yml files for literal values and serial lists, and directories for key-value lists (in cases where the order does not matter). In a key-value list, the file or directory names are interpreted as keys, the files and subdirectories as values.

This has some limitations, because not every possible key that would be valid in json or yml is also a valid file name in every file system. The most obvious example would be a slash.
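To make the naive approach concrete, here is a minimal sketch of it, not an existing convention. It assumes JSON instead of yml for the leaf files so that it only needs the standard library; the names `dump_tree` and `load_tree` are made up for illustration. Key-value lists become directories, everything else (including serial lists and whatever they contain) stays in a single file, and keys that cannot serve as file names are simply rejected.

```python
import json
import tempfile
from pathlib import Path


def dump_tree(data, path: Path) -> None:
    """Write dicts as directories and every other value as a JSON file."""
    if not isinstance(data, dict):
        path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
        return
    path.mkdir(parents=True, exist_ok=True)
    for key, value in data.items():
        # Keys double as file names here, which is exactly the limitation
        # described above: some valid keys are not valid names.
        if not isinstance(key, str) or "/" in key or key in ("", ".", ".."):
            raise ValueError(f"key {key!r} cannot be used as a file name")
        name = key if isinstance(value, dict) else key + ".json"
        dump_tree(value, path / name)


def load_tree(path: Path):
    """Inverse of dump_tree: directories become dicts, files are parsed."""
    if path.is_dir():
        return {
            (child.name if child.is_dir() else child.stem): load_tree(child)
            for child in sorted(path.iterdir())
        }
    return json.loads(path.read_text(encoding="utf-8"))


data = {
    "spam": {"spam": "spam", "egg": "sausage"},
    "baked beans": ["spam", "spam", "bacon"],
}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "confdir"
    dump_tree(data, root)
    restored = load_tree(root)
```

The sketch ignores several of the challenges listed earlier: it does not handle name collisions on case-insensitive file systems, a key like "a" next to a sub-tree keyed "a.json" would clash, and an empty key-value list becomes an empty directory, which git does not track.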

Question

I have different ideas how I would do this myself.

But I am really looking for some kind of convention for this that already exists.

Related questions

Persistence: Data Trees stored as Directory Trees
This is asking about performance, and about using the filesystem like a database - I think.
I am less interested in performance (caching makes it irrelevant), and more about the actual storage format / convention.

flyx · Accepted Answer · 2019-12-31T10:18:58.783

The closest thing I can think of that could be seen as some kind of convention for doing this are Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/, instead of having a single file /etc/exim4/exim4.conf. There are multiple reasons for doing this:

- Some configuration may be provided by the package manager (e.g. linking to other services that are installed via the package manager), while other parts are user-defined. Since there would be a conflict if the user edited a file provided by the package manager, they can instead create a new file and enter additional configuration there.
- For large configuration files (like for exim4), it's easier to navigate the configuration if you have multiple files for different concerns (hardcore vim users might disagree).
- You can enable / disable parts of the configuration more easily by renaming / moving the file that contains a certain part.

We can learn a bit from this: separation into distinct files should happen if the semantics of the content are orthogonal, i.e. the semantics of one file do not depend on the semantics of another file. This is of course a rule for sibling files; we cannot really deduce rules for serializing a tree structure as a directory tree from it. However, we can definitely see reasons for not splitting every value into its own file.

You mention problems of encoding special characters into a file name. You will only have this problem if you go against conventions! The implicit convention on file and directory names is that they act as locator / ID for files, never as content. Again, we can learn a bit from Linux config files: Usually, there is a master file that contains an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files. The path to those files is irrelevant for the semantics of their content. Technically, we can do something similar with YAML.

Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):

spam:
  spam: spam
  egg: sausage
baked beans:
- spam
- spam
- bacon

A possible transformation would be this (read stuff ending with / as directory, : starts file content):

confdir/
  main.yaml:
    spam: !include spammap/main.yaml
    baked beans: !include beans/
  spammap/
    main.yaml:
      spam: !include spam.yaml
      egg: !include egg.yaml
    spam.yaml:
      spam
    egg.yaml:
      sausage
  beans/
    1.yaml:
      spam
    2.yaml:
      spam
    3.yaml:
      bacon

(In YAML, !include is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as single document.)

As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: a reference to a file loads the content of that file; a reference to a directory generates a sequence in which each item's value is the content of one file in that directory, sorted by file name. The file and directory names are never part of the content; sometimes I opted to name them differently (e.g. baked beans -> beans/) to avoid possible file system problems (spaces in file names in this case, usually not a serious problem nowadays). Also, I adhere to the file name extension convention (having the files carry .yaml). This would be quirkier if you put content into the file names.

I named the starting file on each level main.yaml (not needed in beans/ since it's a sequence). While the exact name is arbitrary, this is a convention used in several other tools, e.g. Python with __init__.py or the Nix package manager with default.nix. Then I placed additional files or directories alongside this main file.

Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tags functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}.
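The JSON variant mentioned above could look roughly like the following sketch. It assumes that an include is written as an object whose only key is "!include" and that paths are relative to the including file; the names `load_with_includes` and `_resolve` are hypothetical. It supports both kinds of include described earlier: a file include yields that file's content, a directory include yields a sequence of the contents of its files, sorted by file name.

```python
import json
from pathlib import Path

INCLUDE_KEY = "!include"  # assumed marker; JSON itself has no tag mechanism


def load_with_includes(path: Path):
    """Load a JSON file and recursively resolve {"!include": "..."} values."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return _resolve(data, path.parent)


def _resolve(value, basedir: Path):
    if isinstance(value, dict):
        if set(value) == {INCLUDE_KEY}:
            target = basedir / value[INCLUDE_KEY]
            if target.is_dir():
                # Directory include: one sequence item per file. The sort is
                # lexicographic, so "10.json" comes before "2.json" unless
                # the names are zero-padded.
                return [load_with_includes(f)
                        for f in sorted(target.glob("*.json"))]
            return load_with_includes(target)
        return {k: _resolve(v, basedir) for k, v in value.items()}
    if isinstance(value, list):
        return [_resolve(v, basedir) for v in value]
    return value
```

Calling `load_with_includes(Path("confdir/main.json"))` on a JSON version of the layout above would return the whole hierarchy as one nested structure. The sketch has no protection against include cycles and, unlike the YAML approach, does not remember which file a value came from.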

To sum up: While there is not directly a convention how to do what you want, parts of the problem have been solved at different places and you can inherit wisdom from that.

Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names are just ascending numbers, and there is no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while staying transparent to the user (data can be accessed like a normal dict structure). It remembers which file each value was loaded from and writes it back to that file.

import os.path
from pathlib import Path

import yaml
from yaml.reader import Reader
from yaml.scanner import Scanner
from yaml.parser import Parser
from yaml.composer import Composer
from yaml.constructor import SafeConstructor
from yaml.resolver import Resolver
from yaml.emitter import Emitter
from yaml.serializer import Serializer
from yaml.representer import SafeRepresenter

class SplitValue(object):
  """This is a value that should be written into its own YAML file."""

  def __init__(self, content, path = None):
    self._content = content
    self._path = path

  def getval(self):
    return self._content

  def setval(self, value):
    self._content = value

  def __repr__(self):
    return self._content.__repr__()

class TransparentContainer(object):
  """Makes SplitValues transparent to the user."""

  def __getitem__(self, key):
    val = super(TransparentContainer, self).__getitem__(key)
    return val.getval() if isinstance(val, SplitValue) else val

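  def __setitem__(self, key, value):
    # Look up the raw stored value; a missing key / index means a plain insert.
    try:
      val = super(TransparentContainer, self).__getitem__(key)
    except (KeyError, IndexError):
      val = None
    if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
      val.setval(value)
    else:
      super(TransparentContainer, self).__setitem__(key, value)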

class TransparentList(TransparentContainer, list):
  pass

class TransparentDict(TransparentContainer, dict):
  pass


class DirectoryAwareFileProcessor(object):
  def __init__(self, path, mode):
    self._basedir = os.path.dirname(path)
    self._file = open(path, mode)

  def close(self):
    try:
      self._file.close()
    finally:
      self.dispose() # implemented by PyYAML

  # __enter__ / __exit__ to use this in a `with` construct
  def __enter__(self):
    return self

  def __exit__(self, type, value, traceback):
    self.close()

class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
    Parser, Composer, SafeConstructor, Resolver):
  """Loads YAML file from a directory structure."""
  def __init__(self, path):
    DirectoryAwareFileProcessor.__init__(self, path, 'r')
    Reader.__init__(self, self._file)
    Scanner.__init__(self)
    Parser.__init__(self)
    Composer.__init__(self)
    SafeConstructor.__init__(self)
    Resolver.__init__(self)

def split_value_constructor(loader, node):
  path = loader.construct_scalar(node)
  with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
    return SplitValue(childLoader.get_single_data(), path)

FilesystemLoader.add_constructor(u'!include', split_value_constructor)

def transp_dict_constructor(loader, node):
  ret = TransparentDict()
  ret.update(loader.construct_mapping(node, deep=True))
  return ret

# override constructor for !!map, the default resolved tag for mappings
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
    transp_dict_constructor)

def transp_list_constructor(loader, node):
  ret = TransparentList()
  # extend, not append: each item of the sequence becomes one list element
  ret.extend(loader.construct_sequence(node, deep=True))
  return ret

# like above, for !!seq
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
    transp_list_constructor)


class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
    Serializer, SafeRepresenter, Resolver):
  def __init__(self, path):
    DirectoryAwareFileProcessor.__init__(self, path, 'w')
    Emitter.__init__(self, self._file)
    Serializer.__init__(self)
    SafeRepresenter.__init__(self)
    Resolver.__init__(self)

    self.__next_unique_name = 1
    Serializer.open(self)

  def gen_unique_name(self):
    val = self.__next_unique_name
    self.__next_unique_name = self.__next_unique_name + 1
    return str(val)

  def close(self):
    try:
      Serializer.close(self)
    finally:
      DirectoryAwareFileProcessor.close(self)

def split_value_representer(dumper, data):
  if data._path is None:
    if isinstance(data._content, TransparentContainer):
      data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
    else:
      data._path = dumper.gen_unique_name() + ".yaml"
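  # create the target directory relative to the including file's directory
  Path(os.path.join(dumper._basedir, os.path.dirname(data._path))).mkdir(
      parents=True, exist_ok=True)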
  with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
    childDumper.represent(data._content)
  return dumper.represent_scalar(u'!include', data._path)

yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)

def transp_dict_representer(dumper, data):
  return dumper.represent_dict(data)

yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)

def transp_list_representer(dumper, data):
  return dumper.represent_list(data)

yaml.add_representer(TransparentList, transp_list_representer, FilesystemDumper)

# example usage:

# explicitly specify values that should be split.
myData = TransparentDict({
  "spam": SplitValue({
    "spam": SplitValue("spam", "spam.yaml"),
    "egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})

with FilesystemDumper("root.yaml") as dumper:
  dumper.represent(myData)

# load values from stored files.
# The loaded data remembers which values have been in which files.
with FilesystemLoader("root.yaml") as loader:
  loaded = loader.get_single_data()

# modify a value as if it was a normal structure.
# actually updates a SplitValue
loaded["spam"]["spam"] = "baked beans"
# dumps the same structure as before, with the modified value.
with FilesystemDumper("root.yaml") as dumper:
  dumper.represent(loaded)

Interesting answer! One question though: If the file and directory names no longer correspond to data keys, now they become somewhat arbitrary. E.g. "spammap" for "spam" seems like a custom made-up name. To automate this, there would need to be a reproducible and repeatable pattern to generate these names. — donquixote, Dec 30 '19 at 20:25
The nice thing about your answer: For _reading_, we are only using known conventions from the YAML standard. Only for the writing, we need an additional convention (and implementation) to determine canonical file names and directory names. Humans could create files with non-canonical names, and then use a tool to "normalize". — donquixote, Dec 30 '19 at 20:33
Ideal would be a 1:1 mapping from string key to file name / dir name. This won't fly however, because the set of possible string keys is greater than the set of possible file names and directory names (due to length limits in some file systems). This applies especially if the file or dir name should be human-readable, and less so if we add some kind of hash into the name. Sometimes a file or dir will need to be renamed depending on other entries in the same directory. We want this to be a rare case, to minimize confusion in git. — donquixote, Dec 30 '19 at 20:36
A common pattern to generate file names is to take the corresponding map key and convert it to ASCII / remove spaces (to be perfect, you'd need to check for collisions). There are tools for that in most languages. With most YAML implementations, you can use special types for data that should be split into another file (e.g. derive from YAMLObject in PyYAML); you then only need to provide custom representation / construction for that type. You can even store non-canonical names in the object and later save modifications in the original file. — flyx, Dec 31 '19 at 00:59
So for the naming I would do a lossy sanitization first. Then I could either disambiguate when needed e.g. by appending numbers, and/or disambiguate preemptively by appending a hash. — donquixote, Dec 31 '19 at 05:39
The "special types for data that should be split" sounds complex enough to expand on it in the answer text. I assume what you mean is to preprocess the data before saving, replacing some key-value lists with typed objects? Then on read they would be deconverted to array again? Fyi I am working with PHP mostly, but perhaps others are interested in the Python perspective, and perhaps the general idea works cross language. — donquixote, Dec 31 '19 at 05:42
Btw I always thought the file ending should be *.yml not *.yaml. But perhaps both variations exist. — donquixote, Dec 31 '19 at 05:45
I added a POC with PyYAML. I have no idea whether you can do it similarly in PHP (it depends heavily on overriding the `[]` operator); if not, you'd need a preprocessing step from the original structure to a structure with `SplitValue`. You'd also need to store the original layout separately when loading to ensure you save modified data in the same way. And btw, `.yaml` is recommended by the [FAQ](https://yaml.org/faq.html). — flyx, Dec 31 '19 at 10:22
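
The naming scheme discussed in the comments above (lossy sanitization first, then preemptive disambiguation with a hash) could be sketched as follows. This is only an illustration, not an established convention: the function name `key_to_filename`, the 40-character limit on the readable part, and the 8 hex digits of SHA-256 are all arbitrary assumptions.

```python
import hashlib
import re
import unicodedata


def key_to_filename(key: str, max_stem: int = 40, ext: str = ".yaml") -> str:
    """Derive a readable, reproducible file name from an arbitrary map key."""
    # Lossy step: fold to ASCII and lower-case (for case-insensitive file
    # systems), then replace every run of other characters with "-".
    ascii_key = (unicodedata.normalize("NFKD", key)
                 .encode("ascii", "ignore").decode("ascii"))
    stem = re.sub(r"[^a-z0-9._-]+", "-", ascii_key.lower())
    stem = stem[:max_stem].strip("-.") or "key"
    # Disambiguating step: a short hash of the exact original key, so that
    # keys which sanitize to the same stem still get different names.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]
    return f"{stem}.{digest}{ext}"
```

Because the name depends only on the key itself, a file never has to be renamed when sibling entries are added or removed. The price is that the mapping is one-way: the original key cannot be recovered from the file name and has to be stored elsewhere, for example in the including file as in the answer.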