The closest thing I can think of that could be seen as some kind of convention for doing this are Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/
instead of having a single file /etc/exim/exim4.conf
. There are multiple reasons doing this:
- Some configuration may be provided by the package manager (e.g. linking to other services that are installed via package manager), while other parts are user-defined. Since there would be a conflict if the user edits a file provided by the package manager, they can instead create a new file and enter additional configuration there.
- For large configuration files (like for exim4), it's easier to navigate the configuration if you have multiple files for different concerns (hardcore vim users might disagree).
- You can enable / disable parts of the configuration easier by renaming / moving the file that contains a certain part.
We can learn a bit from this: Separation into distinct files should happen if the semantic of the content is orthogonal, i.e. the semantic of one file does not depend on the semantic of another file. This is of course a rule for sibling files; we cannot really deduct rules for serializing a tree structure as directory tree from it. However, we can definitely see reasons for not splitting every value in an own file.
You mention problems of encoding special characters into a file name. You will only have this problem if you go against conventions! The implicit convention on file and directory names is that they act as locator / ID for files, never as content. Again, we can learn a bit from Linux config files: Usually, there is a master file that contains an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files. The path to those files is irrelevant for the semantics of their content. Technically, we can do something similar with YAML.
Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):
spam:
spam: spam
egg: sausage
baked beans:
- spam
- spam
- bacon
A possible transformation would be this (read stuff ending with /
as directory, :
starts file content):
confdir/
main.yaml:
spam: !include spammap/main.yaml
baked beans: !include beans/
spammap/
main.yaml:
spam: !include spam.yaml
egg: !include egg.yaml
spam.yaml:
spam
egg.yaml:
sausage
beans/
1.yaml:
spam
2.yaml:
spam
3.yaml:
bacon
(In YAML, !include
is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as single document.)
As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: A reference to a file will load the content of that file; a reference to a directory will generate a sequence where each item's value is the content of one file in that directory, sorted by file name. As you can see, the file and directory names are never part of the content, sometimes I opted to name them differently (e.g. baked beans
-> beans/
) to avoid possible file system problems (spaces in filenames in this case – usually not a serious problem nowadays). Also, I adhere to the filename extension convention (having the files carry .yaml
). This would be more quirky if you put content into the file names.
I named the starting file on each level main.yaml
(not needed in beans/
since it's a sequence). While the exact name is arbitrary, this is a convention used in several other tools, e.g. Python with __init__.py
or the Nix package manager with default.nix
. Then I placed additional files or directories besides this main file.
Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tags functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}
.
To sum up: While there is not directly a convention how to do what you want, parts of the problem have been solved at different places and you can inherit wisdom from that.
Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names will be ascending numbers, no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while being transparent to the user (data can be accessed like a normal dict structure). It remembers file names stuff has been loaded from and stores to those files again.
import os.path
from pathlib import Path
import yaml
from yaml.reader import Reader
from yaml.scanner import Scanner
from yaml.parser import Parser
from yaml.composer import Composer
from yaml.constructor import SafeConstructor
from yaml.resolver import Resolver
from yaml.emitter import Emitter
from yaml.serializer import Serializer
from yaml.representer import SafeRepresenter
class SplitValue(object):
"""This is a value that should be written into its own YAML file."""
def __init__(self, content, path = None):
self._content = content
self._path = path
def getval(self):
return self._content
def setval(self, value):
self._content = value
def __repr__(self):
return self._content.__repr__()
class TransparentContainer(object):
"""Makes SplitValues transparent to the user."""
def __getitem__(self, key):
val = super(TransparentContainer, self).__getitem__(key)
return val.getval() if isinstance(val, SplitValue) else val
def __setitem__(self, key, value):
val = super(TransparentContainer, self).__getitem__(key)
if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
val.setval(value)
else:
super(TransparentContainer, self).__setitem__(key, value)
class TransparentList(TransparentContainer, list):
pass
class TransparentDict(TransparentContainer, dict):
pass
class DirectoryAwareFileProcessor(object):
def __init__(self, path, mode):
self._basedir = os.path.dirname(path)
self._file = open(path, mode)
def close(self):
try:
self._file.close()
finally:
self.dispose() # implemented by PyYAML
# __enter__ / __exit__ to use this in a `with` construct
def __enter__(self):
return self
def __exit__(self, type, value, traceback):
self.close()
class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
Parser, Composer, SafeConstructor, Resolver):
"""Loads YAML file from a directory structure."""
def __init__(self, path):
DirectoryAwareFileProcessor.__init__(self, path, 'r')
Reader.__init__(self, self._file)
Scanner.__init__(self)
Parser.__init__(self)
Composer.__init__(self)
SafeConstructor.__init__(self)
Resolver.__init__(self)
def split_value_constructor(loader, node):
path = loader.construct_scalar(node)
with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
return SplitValue(childLoader.get_single_data(), path)
FilesystemLoader.add_constructor(u'!include', split_value_constructor)
def transp_dict_constructor(loader, node):
ret = TransparentDict()
ret.update(loader.construct_mapping(node, deep=True))
return ret
# override constructor for !!map, the default resolved tag for mappings
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
transp_dict_constructor)
def transp_list_constructor(loader, node):
ret = TransparentList()
ret.append(loader.construct_sequence(node, deep=True))
return ret
# like above, for !!seq
FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
transp_list_constructor)
class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
Serializer, SafeRepresenter, Resolver):
def __init__(self, path):
DirectoryAwareFileProcessor.__init__(self, path, 'w')
Emitter.__init__(self, self._file)
Serializer.__init__(self)
SafeRepresenter.__init__(self)
Resolver.__init__(self)
self.__next_unique_name = 1
Serializer.open(self)
def gen_unique_name(self):
val = self.__next_unique_name
self.__next_unique_name = self.__next_unique_name + 1
return str(val)
def close(self):
try:
Serializer.close(self)
finally:
DirectoryAwareFileProcessor.close(self)
def split_value_representer(dumper, data):
if data._path is None:
if isinstance(data._content, TransparentContainer):
data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
else:
data._path = dumper.gen_unique_name() + ".yaml"
Path(os.path.dirname(data._path)).mkdir(exist_ok=True)
with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
childDumper.represent(data._content)
return dumper.represent_scalar(u'!include', data._path)
yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)
def transp_dict_representer(dumper, data):
return dumper.represent_dict(data)
yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)
def transp_list_representer(dumper, data):
return dumper.represent_list(data)
# example usage:
# explicitly specify values that should be split.
myData = TransparentDict({
"spam": SplitValue({
"spam": SplitValue("spam", "spam.yaml"),
"egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})
with FilesystemDumper("root.yaml") as dumper:
dumper.represent(myData)
# load values from stored files.
# The loaded data remembers which values have been in which files.
with FilesystemLoader("root.yaml") as loader:
loaded = loader.get_single_data()
# modify a value as if it was a normal structure.
# actually updates a SplitValue
loaded["spam"]["spam"] = "baked beans"
# dumps the same structure as before, with the modified value.
with FilesystemDumper("root.yaml") as dumper:
dumper.represent(loaded)