
I'm building an API that deals with serializing data, and I'd prefer to add support for as much static analysis as possible. I'm inspired by Django's Meta pattern for declaring metadata on a class, and I use an internal library, similar to pydantic, that inspects type annotations for serialization.

I'd like the API to work something like:

class Base:
    Data: type

    def __init__(self):
        self.data = self.Data()


class Derived(Base):
    class Data:
        # Actually, more like a pydantic model.
        # Used for serialization.
        attr: str = ''


obj = Derived()
type(obj.data)  # Derived.Data
obj.data.attr = 'content'

This works well and is readable; however, it doesn't seem to support static analysis at all. How do I annotate self.data in Base so that I have proper type information on obj?

reveal_type(obj.data)  # Derived.Data
reveal_type(obj.data.attr)  # str

obj.data.attr = 7  # Should be static error
obj.data.other = 7  # Should be static error

I might write self.data: typing.Self.Data, but this obviously doesn't work.


I was able to get something close with typing.Generic and forward references:

import typing

T = typing.TypeVar('T')

class Base(typing.Generic[T]):
    Data: type[T]

    def __init__(self):
        self.data: T = self.Data()

class Derived(Base['Derived.Data']):
    class Data:
        attr: str = ''

But it's not DRY and it doesn't enforce that the annotation and runtime type actually match. For example:

class Derived(Base[SomeOtherType]):
    class Data:  # Should be static error
        attr: str = ''

obj = Derived()
type(obj.data)  # Derived.Data
reveal_type(obj.data)  # SomeOtherType

I could also require that the derived class provide an annotation for data, but this suffers from issues similar to the typing.Generic approach.

class Derived(Base):
    data: SomeOtherClass  # should be 'Data'

    class Data:  # should be a static error
        attr: str = ''

To attempt to fix this, I tried writing some validation logic in __init_subclass__ to ensure that T matches cls.Data; however, this is brittle and doesn't work in all cases. It also forbids creating any abstract derived class that doesn't define Data.
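
Roughly, the kind of validation I mean looks like this (a simplified sketch, not my exact code):

import typing

T = typing.TypeVar('T')

class Base(typing.Generic[T]):
    Data: type

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for base in getattr(cls, '__orig_bases__', ()):
            if typing.get_origin(base) is Base:
                (arg,) = typing.get_args(base)
                if isinstance(arg, (str, typing.ForwardRef)):
                    # Forward references like Base['Derived.Data'] cannot be
                    # resolved here, which is part of the brittleness.
                    continue
                # Note this also (wrongly) rejects abstract subclasses whose
                # type argument is still a TypeVar and which define no Data.
                if getattr(cls, 'Data', None) is not arg:
                    raise TypeError(f'{cls.__name__}.Data must be {arg!r}')

    def __init__(self):
        self.data: T = self.Data()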

  • Interesting idea! You could probably write a metaclass that inserts the `Base['Derived.Data']` parent for you, then you declare it as `Derived(metaclass=BaseMeta)` instead and you avoid both the repetition and the incorrect base declaration issues. – tzaman Dec 12 '22 at 20:40
  • I'm having a hard time coming up with a way to do that which mypy or PyCharm's type checker can understand... it seems both their support for metaclasses is very limited. – allemangD Dec 12 '22 at 21:28

3 Answers


This is actually non-trivial because you run into the classic problem of wanting to dynamically create types while simultaneously having static type checkers understand them: an obvious contradiction in terms.


Quick Pydantic digression

Since you mentioned Pydantic, I'll pick up on it. The way they solve it, greatly simplified, is by never actually instantiating the inner Config class. Instead, whenever you subclass BaseModel, a __config__ attribute is set on your class, and this attribute itself holds a class (i.e. an instance of type).

The class referenced by __config__ inherits from BaseConfig and is created dynamically by the ModelMetaclass constructor. In the process, it inherits all the attributes set by the model's base classes and overrides them with whatever you set in the inner Config.

You can see the consequences in this example:

from pydantic import BaseConfig, BaseModel

class Model(BaseModel):
    class Config:
        frozen = True

a = BaseModel()
b = Model()
a_conf = a.__config__
b_conf = b.__config__

assert isinstance(a_conf, type) and issubclass(a_conf, BaseConfig)
assert isinstance(b_conf, type) and issubclass(b_conf, BaseConfig)
assert not a_conf.frozen
assert b_conf.frozen

By the way, this is why you should not refer to the inner Config directly in your code. It will only have the attributes you set on that one class explicitly and nothing inherited, not even the defaults from BaseConfig. The documented way to access the full model config is via __config__.
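
For example, continuing the snippet above (pydantic v1 behavior; extra is just one of the BaseConfig defaults):

from pydantic import BaseModel

class Model(BaseModel):
    class Config:
        frozen = True

# The inner class itself only has what was set on it explicitly:
assert Model.Config.frozen
assert not hasattr(Model.Config, 'extra')  # no inherited defaults

# The dynamically created __config__ carries the full interface:
assert hasattr(Model.__config__, 'extra')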

This is also why there is no such thing as model instance config. Change an attribute of __config__ and you'll change it for the entire class/model:

from pydantic import BaseModel

foo = BaseModel()
bar = BaseModel()
assert not foo.__config__.frozen
bar.__config__.frozen = True
assert foo.__config__.frozen

Possible solutions

An important constraint of this approach is that it only really makes sense when you have some fixed type that all these dynamically created classes can inherit from. In the case of Pydantic it is BaseConfig, and the __config__ attribute is annotated accordingly, namely with type[BaseConfig], which allows a static type checker to infer the interface of that __config__ class.
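
Greatly simplified, that annotation looks something like this (a sketch of the pydantic v1 internals, not the literal source):

from typing import ClassVar

class BaseConfig:
    frozen: bool = False
    # ... plus all the other config options with their defaults

class BaseModel:
    # Annotated with the fixed base class, so a static type checker can
    # infer the interface of whatever class __config__ ends up holding:
    __config__: ClassVar[type[BaseConfig]] = BaseConfig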

You could of course go the opposite way and allow literally any inner class to be defined for Data on your classes, but this probably defeats the purpose of your design. It would work fine, though, and you could hook into class creation via the metaclass to enforce that Data is set and is a class, as sketched below. You could even enforce that specific attributes on that inner class are set, but at that point you might as well have a common base class for that.
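
A minimal sketch of such a metaclass hook (all names here are made up):

class EnforceDataMeta(type):
    def __new__(mcs, name, bases, namespace, **kwargs):
        cls = super().__new__(mcs, name, bases, namespace, **kwargs)
        # Skip the root class itself; every subclass must provide (or
        # inherit) a Data attribute that is a class.
        if bases and not isinstance(getattr(cls, 'Data', None), type):
            raise TypeError(f"{name} must define an inner 'Data' class")
        return cls

class Base(metaclass=EnforceDataMeta):
    pass

class Good(Base):
    class Data:
        attr: str = ''

# This would raise the TypeError at class creation time:
# class Bad(Base):
#     pass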

If you wanted to replicate the Pydantic approach, I can give you a very crude example of how this can be accomplished, with the basic ideas shamelessly stolen from (or inspired by) the Pydantic code.

You can set up a BaseData class and fully define its attributes for the annotations and type inferences down the line. Then you set up your custom meta class. In its __new__ method you perform the inheritance loop to dynamically build the new BaseData subclass and assign the result to the __data__ attribute of the new outer class:

from __future__ import annotations
from typing import ClassVar, cast

class BaseData:
    foo: str = "abc"
    bar: int = 1

class CustomMeta(type):
    def __new__(
        mcs,
        name: str,
        bases: tuple[type, ...],
        namespace: dict[str, object],
        **kwargs: object,
    ) -> CustomMeta:
        # Start with the default and fold in the __data__ of all base
        # classes that went through this metaclass before:
        data = BaseData
        for base in reversed(bases):
            if issubclass(base, Base):
                data = inherit_data(base.__data__, data)
        # Finally fold in the inner `Data` class of the class currently
        # being created (if it defines one):
        own_data = cast(type[BaseData], namespace.get('Data'))
        data = inherit_data(own_data, data)
        namespace["__data__"] = data
        cls = super().__new__(mcs, name, bases, namespace, **kwargs)
        return cls

def inherit_data(
    own_data: type[BaseData] | None,
    parent_data: type[BaseData],
) -> type[BaseData]:
    if own_data is None:
        base_classes: tuple[type[BaseData], ...] = (parent_data,)
    elif own_data == parent_data:
        base_classes = (own_data,)
    else:
        base_classes = own_data, parent_data
    return type('Data', base_classes, {})

...  # more code below...

With this you can now define your Base class, annotate __data__ in its namespace with type[BaseData], and assign BaseData to its Data attribute. The inner Data classes on all derived classes can now define just those attributes that are different from their parents' Data. To demonstrate that this works, try this:

...  # Code from above

class Base(metaclass=CustomMeta):
    __data__: ClassVar[type[BaseData]]
    Data = BaseData


class Derived1(Base):
    class Data:
        foo = "xyz"


class Derived2(Derived1):
    class Data:
        bar = 42


if __name__ == "__main__":
    obj0 = Base()
    obj1 = Derived1()
    obj2 = Derived2()
    print(obj0.__data__.foo, obj0.__data__.bar)  # abc 1
    print(obj1.__data__.foo, obj1.__data__.bar)  # xyz 1
    print(obj2.__data__.foo, obj2.__data__.bar)  # xyz 42

Static type checkers will of course also know what to expect from the __data__ attribute and IDEs should give proper auto-suggestions for it. If you add reveal_type(obj2.__data__.foo) and reveal_type(obj2.__data__.bar) at the bottom and run mypy over the code, it will output that the revealed types are str and int respectively.


Caveat

An important drawback of this approach is that the inheritance is abstracted away in such a way that a static type checker treats the inner Data class as its own class, unrelated to BaseData in any way. That makes sense, because that is exactly what it is; it just inherits from object.

Thus, your IDE will not give you any suggestions about the attributes you can override on Data. This is the same deal with Pydantic, which is one of the reasons they roll their own custom plugins, for mypy and PyCharm for example. The latter allows PyCharm to suggest the BaseConfig attributes when you are writing the inner Config class on any derived model.

  • Thanks so much for all the information! It'll take me a bit to digest this and really grok what's going on. To the questions/comments on the answer: In practice, `Data` would have a common base class that inspects the annotations and generates corresponding serialization logic. There is a meaningful `BaseData`, but the power of this pattern is that it allows each type to add fields to its nested `Data` which will be serialized (separate from other instance attributes). I mention `pydantic` and omitted that base class because I didn't think it was relevant, but turns out it is! Thanks. – allemangD Dec 13 '22 at 15:05
  • @allemangD Well, that might make it easier from a certain perspective. If you are OK with the user being forced to always set the inner class correctly by inheriting from `BaseData`, i.e. `class Data(BaseData): ...`, you could do all sorts of useful things and remain type safe and transparent to static analysis tools. The trade-off is that then you **cannot** omit the inheritance. You _could_ again do that dynamically (as shown above), but that would be entirely opaque to a static type checker. – Daniil Fajnberg Dec 13 '22 at 15:20
  • To be clear: my goal is for instances of `Derived` to have an attribute which is an instance of `Derived.Data`; and `Derived.Data` may add fields to `DataBase`. It would be best if the type checker could see those fields on `obj.data`. – allemangD Dec 13 '22 at 15:24
  • @allemangD Ah, I misunderstood you. Your first comment was just a polite way of saying that none of this is actually useful.^^ You are more committed to your initially described setup than I thought. Maybe someone will find a more suitable solution for you. – Daniil Fajnberg Dec 13 '22 at 15:33
  • I'm getting the sense that the initial requirement is impossible, at least without a mypy extension.. I'm racking my brain to see if there's some way to avoid creating any instance of `Data` or `__data__`... there might be a way to use the metaclass to build descriptors which can be accessed via `__data__` or a similar class attribute? Also I wouldn't say it's useless, digging in is teaching me a whole bunch about how pydantic handles these things that seems _almost_ what I need, but not quite. – allemangD Dec 13 '22 at 15:37

I know I already provided one answer, but after the little back-and-forth, I thought of another possible solution involving an entirely different design from what I proposed earlier. I think it is more readable if I post it as a second answer.


No inner classes; just a single type argument

See here for the details about how you can access the type argument provided during subclassing.

from typing import Generic, TypeVar, get_args, get_origin


D = TypeVar("D", bound="BaseData")


class BaseData:
    foo: str = "abc"
    bar: int = 1


class Base(Generic[D]):
    __data__: type[D]

    @classmethod
    def __init_subclass__(cls, **kwargs: object) -> None:
        super().__init_subclass__(**kwargs)
        for base in cls.__orig_bases__:  # type: ignore[attr-defined]
            origin = get_origin(base)
            if origin is None or not issubclass(origin, Base):
                continue
            type_arg = get_args(base)[0]
            # Do not set the attribute for GENERIC subclasses!
            if not isinstance(type_arg, TypeVar):
                cls.__data__ = type_arg
                return

Usage:

class Derived1Data(BaseData):
    foo = "xyz"


class Derived1(Base[Derived1Data]):
    pass


class Derived2Data(Derived1Data):
    bar = 42
    baz = True


class Derived2(Base[Derived2Data]):
    pass


if __name__ == "__main__":
    obj1 = Derived1()
    obj2 = Derived2()
    assert "xyz" == obj1.__data__.foo == obj2.__data__.foo
    assert 42 == obj2.__data__.bar
    assert not hasattr(obj1.__data__, "baz")
    assert obj2.__data__.baz

Adding reveal_type(obj1.__data__) and reveal_type(obj2.__data__) for mypy will show type[Derived1Data] and type[Derived2Data] respectively.

The downside is obvious: it is not the "inner class" design you had in mind.

The upside is that it is entirely type safe, while requiring minimal code. The user merely needs to provide his own BaseData subclass as a type argument when subclassing Base.


Adding the instance (optional)

If you want __data__ to be an instance attribute holding an actual instance of the specified BaseData subclass, this is also easily accomplished. Here is a crude but working example:

from typing import Generic, TypeVar, get_args, get_origin


D = TypeVar("D", bound="BaseData")


class BaseData:
    foo: str = "abc"
    bar: int = 1

    def __init__(self, **kwargs: object) -> None:
        self.__dict__.update(kwargs)


class Base(Generic[D]):
    __data_cls__: type[D]
    __data__: D

    @classmethod
    def __init_subclass__(cls, **kwargs: object) -> None:
        super().__init_subclass__(**kwargs)
        for base in cls.__orig_bases__:  # type: ignore[attr-defined]
            origin = get_origin(base)
            if origin is None or not issubclass(origin, Base):
                continue
            type_arg = get_args(base)[0]
            # Do not set the attribute for GENERIC subclasses!
            if not isinstance(type_arg, TypeVar):
                cls.__data_cls__ = type_arg
                return

    def __init__(self, **data_kwargs: object) -> None:
        self.__data__ = self.__data_cls__(**data_kwargs)

Usage:

class DerivedData(BaseData):
    foo = "xyz"
    baz = True


class Derived(Base[DerivedData]):
    pass


if __name__ == "__main__":
    obj = Derived(baz=False)
    print(obj.__data__.foo)  # xyz
    print(obj.__data__.bar)  # 1
    print(obj.__data__.baz)  # False

Again, a static type checker will know that __data__ is of the DerivedData type.

Though, I suppose at that point you might as well just have the user provide his own instance of a BaseData subclass during initialization of Derived. Maybe this is a cleaner and more intuitive design anyway.
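
That design could look something like this (a sketch reusing BaseData and D from above):

from typing import Generic, TypeVar

D = TypeVar("D", bound="BaseData")

class BaseData:
    foo: str = "abc"
    bar: int = 1

    def __init__(self, **kwargs: object) -> None:
        self.__dict__.update(kwargs)

class Base(Generic[D]):
    def __init__(self, data: D) -> None:
        # The caller supplies the instance; no runtime reflection needed.
        self.__data__ = data

class DerivedData(BaseData):
    foo = "xyz"

class Derived(Base[DerivedData]):
    pass

obj = Derived(DerivedData(bar=42))  # reveal_type(obj.__data__) -> DerivedData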

I think your initial idea will only work if you roll your own plugins for static type checkers.

  • Your implementation of `__init_subclass__` is far more robust than anything I managed to create - I'm impressed! "Maybe this is a cleaner and more intuitive design anyway." I think I will follow this advice and abandon the "magic" in inferring the type of the inner class. If I require the derived class to provide an annotation for `data` rather than a generic, I could keep the type checker happy but also avoid parsing the generic types. It would also be agnostic to using an inner or outer class. I'll try this and create a separate answer if it works. Thanks again for the guidance! – allemangD Dec 13 '22 at 18:33

It is not completely DRY, but given the advice from @daniil-fajnberg, I think this is probably preferable. Explicit is better than implicit, right?

The idea is to require derived classes to specify a type annotation for data; type checkers will be happy since the derived classes all annotate with the correct type, and the base class only needs to inspect that single annotation to determine the runtime type.

from typing import ClassVar, TypeVar, get_type_hints


class Base:
    __data_cls__: ClassVar[type]

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)

        hints = get_type_hints(cls)

        # Derived classes that don't annotate `data` (e.g. abstract
        # intermediate classes) are simply skipped.
        if 'data' in hints:
            if isinstance(hints['data'], TypeVar):
                raise TypeError('Cannot infer __data_cls__ from TypeVar.')
            cls.__data_cls__ = hints['data']

    def __init__(self):
        self.data = self.__data_cls__()

Usage looks like this. Note that the names of the data type and the data attribute are no longer coupled.

class Derived1(Base):
    class TheDataType:
        foo: str = ''
        bar: int = 77

    data: TheDataType


print('Derived1:')
obj1 = Derived1()
reveal_type(obj1.data)  # Derived1.TheDataType
reveal_type(obj1.data.foo)  # str
reveal_type(obj1.data.bar)  # int

And that decoupling means you are not required to use an inner type.

class Derived2(Base):
    data: Derived1.TheDataType


print('Derived2:')
obj2 = Derived2()
reveal_type(obj2.data)  # Derived1.TheDataType
reveal_type(obj2.data.foo)  # str
reveal_type(obj2.data.bar)  # int

I don't think it's possible to support generic subclasses in this solution. It might be possible to adapt the code in https://stackoverflow.com/a/74788026/4672189 to fetch the runtime type in certain situations.
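
For example, given the Base class from above, a generic intermediate class is rejected as soon as it is created, because its annotation is still a TypeVar at class creation time:

from typing import Generic, TypeVar

T = TypeVar('T')

try:
    class GenericDerived(Base, Generic[T]):
        data: T
except TypeError as e:
    print(e)  # Cannot infer __data_cls__ from TypeVar.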
