Strict YAML deserialization in Python with marshmallow

The problem

  • I want to read some not so simple config from .yaml file.
  • I have config structure described as dataclasses.
  • I want to all type checks have been performed and in case of invalid data exception will be raised.

So basically I want something like

def strict_load_yaml(yaml: str, loaded_type: Type[Any]):
    """ Here is some magic """
    pass

Enter fullscreen mode Exit fullscreen mode

And then use it like this:

@dataclass
class MyConfig:
    """ Here is object tree """
    pass

try:
    config = strict_load_yamp(open("config.yaml", "w").read(), MyConfig)
except Exception:
    logging.exception("Config is invalid")

Enter fullscreen mode Exit fullscreen mode

Config classes

Here is my config.py file with example dataclasses:

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


@dataclass
class BattleStationConfig:
    @dataclass
    class Processor:
        core_count: int
        manufacturer: str

    processor: Processor
    memory_gb: int
    led_color: Optional[Color] = None

Enter fullscreen mode Exit fullscreen mode

Solution that didn’t work

This is a very common pattern, right?
It must be very easy.
Just import standard yaml library and problem solved?

So I imported PyYaml and call load method:

from pprint import pprint

from yaml import load, SafeLoader


yaml = """ processor: core_count: 8 manufacturer: Intel memory_gb: 8 led_color: red """

loaded = load(yaml, Loader=SafeLoader)
pprint(loaded)

Enter fullscreen mode Exit fullscreen mode

and I have got:

{'led_color': 'red',
 'memory_gb': 8,
 'processor': {'core_count': 8, 'manufacturer': 'Intel'}}

Enter fullscreen mode Exit fullscreen mode

Yaml loaded just fine, but it is a dict.
No problem, I can pass it as **args constructor:

parsed_config = BattleStationConfig(**loaded)
pprint(parsed_config)

Enter fullscreen mode Exit fullscreen mode

and result will be:

BattleStationConfig(processor={'core_count': 8, 'manufacturer': 'Intel'}, memory_gb=8, led_color='red')

Enter fullscreen mode Exit fullscreen mode

Wow! Easy! But… Wait. Is processor field a dict? Damn it.

Python don’t perform type checking at constructor and do not parse Processor class.
Well, this is the time to go to stackowerflow.

Solution that required yaml tags and almost works

I’ve read stackowerflow answers and PyYaml documentation and have found out that you can mark your yaml doc with tags for types.
Your classes must be descendants of YAMLObject and so my config_with_tag.py will look like this:

from dataclasses import dataclass
from enum import Enum
from typing import Optional

from yaml import YAMLObject, SafeLoader


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"


@dataclass
class BattleStationConfig(YAMLObject):
    yaml_tag = "!BattleStationConfig"
    yaml_loader = SafeLoader

    @dataclass
    class Processor(YAMLObject):
        yaml_tag = "!Processor"
        yaml_loader = SafeLoader

        core_count: int
        manufacturer: str

    processor: Processor
    memory_gb: int
    led_color: Optional[Color] = None

Enter fullscreen mode Exit fullscreen mode

And loading code:

from pprint import pprint

from yaml import load, SafeLoader

from config_with_tag import BattleStationConfig


yaml = """ --- !BattleStationConfig processor: !Processor core_count: 8 manufacturer: Intel memory_gb: 8 led_color: red """

a = BattleStationConfig

loaded = load(yaml, Loader=SafeLoader)
pprint(loaded)

Enter fullscreen mode Exit fullscreen mode

And what I will get?

BattleStationConfig(processor=BattleStationConfig.Processor(core_count=8, manufacturer='Intel'), memory_gb=8, led_color='red')

Enter fullscreen mode Exit fullscreen mode

Good. But my YAML is full of tags and lost its readability. And Color is still string. So I can just add YAMLObject to parent classes? Right? No.

class Color(Enum, YAMLObject):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

Enter fullscreen mode Exit fullscreen mode

Will lead to:

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

Enter fullscreen mode Exit fullscreen mode

I didn’t find a quick way to resolve it. And I did want to add tags to my yaml, so I’ve decided to keep looking for a solution.

Solution with marshmallow

I found a recommendation to use marshmallow to parse dict generated from JSON object.
I decided that these cases are the same as mine only uses JSON instead of YAML.
And so I tried to use class_schema generator for dataclass schema:

from pprint import pprint

from yaml import load, SafeLoader
from marshmallow_dataclass import class_schema

from config import BattleStationConfig


yaml = """ processor: core_count: 8 manufacturer: Intel memory_gb: 8 led_color: red """

loaded = load(yaml, Loader=SafeLoader)
pprint(loaded)

BattleStationConfigSchema = class_schema(BattleStationConfig)

result = BattleStationConfigSchema().load(loaded)
pprint(result)

Enter fullscreen mode Exit fullscreen mode

And I get:

marshmallow.exceptions.ValidationError: {'led_color': ['Invalid enum member red']}

Enter fullscreen mode Exit fullscreen mode

So, marshmallow wants enum name, not value. I can change my yaml to:

processor:
  core_count: 8
  manufacturer: Intel
memory_gb: 8
led_color: RED

Enter fullscreen mode Exit fullscreen mode

And I will get my ideally deserialized object:

BattleStationConfig(processor=BattleStationConfig.Processor(core_count=8, manufacturer='Intel'), memory_gb=8, led_color=<Color.RED: 'red'>)

Enter fullscreen mode Exit fullscreen mode

But I felt there was a way to use my original yaml.
So I’ve explored marshmallow documentation and found following lines:

Setting by_value=True. This will cause both dumping and loading to use the value of the enum.

Turn out, you can pass this configuration to metadata dictionary of field generator from dataclasses like this:

@dataclass
class BattleStationConfig:
    led_color: Optional[Color] = field(default=None, metadata={"by_value": True})

Enter fullscreen mode Exit fullscreen mode

And I will get the object parsed from my original yaml.

Magic function

And after all I can collect my magic function:

def strict_load_yaml(yaml: str, loaded_type: Type[Any]):
    schema = class_schema(loaded_type)
    return schema().load(load(yaml, Loader=SafeLoader))

Enter fullscreen mode Exit fullscreen mode

This function can require additional set up for dataclass but solve my problem and do not require tags in yaml.

Some words about ForwardRef

If you define your dataclasses with forward reference (string with class name) marshmallow can be confused and didn’t parse your classes.

For example this configuration

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, ForwardRef


@dataclass
class BattleStationConfig:
    processor: ForwardRef("Processor")
    memory_gb: int
    led_color: Optional["Color"] = field(default=None, metadata={"by_value": True})

    @dataclass
    class Processor:
        core_count: int
        manufacturer: str


class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

Enter fullscreen mode Exit fullscreen mode

will lead to

marshmallow.exceptions.RegistryError: Class with name 'Processor' was not found. You may need to import the class.

Enter fullscreen mode Exit fullscreen mode

And if we move Processor class upper marshmallow will lost Color with the same error.
So keep your classes without ForwardRef if possible.

Code

All code available on GitHub repository.

原文链接:Strict YAML deserialization in Python with marshmallow

© 版权声明
THE END
喜欢就支持一下吧
点赞5 分享
评论 抢沙发

请登录后发表评论

    暂无评论内容