Context

I implemented a Rust library for JSON Schema validation that operates on serde_json::Value instances. Now I want to use it from Python and am considering PyO3 as my primary choice for connecting the two. Python values should be converted to serde_json::Value when passed to the library, and serde_json::Value should be converted back to Python objects inside the validation errors returned by the Rust side.

One possible way is to implement serde::ser::Serialize for a newtype wrapper around pyo3::types::PyAny and then pass it to serde_json::to_value, but I am not sure how efficient that will be. What are the options, and what are the trade-offs?

On the Python side, I am mostly interested in built-in types that are serializable by json.dumps, without custom classes at the moment.
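To make the scope concrete, here is a minimal pure-Python sketch of the type dispatch such a conversion has to perform. It is only a model, not the Rust implementation: the function name `to_value` and the tagged-tuple representation are hypothetical, and the tags simply mirror the variant names of serde_json::Value (Null, Bool, Number, String, Array, Object). Rejecting non-string keys and non-finite floats is a choice made for this sketch; json.dumps is more lenient by default.

```python
import math


def to_value(obj):
    """Model a Python -> serde_json::Value conversion with tagged tuples.

    Hypothetical helper for illustration only.
    """
    if obj is None:
        return ("Null", None)
    # bool is a subclass of int in Python, so it must be checked first;
    # otherwise True would be converted to the number 1.
    if isinstance(obj, bool):
        return ("Bool", obj)
    if isinstance(obj, int):
        return ("Number", obj)
    if isinstance(obj, float):
        # Assumption of this sketch: NaN and infinities are rejected,
        # because JSON has no representation for them.
        if not math.isfinite(obj):
            raise ValueError(f"non-finite float is not valid JSON: {obj!r}")
        return ("Number", obj)
    if isinstance(obj, str):
        return ("String", obj)
    if isinstance(obj, (list, tuple)):
        return ("Array", [to_value(item) for item in obj])
    if isinstance(obj, dict):
        converted = {}
        for key, value in obj.items():
            # Assumption of this sketch: only string keys are accepted.
            if not isinstance(key, str):
                raise TypeError(f"dict key must be str, got {type(key).__name__}")
            converted[key] = to_value(value)
        return ("Object", converted)
    raise TypeError(f"unsupported type: {type(obj).__name__}")
```

The same ordering concern applies on the Rust side: because a Python bool is also an int, the boolean check has to come before the integer check.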

Rust-side example:

use jsonschema::JSONSchema;
use serde_json::Value;

/// Compile `schema` and check whether `instance` conforms to it.
/// Panics if `schema` is not a valid JSON Schema.
pub fn is_valid(schema: &Value, instance: &Value) -> bool {
    let compiled = JSONSchema::compile(schema, None).expect("Invalid schema");
    compiled.is_valid(instance)
}

I.e. there is a function that accepts two references to serde_json::Value and I want to expose it to Python. From the Python side there might be two use-cases:

  1. The instance is a JSON-encoded string:
import jsonschema_rs

assert jsonschema_rs.is_valid(
    {"minItems": 2}, 
    "[1, 2]"
)
  2. The instance is a Python structure (not a JSON-encoded string):
import jsonschema_rs

assert jsonschema_rs.is_valid(
    {"minItems": 2},
    [1, 2]
)

Possible use-cases

  1. Web app request/response structure validation.

    • When a request comes in, its body is validated as is, without first being parsed according to the schema.
    • When a response is returned, its structure is validated against the schema before being serialized to JSON.

In the future, both steps might be combined with Rust-powered JSON deserialization (on the request side) and serialization (on the response side).

  2. Property-based testing, as an extension for Hypothesis.

In this case, the faster the input is validated, the more test cases are generated. The current implementation uses a Python library under the hood, which is quite slow for the complex schemas I usually work with.

Update

I tried implementing the Serialize trait here and added a comparison with raw string input, as suggested by @Sven Marnach in the comments. Indeed, the raw string is the fastest option, but if it involves calling json.dumps in Python, it performs significantly worse than the variant with the trait.

Small objects & schema (100000 iterations):

String        : 1.31617
Trait         : 1.52797 (x1.16)
String + dumps: 2.77378 (x2.1)

Big objects & schema (100 iterations):

String        : 1.42146
Trait         : 3.70745 (x2.6)
String + dumps: 6.21213 (x4.37)

Benchmark code and test data.

Having a version for strings definitely makes sense, but calling json.dumps is quite expensive. I don't know if there are any better options for such scenarios.
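To see how much of the "String + dumps" figure comes from json.dumps alone, the serialization step can be timed in isolation, without the Rust extension. This is a rough sketch using only the standard library; the helper name `time_dumps` and the sample objects are made up for illustration and are not the benchmark data referenced above.

```python
import json
import timeit


def time_dumps(instance, iterations):
    """Return the total time in seconds spent in json.dumps for `iterations` calls."""
    return timeit.timeit(lambda: json.dumps(instance), number=iterations)


# Made-up sample data, standing in for a small and a big instance.
small = [1, 2]
big = {"items": [{"id": i, "tags": ["a", "b"]} for i in range(1000)]}

print("small:", time_dumps(small, 10000))
print("big:  ", time_dumps(big, 100))
```

Subtracting this number from the "String + dumps" timing gives an estimate of what is left for the Rust side, which helps decide whether a direct conversion is worth the extra code.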

Python version: 3.7

Rust version: 1.42.0

Dependencies:

  • serde_json = "1.0.48"
  • serde = "1.0.105"
  • jsonschema = "0.2.0"
Stranger6667
  • The simplest solution for a first version is to use Python's own JSON serialization in the `json` module, and only pass the serialized string to Rust. Chances are this is good enough. – Sven Marnach Mar 30 '20 at 10:14
  • Indeed, it works, but it implies `json.dumps` overhead on the Python side, which, I assume, will exceed the cost of converting the Python structure on the Rust side beyond some input size. But it is probably the best option when there is no JSON serialization on the Python side and the input is already available as a string. – Stranger6667 Mar 30 '20 at 10:47
  • Implementing a direct conversion for a reasonable subset of Python's built-in types to `serde_json::Value` probably isn't too bad, but it's definitely a non-trivial amount of work, so I'd only do it if profiling shows that it's necessary. – Sven Marnach Mar 30 '20 at 11:45
  • I updated the question with a link to my straightforward trait implementation (I am not sure if it is correct for all cases, but seems to be working) and a benchmark. Calling `json.dumps` is quite expensive, but raw string is the fastest option – Stranger6667 Mar 30 '20 at 14:17
  • Your implementation for the conversion directly to `Value` is a bit easier than I expected. Looks nice! – Sven Marnach Mar 30 '20 at 20:00
  • @SvenMarnach thanks, it looks like it works for a general case. Are there some areas where that implementation could be improved? However, probably this discussion should be on https://codereview.stackexchange.com/ – Stranger6667 Apr 01 '20 at 10:22
  • At a glance, the implementation looks fine to me. I'm not sure I'd support other key types than strings – JSON doesn't, after all, and serializing a Python object like `{None: 1, "null": 2}` could lead to rather unexpected results. I know Python's `json` module does the same, but that's just as questionable. :) Moreover, `is_normal()` seems a bit stronger than you need – `is_finite()` should be sufficient. – Sven Marnach Apr 01 '20 at 10:34
  • It appears you deleted the repo with the code you linked. Would you mind adding the code as part of the post instead? – Caesar Aug 22 '23 at 23:57
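The two remarks in the comments about non-string keys and about `is_normal()` versus `is_finite()` can be checked from Python with the standard library. The sketch below is illustrative only: `is_normal` is a hypothetical Python model of Rust's `f64::is_normal` (finite, non-zero, not subnormal), not a function from any library.

```python
import json
import math
import sys

# Non-string keys are coerced to strings by json.dumps, so the keys
# None and "null" end up as the same JSON key.
encoded = json.dumps({None: 1, "null": 2})
decoded = json.loads(encoded)  # duplicate keys: the last one wins


def is_normal(x):
    """Hypothetical model of Rust's f64::is_normal."""
    return math.isfinite(x) and abs(x) >= sys.float_info.min


# Zero and subnormal numbers are finite and perfectly valid in JSON,
# yet they are not "normal", so an is_normal check would reject them.
for value in (0.0, 5e-324, 1.5, float("inf"), float("nan")):
    print(repr(value), "finite:", math.isfinite(value), "normal:", is_normal(value))
```

With the default settings, json.dumps emits `NaN` and `Infinity` for non-finite floats, which are not valid JSON; passing `allow_nan=False` makes it raise ValueError instead.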
