
tl;dr:

In Python, is there a performant way to validate data integrity?

Synopsis:

I'm working on a framework that (A) ingests similar types of data from multiple API services that can be built out by developers, (B) allows users/developers to create integrations with the data pipeline, then (C) returns to interacting with the same or new API services.

Data flow: A -> B -> C

The 3 primary facilities the framework offers are:

  1. A shared context for a given API (allows for rate limiting facilities, account credentials, etc.)
  2. A shared context for the various data integration components (allows for one actor to generate useful output for a later actor in the pipeline).
  3. The actual data that gets passed in from the API (may be websocket or periodic/scheduled API responses).

Requirements:

Given this, I need 3 sets of data types, one per facility above. Within each set, the data has the following requirements:

  1. A single Python type to encompass any incoming JSON structure.
  2. The objects should be immutable.
  3. Needs to be performant. The goal is to keep API interactions as close to real time as possible.
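
To make requirements 1 and 2 concrete, here is a minimal sketch of the kind of behaviour I mean, using only the standard library. The `freeze` helper and the sample payload are made up for illustration and are not code from the framework; it also says nothing yet about performance.

```python
import json
from types import MappingProxyType


def freeze(value):
    """Recursively convert decoded JSON into read-only containers."""
    if isinstance(value, dict):
        # MappingProxyType is a read-only view; the dict behind it is not
        # handed out, so callers cannot mutate it through this object.
        return MappingProxyType({k: freeze(v) for k, v in value.items()})
    if isinstance(value, list):
        # JSON arrays become tuples, which cannot be modified in place.
        return tuple(freeze(v) for v in value)
    return value  # str, int, float, bool and None are already immutable


# Hypothetical payload; any JSON document goes through the same function.
payload = freeze(json.loads('{"symbol": "ABC", "bids": [[1.5, 10], [1.4, 7]]}'))
```

Reading works as with a normal dict (`payload["symbol"]`), while item assignment raises `TypeError`.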

Nice to have:

  1. The ability to define the types minimalistically in groupings or definition files.
  2. The attributes should be referenceable by name, key, etc.
  3. Should be able to serialize to/from JSON efficiently.
  4. Security is a concern here too -- we want to trust the data hasn't been manipulated, and in some cases we want to filter it of sensitive values.
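
For item 4, this is roughly what I have in mind, sketched with the standard library's `hmac` module. The secret, the list of sensitive keys, and the function names are all assumptions for the example; where the key would actually live and which fields count as sensitive are still open questions.

```python
import hashlib
import hmac
import json

SECRET = b"example-shared-secret"  # assumption: a key known only to the framework
SENSITIVE_KEYS = {"password", "api_key"}  # assumption: example field names


def sign(record):
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the record."""
    # sort_keys and fixed separators make the encoding deterministic.
    body = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def verify(record, signature):
    """Constant-time check that the record still matches its signature."""
    return hmac.compare_digest(sign(record), signature)


def redact(record):
    """Return a copy of the record without top-level sensitive keys."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_KEYS}


record = {"user": "alice", "api_key": "abc123", "balance": 10}
tag = sign(record)
```

An actor later in the pipeline could call `verify(record, tag)` before trusting the data, and `redact(record)` before passing it on. Note that `redact` only looks at top-level keys.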

Where it's at now:

This project started out using `namedtuple` types to translate dict-like JSON structures, but it now needs quite a bit of expansion. For example: while two different APIs have objects that are mostly similar, the actors on that data need it to be congruent. This requires a translation layer for each API, which means an additional set of object definitions for every API, and it is making my use of `namedtuple` far more complex and cumbersome.
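
A simplified sketch of that translation layer, with hypothetical API names and field names (`api_one`, `api_two`, `Trade`, and the keys are invented for the example, not taken from any real service):

```python
from collections import namedtuple

# Common shape the actors consume (field names are hypothetical).
Trade = namedtuple("Trade", ["symbol", "price", "quantity"])

# One mapping per API: common field name -> that API's key.
FIELD_MAPS = {
    "api_one": {"symbol": "sym", "price": "px", "quantity": "qty"},
    "api_two": {"symbol": "ticker", "price": "price", "quantity": "size"},
}


def translate(api_name, raw):
    """Build a Trade from one API's dict using that API's field mapping."""
    mapping = FIELD_MAPS[api_name]
    return Trade(**{field: raw[key] for field, key in mapping.items()})


a = translate("api_one", {"sym": "ABC", "px": 1.5, "qty": 10})
b = translate("api_two", {"ticker": "ABC", "price": 1.5, "size": 10})
```

Both calls produce the same `Trade`, so the actors see congruent data, but every new API adds another mapping (and, in the real code, often another set of definitions).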

The next step was for me to either create a metaclass that generates immutable objects based on pre-defined structures, or to subclass the `namedtuple` types and add several helper methods.
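
The second option would look something like the sketch below. `Quote`, its fields, and the helper names `from_json`/`to_json` are hypothetical; this only illustrates the approach and is not what the framework currently does.

```python
import json
from collections import namedtuple


class Quote(namedtuple("Quote", ["symbol", "bid", "ask"])):
    """Immutable record with JSON helpers (field names are hypothetical)."""

    __slots__ = ()  # no per-instance __dict__, so attributes stay read-only

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        # Ignore keys the type does not define; missing keys raise TypeError.
        return cls(**{f: data[f] for f in cls._fields if f in data})

    def to_json(self):
        return json.dumps(self._asdict())

    def __getitem__(self, key):
        # Allow lookup by field name as well as by position.
        if isinstance(key, str):
            if key not in self._fields:
                raise KeyError(key)
            return getattr(self, key)
        return super().__getitem__(key)


q = Quote.from_json('{"symbol": "ABC", "bid": 1.4, "ask": 1.5, "extra": true}')
```

This gives access by attribute (`q.bid`), by key (`q["bid"]`), and by index (`q[1]`), plus a JSON round trip, but each type still has to be written out by hand.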

With that, I thought I would reach out to the community and see if any of you had any ideas before I roll my own.

Bobby
  • Your case sounds complex, but I guess all you need is [marshmallow](https://marshmallow.readthedocs.io/). – georgexsh Oct 16 '17 at 19:45
  • Your suggestion sounds like it will go a bit beyond fitting the bill, and rolling my own sounds like a lot of work. I'll build out a `marshmallow` implementation on an alternate branch and answer my own question with the details. Thanks, @georgexsh! – Bobby Oct 17 '17 at 15:57
