Pandera columns joint uniqueness

Question

I need to check a data frame for joint uniqueness of a set of columns. I found this code snippet in the documentation, but it applies only to DataFrameSchema.

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    columns={col: pa.Column(int) for col in ["a", "b", "c"]},
    unique=["a", "c"],
    report_duplicates="exclude_first",
)
df = pd.DataFrame.from_records([
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
])
schema.validate(df)


How would I implement this in a SchemaModel without resorting to data-frame-wide schema checks?

Is there a Field configuration for lambda checks at field level similar to this?

null_schema = pa.DataFrameSchema({
    "column1": pa.Column(float, pa.Check(lambda x: x > 0), nullable=True)
})

print(null_schema.validate(df))

score 0 · Answer 1 · answered Nov 07 '22 at 17:00

I believe the unique keyword is what you're looking for, but the example in the docs is not particularly helpful in pointing out the difference between the schema-level and column-level check.

Secondly, for this use case a DataFrameSchema and a SchemaModel are interchangeable. The example below adapts your example to a SchemaModel.

This check will pass, because you are checking the joint uniqueness of columns a, b, and c, and no (a, b, c) combination occurs more than once.

class TestSchema(pa.SchemaModel):

    a: pa.typing.Series[int]
    b: pa.typing.Series[int]
    c: pa.typing.Series[int]

    class Config:
        # joint uniqueness across all three columns
        unique = ["a", "b", "c"]

df = pd.DataFrame.from_records([
    {"a": 1, "b": 99, "c": 3},
    {"a": 2, "b": 99, "c": 2},
    {"a": 2, "b": 0, "c": 2},
])
TestSchema.validate(df)

If we change the unique keyword to include just a and c, the check will fail as the combination (2, 2) occurs twice.

class TestSchema(pa.SchemaModel):

    a: pa.typing.Series[int]
    b: pa.typing.Series[int]
    c: pa.typing.Series[int]

    class Config:
        # joint uniqueness across a and c only
        unique = ["a", "c"]

df = pd.DataFrame.from_records([
    {"a": 1, "b": 99, "c": 3},
    {"a": 2, "b": 99, "c": 2},
    {"a": 2, "b": 0, "c": 2},
])
TestSchema.validate(df)
