I tried the following example from the Panderas documentation:
import pandera as pa
# define schema
schema = pa.DataFrameSchema({
"column1": pa.Column(int, checks=pa.Check.le(10)),
"column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
"column3": pa.Column(str, checks=[
pa.Check.str_startswith("value_"),
# define custom checks as functions that take a series as input and
# outputs a boolean or boolean Series
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
schema.example(size=100)
I got multiple warnings of the form:
UserWarning: Column check doesn't have a defined strategy. Falling back to filtering drawn values based on the check definition. This can considerably slow down data-generation.
and it was around 10-30 seconds for the example to finish generating. This feels like something that should be much faster out of the box. Why is it so slow?