I would like to be able to specify the dtype of the elements within the array as well. Spark has something like ArrayType(StringType()). Does an equivalent exist in the Python/pandas/NumPy/pandera world?
import pandas as pd
import pandera as pa
import numpy as np
# data to validate
df = pd.DataFrame({
    "column1": [1, 4],
    "column2": [-1.3, -1.4],
    "column3": [["value_1", "value1a"], ["value_2", "value_2a"]],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(np.str_),
})
validated_df = schema(df)
print(validated_df)
print(validated_df.dtypes)
What I get:
column1 column2 column3
0 1 -1.3 [value_1, value1a]
1 4 -1.4 [value_2, value_2a]
column1 int64
column2 float64
column3 object
dtype: object
What I'd like:
column1 column2 column3
0 1 -1.3 [value_1, value1a]
1 4 -1.4 [value_2, value_2a]
column1 int64
column2 float64
column3 ArrayType(StringType()) -> or the equivalent
dtype: object
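To frame the question: the closest I can see is keeping a plain object column and adding an element-wise Check so that every cell must be a list of strings. This is a runtime value check, not a declared array dtype (element_wise=True is a real pandera Check option, but the list-of-strings lambda is my own ad-hoc check):

# A possible workaround (a value check, not a parameterized array dtype):
# each cell must be a list whose items are all strings. The reported
# dtype for the column is still plain "object".
schema = pa.DataFrameSchema({
    "column3": pa.Column(
        object,
        checks=pa.Check(
            lambda cell: isinstance(cell, list) and all(isinstance(v, str) for v in cell),
            element_wise=True,
        ),
    ),
})

schema(df)  # passes, but validated_df.dtypes still shows "object" for column3

Is there a way to express the element dtype declaratively instead, the way Spark's ArrayType(StringType()) does?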