Is there an elegant way of using hypothesis to directly generate complex pandas data frames with internal row and column dependencies? Let's say I want columns such as:
[longitude][latitude][some-text-meta][some-numeric-meta][numeric-data][some-junk][numeric-data][…
Geographic coordinates can be individually picked at random, but sets must usually come from a general area (e.g. standard reprojections don't work if you have two points on opposite sides of the globe). It's easy to handle that by choosing an area with one strategy and columns of coordinates from inside that area with another. All good so far…
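For concreteness, the area-then-points part can be sketched roughly like this (a toy version with made-up names and bounds, not my real helpers):

from hypothesis import strategies as st
from hypothesis.extra import numpy as st_np
import numpy as np

@st.composite
def toy_lonlat_arr(draw, size=10):
    """Draw a small bounding box first, then `size` coordinates inside it."""
    lon_min = draw(st.floats(-180, 175))
    lat_min = draw(st.floats(-85, 80))
    lon_max = draw(st.floats(lon_min, lon_min + 5))
    lat_max = draw(st.floats(lat_min, lat_min + 5))
    lons = draw(st_np.arrays(np.float64, size, elements=st.floats(lon_min, lon_max)))
    lats = draw(st_np.arrays(np.float64, size, elements=st.floats(lat_min, lat_max)))
    return lons, lats

My composite that pulls the per-column strategies together then looks like this: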
# plaus_lonlat_arr, plaus_val_arr, plaus_area_arr, plaus_meta_arr and PlausibleData
# are defined elsewhere; ARR_LEN is a (min, max) bounds tuple.
@st.composite
def plaus_spamspam_arrs(
    draw,
    st_lonlat=plaus_lonlat_arr,
    st_values=plaus_val_arr,
    st_areas=plaus_area_arr,
    st_meta=plaus_meta_arr,
    bounds=ARR_LEN,
):
    """Returns plausible spamspamspam arrays"""
    size = draw(st.integers(*bounds))
    coords = draw(st_lonlat(size=size))
    values = draw(st_values(size=size))
    areas = draw(st_areas(size=size))
    meta = draw(st_meta(size=size))
    return PlausibleData(coords, values, areas, meta)
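Consumed in a test that's something like the following sketch, assuming PlausibleData is a namedtuple-style container (the asserted property is just illustrative):

from hypothesis import given

@given(arrs=plaus_spamspam_arrs())
def test_plausible_arrays_line_up(arrs):
    # Purely illustrative property: all pieces share the drawn size.
    assert len(arrs.coords) == len(arrs.values) == len(arrs.areas) == len(arrs.meta)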
plaus_spamspam_arrs above makes clean numpy arrays of coordinated single-value data. But the numeric data in the columns example (n columns interspersed with junk) can also have row-wise dependencies, such as needing to be normalised to some factor involving a row-wise sum and/or something else chosen dynamically at runtime.
I can generate all these bits separately, but I can't see how to stitch them into a single data frame without using a clumsy concat-based technique that, I presume, would disrupt draw-based shrinking. Moreover, I need a solution that adapts beyond what's above, so a hack likely won't get me very far…
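To make the row-wise part concrete, here is a toy version of the kind of constraint I mean, written on raw arrays (the names and the scaling rule are placeholders, not my real code):

from hypothesis import strategies as st
from hypothesis.extra import numpy as st_np
import numpy as np

@st.composite
def toy_normalised_block(draw, nrows=5, ncols=3):
    """Draw an (nrows, ncols) block, then rescale each row to a drawn total."""
    raw = draw(
        st_np.arrays(
            np.float64,
            (nrows, ncols),
            elements=st.floats(1e-3, 1e6, allow_nan=False, allow_infinity=False),
        )
    )
    total = draw(st.floats(0.5, 2.0))  # the "factor chosen at runtime"
    return raw / raw.sum(axis=1, keepdims=True) * total

It's that kind of coupling between columns of the same row that I can't see how to express cleanly.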
Maybe there's something with builds? I just can't quite see how to do it. Thanks for sharing if you know! A short example as inspiration would likely be enough.
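For what it's worth, this is roughly what I'm groping towards, reusing the toy strategies sketched above; I don't know whether hand-assembling the frame like this keeps draw-based shrinking intact, and it sidesteps the pandas extras entirely:

import pandas as pd
from hypothesis import strategies as st

@st.composite
def toy_spamspam_df(draw, bounds=(3, 8)):
    """Hand-assemble the separately drawn pieces into one DataFrame."""
    size = draw(st.integers(*bounds))
    lons, lats = draw(toy_lonlat_arr(size=size))
    values = draw(toy_normalised_block(nrows=size, ncols=2))
    meta = draw(st.lists(st.text(min_size=1), min_size=size, max_size=size))
    return pd.DataFrame(
        {
            "longitude": lons,
            "latitude": lats,
            "some-text-meta": meta,
            "value-a": values[:, 0],
            "value-b": values[:, 1],
        }
    )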
Update
I can generate columns roughly as follows:
import random

from hypothesis import strategies as st
from hypothesis.extra import pandas as st_pd

# plaus_box_geo, plaus_lon, plaus_lat, plaus_smp_key_elm, the plaus_df_cols_*
# helpers, plaus_df_idx, ARR_LEN and COL_LEN are defined elsewhere.
@st.composite
def plaus_df_inputs(
    draw, *, nrows=None, ncols=None, nrow_bounds=ARR_LEN, ncol_bounds=COL_LEN
):
    """Returns …"""
    box_lon, box_lat = draw(plaus_box_geo())
    ncols_jnk = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
    ncols_val = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
    keys_val = draw(plaus_smp_key_elm(size=ncols_val))
    nrows = draw(st.integers(*nrow_bounds)) if nrows is None else nrows
    cols = (
        plaus_df_cols_lonlat(lons=plaus_lon(box_lon), lats=plaus_lat(box_lat))
        + plaus_df_cols_meta()
        + plaus_df_cols_value(keys=keys_val)
        + draw(plaus_df_cols_junk(size=ncols_jnk))
    )
    random.shuffle(cols)  # vary the column order between examples
    return draw(st_pd.data_frames(cols, index=plaus_df_idx(size=nrows)))
where the sub-strategies are things like
@st.composite
def plaus_df_cols_junk(
    draw, *, size=1, names=plaus_meta(), dtypes=plaus_dtype(), unique=False
):
    """Returns strategy for list of columns of plausible junk data."""
    # Draw `size` distinct column names; the filter rules out names already taken.
    result = set()
    for _ in range(size):
        result.add(draw(names.filter(lambda name: name not in result)))
    return [
        st_pd.column(name=result.pop(), dtype=draw(dtypes), unique=unique)
        for _ in range(size)
    ]
What I need is something more elegant that incorporates the row-based dependencies.
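The best I can picture is drawing the frame as above and then patching the dependent columns in a second composite (a sketch, with a placeholder rule for spotting the value columns), but that feels like it just moves the clumsiness:

@st.composite
def plaus_df(draw):
    """Draw the raw frame, then impose the row-wise constraint afterwards."""
    df = draw(plaus_df_inputs())
    # Placeholder rule for spotting the value columns; my real keys differ.
    value_cols = [c for c in df.columns if str(c).startswith("val")]
    if value_cols:
        total = draw(st.floats(0.5, 2.0))  # factor "chosen dynamically at runtime"
        row_sums = df[value_cols].sum(axis=1).replace(0, 1)
        df[value_cols] = df[value_cols].div(row_sums, axis=0) * total
    return df

Is there a cleaner way to express that kind of constraint up front?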