Is there an elegant way of using hypothesis to directly generate complex pandas data frames with internal row and column dependencies? Let's say I want columns such as:

[longitude][latitude][some-text-meta][some-numeric-meta][numeric-data][some-junk][numeric-data][…

Geographic coordinates can be individually picked at random, but sets must usually come from a general area (e.g. standard reprojections don't work if you have two points on opposite sides of the globe). It's easy to handle that by choosing an area with one strategy and columns of coordinates from inside that area with another. All good so far…

@st.composite
def plaus_spamspam_arrs(
    draw,
    st_lonlat=plaus_lonlat_arr,
    st_values=plaus_val_arr,
    st_areas=plaus_area_arr,
    st_meta=plaus_meta_arr,
    bounds=ARR_LEN,
):
    """Returns plausible spamspamspam arrays"""
    size = draw(st.integers(*bounds))
    coords = draw(st_lonlat(size=size))
    values = draw(st_values(size=size))
    areas = draw(st_areas(size=size))
    meta = draw(st_meta(size=size))
    return PlausibleData(coords, values, areas, meta)

The snippet above makes clean numpy arrays of coordinated single-value data. But the numeric data in the columns example (n-columns interspersed with junk) can also have row-wise dependencies such as needing to be normalised to some factor involving a row-wise sum and/or something else chosen dynamically at runtime.

I can generate all these bits separately, but I can't see how to stitch them into a single data frame without using a clumsy concat-based technique that, I presume, would disrupt draw-based shrinking. Moreover, I need a solution that adapts beyond what's above, so a hack is unlikely to get me very far…

Maybe there's something with builds? I just can't quite see how to do it. Thanks for sharing if you know! A short example as inspiration would likely be enough.

Update

I can generate columns roughly as follows:

@st.composite
def plaus_df_inputs(
    draw, *, nrows=None, ncols=None, nrow_bounds=ARR_LEN, ncol_bounds=COL_LEN
):
    """Returns …"""
    box_lon, box_lat = draw(plaus_box_geo())
    ncols_jnk = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
    ncols_val = draw(st.integers(*ncol_bounds)) if ncols is None else ncols
    keys_val = draw(plaus_smp_key_elm(size=ncols_val))
    nrows = draw(st.integers(*nrow_bounds)) if nrows is None else nrows
    cols = (
        plaus_df_cols_lonlat(lons=plaus_lon(box_lon), lats=plaus_lat(box_lat))
        + plaus_df_cols_meta()
        + plaus_df_cols_value(keys=keys_val)
        + draw(plaus_df_cols_junk(size=ncols_jnk))
    )
    # Draw the ordering through Hypothesis so it can be replayed and shrunk.
    cols = draw(st.permutations(cols))
    return draw(st_pd.data_frames(cols, index=plaus_df_idx(size=nrows)))

where the sub-strategies are things like

@st.composite
def plaus_df_cols_junk(
    draw, *, size=1, names=plaus_meta(), dtypes=plaus_dtype(), unique=False
):
    """Returns strategy for list of columns of plausible junk data."""
    result = set()
    for _ in range(size):
        result.add(draw(names.filter(lambda name: name not in result)))
    return [
        st_pd.column(name=result.pop(), dtype=draw(dtypes), unique=unique)
        for _ in range(size)
    ]

What I need is something more elegant that incorporates the row-based dependencies.

curlew77

1 Answer

from hypothesis import strategies as st

@st.composite
def interval_sets(draw):
    # To create our interval sets, we'll draw from a strategy that shrinks well,
    # and then transform it into the format we want.  More specifically, we'll use
    # a single lists() strategy so that the shrinker can delete chunks atomically,
    # and then rearrange the floats that we draw as part of this.
    base_elems = st.tuples(
        # Different float bounds to ensure we get at least one valid start and end.
        st.text(),
        st.floats(0, 1, exclude_max=True),
        st.floats(0, 1, exclude_min=True),
    )
    base = draw(st.lists(base_elems, min_size=1, unique_by=lambda t: t[0]))
    nums = sorted(sum((t[1:] for t in base), start=()))  # arrange our endpoints
    return [
        {"name": name, "start": start, "end": end, "size": end - start}
        for (name, _, _), start, end in zip(base, nums[::2], nums[1::2])
    ]
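
The same draw-then-transform idea might carry over to the data frame in the question. What follows is only a minimal sketch under stated assumptions, not a solution tested against the original data: the column names `v0`, `v1`, `v2` and `junk`, the helper `normalised_frames`, and the choice of "divide each row by its sum" as the row-wise dependency are all hypothetical stand-ins. The frame is drawn once with `data_frames()`, so shrinking is left to Hypothesis, and the dependency is imposed afterwards.

```python
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra import pandas as st_pd

# Hypothetical column names, chosen only for this illustration.
VALUE_COLS = ["v0", "v1", "v2"]
JUNK_COL = "junk"


@st.composite
def normalised_frames(draw, max_rows=10):
    """Draw a frame whose value columns sum to 1 along each row."""
    columns = [
        st_pd.column(name, elements=st.floats(1.0, 100.0), dtype=float)
        for name in VALUE_COLS
    ]
    columns.append(st_pd.column(JUNK_COL, elements=st.integers(0, 9), dtype=int))
    # A single data_frames() draw, so Hypothesis shrinks rows and cells itself.
    df = draw(
        st_pd.data_frames(
            columns, index=st_pd.range_indexes(min_size=1, max_size=max_rows)
        )
    )
    # Draw the column order so that it is reproducible and shrinkable.
    order = draw(st.permutations(list(df.columns)))
    # Row-wise dependency applied after drawing: divide by each row's sum.
    row_sums = df[VALUE_COLS].sum(axis=1)
    df[VALUE_COLS] = df[VALUE_COLS].div(row_sums, axis=0)
    return df[order]


@settings(max_examples=25, deadline=None)
@given(normalised_frames())
def test_rows_sum_to_one(df):
    assert np.allclose(df[VALUE_COLS].sum(axis=1), 1.0)
```

Because the value elements are bounded below by 1.0, the row sums are never zero, which keeps the division safe. A normalisation factor chosen at runtime could presumably be drawn inside the composite in the same way as the column order.
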
Zac Hatfield-Dodds