Generate a Pandas DataFrame with the Python Hypothesis library where one row is dependent on another

Question

I'm trying to use Hypothesis to generate pandas DataFrames where some column values are dependent on other column values. So far, I haven't been able to 'link' two columns.

This code snippet:

from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, column, range_indexes

def create_dataframe():
    id1 = st.integers().map(lambda x: x)
    id2 = st.shared(id1).map(lambda x: x * 2)
    df = data_frames(index = range_indexes(min_size=10, max_size=100), columns=[
        column(name='id1',  elements=id1, unique=True),
        column(name='id2', elements=id2),
    ])
    return df

Produces a dataframe with a static second column:

            id1  program_id
0   1.170000e+02       110.0
1   3.600000e+01       110.0
2   2.876100e+04       110.0
3  -1.157600e+04       110.0
4   5.300000e+01       110.0
5   2.782100e+04       110.0
6   1.334500e+04       110.0
7  -3.100000e+01       110.0

Zac Hatfield-Dodds · Accepted Answer · 2021-11-05T02:25:17.267

I think that you're after the rows argument, which lets you generate each row as a unit, so some column values can be computed from others. For example, if we wanted a price and a sale_price column where the sale price has some discount applied:

from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes

def create_dataframe():
    full = st.floats(1, 1000)  # all items cost $1 to $1,000
    discounts = st.sampled_from([0, 0.1, 0.25, 0.5])
    rows = st.tuples(full, discounts).map(
        lambda xs: dict(price=xs[0], sale_price=xs[0] * (1-xs[1]))
    )
    return data_frames(
        index = range_indexes(min_size=10, max_size=100),
        rows = rows
    )

         price  sale_price
0   757.264509  378.632254
1   824.384095  618.288071
2   401.187339  300.890504
3   723.193610  650.874249
4   777.171038  699.453934
5   274.321034  205.740776
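
A strategy like this is usually consumed by a property test. The following is only a sketch of how that might look: the helper name priced_frames, the smaller index bounds, and the settings values are assumptions chosen to keep generation fast, not part of the answer above.

```python
from hypothesis import HealthCheck, given, settings, strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes


def priced_frames():
    # Same row construction as above; the smaller index range is an assumption
    # made here to keep each generated example cheap.
    rows = st.tuples(st.floats(1, 1000), st.sampled_from([0, 0.1, 0.25, 0.5])).map(
        lambda xs: dict(price=xs[0], sale_price=xs[0] * (1 - xs[1]))
    )
    return data_frames(index=range_indexes(min_size=1, max_size=10), rows=rows)


# Health checks and the deadline are relaxed because DataFrame generation can be slow.
@settings(max_examples=20, deadline=None, suppress_health_check=list(HealthCheck))
@given(priced_frames())
def test_sale_price_never_exceeds_price(df):
    # Both columns come from the same draw, so the relationship holds row by row.
    assert list(df.columns) == ["price", "sale_price"]
    assert (df["sale_price"] <= df["price"]).all()
```

Calling test_sale_price_never_exceeds_price() directly (or collecting it with pytest) runs the check against the generated frames.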

So what went wrong with your example code? It looks like you imagined that the id1 and id2 strategies were defined relative to each other on a row-wise basis, but they're actually independent: each column draws its elements from its own strategy. On top of that, the shared() strategy draws one value per generated example and reuses it for every row in the column, which is why the second column is constant.
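
Applied to the id1/id2 columns from the question, the same idea might look like the sketch below. The function names, the integer bounds (chosen so that x * 2 still fits in int64), and the index sizes are assumptions. Because rows cannot enforce uniqueness across rows, the second function shows one possible way to keep id1 unique: draw the whole column as a unique list and build the frame from it.

```python
import pandas as pd
from hypothesis import HealthCheck, given, settings, strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes

# Bounded so that doubling never overflows int64 (the bounds are an assumption).
ids = st.integers(min_value=-2**31, max_value=2**31 - 1)


def linked_rows():
    """Each row is built from a single draw, so id2 is always id1 * 2."""
    rows = ids.map(lambda x: {"id1": x, "id2": x * 2})
    return data_frames(index=range_indexes(min_size=1, max_size=20), rows=rows)


def linked_unique():
    """Variant that also keeps id1 unique by drawing the whole column first."""
    return st.lists(ids, min_size=1, max_size=20, unique=True).map(
        lambda xs: pd.DataFrame({"id1": xs, "id2": [x * 2 for x in xs]})
    )


@settings(max_examples=25, deadline=None, suppress_health_check=list(HealthCheck))
@given(linked_rows(), linked_unique())
def check_linked(df, unique_df):
    # The dependency holds in every row of both variants.
    assert (df["id2"] == df["id1"] * 2).all()
    assert (unique_df["id2"] == unique_df["id1"] * 2).all()
    # Only the list-based variant guarantees unique ids.
    assert unique_df["id1"].is_unique
```

The list-based variant skips data_frames entirely, so it gives up the column and index options of hypothesis.extra.pandas in exchange for the uniqueness guarantee.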

Thanks. This answers my question. I'm starting to think a sort of LSTM test data generator might be more ideal. Something that can give us examples similar to production data, but very clearly isn't production data. — Jeff Harrison, Nov 04 '21 at 17:12
Yeah, that makes sense if you're aiming for something like load-testing or a demo with synthetic data - Hypothesis really is focussed on exposing bugs and tends to come up with pretty weird examples to do so. — Zac Hatfield-Dodds, Nov 05 '21 at 02:28
It was *almost* useful for smaller tests after I figured out how to build something. Unfortunately, it was just too slow to use if I need to generate a dataframe with lots of columns and more than 5-10 rows. — Jeff Harrison, Nov 17 '21 at 18:08
