I'm trying to use Hypothesis to generate pandas dataframes where some column values are dependent on other column values. So far, I haven't been able to 'link' two columns.

This code snippet:

from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, column, range_indexes

def create_dataframe():
    id1 = st.integers().map(lambda x: x)
    id2 = st.shared(id1).map(lambda x: x * 2)
    df = data_frames(index = range_indexes(min_size=10, max_size=100), columns=[
        column(name='id1',  elements=id1, unique=True),
        column(name='id2', elements=id2),
    ])
    return df

Produces a dataframe with a static second column:

            id1         id2
0   1.170000e+02       110.0
1   3.600000e+01       110.0
2   2.876100e+04       110.0
3  -1.157600e+04       110.0
4   5.300000e+01       110.0
5   2.782100e+04       110.0
6   1.334500e+04       110.0
7  -3.100000e+01       110.0

1 Answer

I think that you're after the rows argument, which lets you compute some column values from others in the same row. For example, if we wanted a price and a sale_price column where the sale price has some discount applied:

from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes

def create_dataframe():
    full = st.floats(1, 1000)  # all items cost $1 to $1,000
    discounts = st.sampled_from([0, 0.1, 0.25, 0.5])
    rows = st.tuples(full, discounts).map(
        lambda xs: dict(price=xs[0], sale_price=xs[0] * (1-xs[1]))
    )
    return data_frames(
        index = range_indexes(min_size=10, max_size=100),
        rows = rows
    )
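
Drawing an example from this strategy (e.g. with create_dataframe().example()) gives something like:

         price  sale_price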
0   757.264509  378.632254
1   824.384095  618.288071
2   401.187339  300.890504
3   723.193610  650.874249
4   777.171038  699.453934
5   274.321034  205.740776
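
To check that the two columns really are linked row by row, the same row strategy can be fed to a property test. This is only a sketch: the names price_rows, priced_frames and the test function are made up for illustration, and the index is kept smaller than above so the test runs quickly.

```python
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes

# Same row strategy as above: the discount is drawn per row, so each
# sale_price is computed from the price in the same row.
price_rows = st.tuples(
    st.floats(1, 1000), st.sampled_from([0, 0.1, 0.25, 0.5])
).map(lambda xs: dict(price=xs[0], sale_price=xs[0] * (1 - xs[1])))

# Smaller index than above (an arbitrary choice) to keep the test quick.
priced_frames = data_frames(
    index=range_indexes(min_size=1, max_size=10), rows=price_rows
)


@given(priced_frames)
def test_sale_price_never_exceeds_price(df):
    # Holds only if each sale_price was derived from its own row's price.
    assert (df["sale_price"] <= df["price"]).all()
```

Under pytest the test is collected as usual; calling test_sale_price_never_exceeds_price() directly also runs it.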

So what went wrong with your example code? It looks like you imagined that the id1 and id2 strategies were defined relative to each other on a row-wise basis, but they're actually independent - and the shared() strategy draws a single value per generated example, so every row in the id2 column gets that same value.
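
For the columns in the question, one possible sketch is to generate id1 on its own (keeping unique=True) and then derive id2 from the finished frame with .map(). The function name create_linked_dataframe, the bounds on the integers, the explicit int64 dtype and the smaller index are all assumptions added here, not part of the original code; the bounds are there so that doubling cannot overflow int64.

```python
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes


def create_linked_dataframe():
    # Step 1: generate only the independent column. The bounds and the
    # int64 dtype are assumptions made for this sketch.
    base = data_frames(
        index=range_indexes(min_size=10, max_size=20),
        columns=[
            column(
                name="id1",
                dtype="int64",
                elements=st.integers(-10**6, 10**6),
                unique=True,
            )
        ],
    )
    # Step 2: compute the dependent column from the generated frame, so
    # id2 is tied to id1 on every row.
    return base.map(lambda df: df.assign(id2=df["id1"] * 2))


@given(create_linked_dataframe())
def test_id2_is_double_id1(df):
    assert (df["id2"] == df["id1"] * 2).all()
    assert df["id1"].is_unique
```

The rows argument would work here too, but this variant keeps the uniqueness constraint on id1, which a per-row strategy does not enforce by itself.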

Zac Hatfield-Dodds
  • Thanks. This answers my question. I'm starting to think a sort of LSTM test data generator might be more ideal. Something that can give us examples similar to production data, but very clearly isn't production data. – Jeff Harrison Nov 04 '21 at 17:12
  • Yeah, that makes sense if you're aiming for something like load-testing or a demo with synthetic data - Hypothesis really is focussed on exposing bugs and tends to come up with pretty weird examples to do so. – Zac Hatfield-Dodds Nov 05 '21 at 02:28
  • It was *almost* useful for smaller tests after I figured out how to build something. Unfortunately, it was just too slow to use if I need to generate a dataframe with lots of columns and more than 5-10 rows. – Jeff Harrison Nov 17 '21 at 18:08