
I'm trying to use Hypothesis to generate a set of dataframes that I'll merge together. I want each individual column to be allowed to have NaN values, and I want to allow Hypothesis to generate some wacky examples.

But I mostly want to focus on examples where there is at least one row in each dataframe with actual values - and in particular, I'd like to be able to generate dataframes with some information shared between corresponding columns, such that a merged dataframe is not empty. (E.g. I want some values from 'store' in store.csv to overlap with values from 'store' in train.csv.)
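For context, the kind of overlap I'm after could be sketched like this: draw one shared pool of store IDs first, then sample both dataframes' 'store' columns from it, so a merge on 'store' has a chance of matching. (This is a hypothetical sketch; the strategy name, column name, and ID range are made up.)

```python
import hypothesis.strategies as st
from hypothesis.extra.pandas import data_frames, column, range_indexes


@st.composite
def linked_frames(draw):
    # Draw one pool of store IDs, then sample both frames' 'store'
    # columns from that same pool so merging on 'store' can match.
    pool = draw(st.lists(
        st.integers(min_value=1, max_value=20), min_size=1, unique=True))
    store_df = draw(data_frames(
        [column("store", elements=st.sampled_from(pool))],
        index=range_indexes(min_size=1, max_size=10)))
    train_df = draw(data_frames(
        [column("store", elements=st.sampled_from(pool))],
        index=range_indexes(min_size=1, max_size=10)))
    return store_df, train_df
```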

I have some example code here that generates NaN values and wacky examples all over the place, but most of the generated examples contain very few non-NaN values. (A dataframe strategy starts on line 57.)

Any suggestions for how to create slightly more 'realistic' examples? Thanks!

c74

2 Answers


Your solution looks fine to me, but here are two more tactics that might help:

  1. Use the fill=st.nothing() argument to columns and series to disable the filling behaviour. This makes the entries dense instead of sparse(ish), so there's a substantial runtime cost but a noticeable change in example density. Alternatively, fill=st.floats(allow_nan=False) might be cheaper and still work!

  2. Use a .filter(...) on the strategy to reject dataframes without any NaN-free rows. A typical rule of thumb is to avoid using .filter when it would reject more than half the examples, and to start looking for an alternative once it rejects more than a tenth; but either way this combines easily enough with the first point.
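Both tactics together might look something like this (a minimal sketch; the column names and sizes are illustrative, not taken from the question):

```python
import hypothesis.strategies as st
from hypothesis.extra.pandas import data_frames, column, range_indexes

# fill=st.nothing() forces every entry to be drawn from `elements`
# (dense but slower), and the .filter rejects any dataframe that
# lacks at least one fully NaN-free row.
floats = st.floats(allow_nan=True, allow_infinity=False)
frames = data_frames(
    [
        column("a", elements=floats, fill=st.nothing()),
        column("b", elements=floats, fill=st.nothing()),
    ],
    index=range_indexes(min_size=1, max_size=5),
).filter(lambda df: df.notna().all(axis=1).any())
```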

Zac Hatfield-Dodds

Answering my own question, but I'd love to hear other answers.

I ended up doing two things:

1) Requiring that the end user not give garbage files. (Just because we have a magical property-generation framework doesn't absolve us of the responsibility of having common sense, which I forgot.)

2) Testing for things that are reasonable accidents but not absolute garbage, by requiring that each dataframe have at least one row with no NaNs. With that requirement, I generate the non-NaN dataframe, and then add some NaNs afterward.

From there, ipython and .example() make it easy to see what's going on.

Example code below (google_files and google_weeks are custom strategies created earlier):

# Create dataframes from the strategies above.
# We'll create dataframes with all non-NaN values, then add NaN-bearing
# rows after the fact.
# This runs inside an @st.composite function, so `draw` is in scope;
# it assumes these imports at module level:
#   import numpy as np
#   from hypothesis.strategies import integers
#   from hypothesis.extra.pandas import data_frames, column, range_indexes
df = draw(data_frames([
    column('file', elements=google_files),
    column('week', elements=google_weeks),
    column('trend',
           elements=integers(min_value=0, max_value=100))],
    index=range_indexes(min_size=1, max_size=100)))

# Add the NaNs. (With other dataframes, this ended up getting written
# into a function.) Assigning to .loc with a new label appends a row;
# the index runs 0..len-1, so the new labels start at len(df).
rows = len(df)
df.loc[rows] = [np.nan, '2014-01-05 - 2014-01-11', 42]
df.loc[rows + 1] = ['DE_BE', np.nan, 42]
df.loc[rows + 2] = ['DE_BE', '2014-01-05 - 2014-01-11', np.nan]
df.loc[rows + 3] = [np.nan, np.nan, np.nan]
c74