
I am using Hypothesis, specifically its NumPy extension, to write tests while upgrading a TensorFlow model.

This involves generating a number of tensors that share dimensions, such as batch size. For example, here is what I would like to do:

from hypothesis import given
from hypothesis.strategies import integers
from hypothesis.extra.numpy import arrays
from numpy import float32

batch_size = integers(min_value=1, max_value=512)
hidden_state_size = integers(min_value=1, max_value=10_000)

@given(
    arrays(dtype=float32, shape=(batch_size, integers(min_value=1, max_value=10_000))),
    arrays(dtype=float32, shape=(batch_size, hidden_state_size)),
    arrays(dtype=float32, shape=(batch_size, hidden_state_size, integers(min_value=1, max_value=10_000))),
)
def test_code(input_array, initial_state, encoder_state):
    ...

but obviously this doesn't work, because shape requires ints, not `integers` strategies.

I could use a @composite decorated function to generate all the necessary tensors and unpack them within the test, but that requires a lot of boilerplate which is difficult to read and slow to develop with.
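
Roughly, the @composite version I have in mind looks like this (names such as model_tensors are just illustrative):

from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays
from numpy import float32

@st.composite
def model_tensors(draw):
    # draw the shared dimensions once so every tensor agrees on them
    batch_size = draw(st.integers(min_value=1, max_value=512))
    hidden_state_size = draw(st.integers(min_value=1, max_value=10_000))
    input_size = draw(st.integers(min_value=1, max_value=10_000))
    encoder_size = draw(st.integers(min_value=1, max_value=10_000))
    input_array = draw(arrays(dtype=float32, shape=(batch_size, input_size)))
    initial_state = draw(arrays(dtype=float32, shape=(batch_size, hidden_state_size)))
    encoder_state = draw(arrays(dtype=float32, shape=(batch_size, hidden_state_size, encoder_size)))
    return input_array, initial_state, encoder_state

@given(model_tensors())
def test_code(tensors):
    input_array, initial_state, encoder_state = tensors
    ...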

I've also looked at the shared strategy, but I couldn't get it working.

Any suggestions would be appreciated because I think this would be a great tool for hardening NN code.

JMinton

2 Answers


You might like using the data strategy. If you want to share something, you can generate it in the top-level @given(...), and then use it multiple times inside the test method body. The data() strategy generates a data object, which can "draw" from Hypothesis strategies like st.integers() or nps.arrays() via data.draw(<your strategy>).

from hypothesis import given, strategies as st
from hypothesis.extra import numpy as nps
import numpy as np

@given(ndim=st.integers(min_value=1, max_value=32), data=st.data())
def test_code(ndim, data):
    # every array drawn from this strategy has exactly `ndim` dimensions,
    # though individual dimension sizes can differ between draws
    strategy = nps.arrays(
        dtype=np.float32,
        shape=nps.array_shapes(min_dims=ndim, max_dims=ndim),
    )
    array1 = data.draw(strategy)
    array2 = data.draw(strategy)
    ...

Note that the shape kwarg takes either a Hypothesis strategy (such as nps.array_shapes()) or a specific shape (e.g. 10, (10,), (3, 3, 3), etc.). Also note that NumPy arrays can't have more than 32 dimensions.
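
For example, both of these are valid (a minimal illustration; the dtype is arbitrary):

nps.arrays(dtype=np.float32, shape=(3, 3))                        # specific shape
nps.arrays(dtype=np.float32, shape=nps.array_shapes(max_dims=2))  # shape strategy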

honno
    That's definitely tidier than what I've been doing. How does `data` work with `@example`? Edit: from the docs "The downside of this power is that data() is incompatible with explicit @example(...)s" – JMinton Dec 03 '21 at 02:04
  • Yeah. It can be a tricky balance when deciding between data vs shared and whether or not to use examples. From my experience it's best to 1) rely on data, as shared gets confusing fast, and 2) don't use examples unless you're, like, CPython itself and need 110% reliability; Hypothesis may generate randomly, but it shrinks quite consistently. – honno Dec 03 '21 at 10:39

The trick is to use shared *and* define your shapes with the tuples strategy: a tuple of strategies is not a valid shape argument, but a strategy for tuples-of-ints is. That looks like:

from hypothesis import given
from hypothesis.strategies import integers, shared, tuples
from hypothesis.extra.numpy import arrays
from numpy import float32

batch_size = shared(integers(min_value=1, max_value=512))
hidden_state_size = shared(integers(min_value=1, max_value=10_000))

@given(
    arrays(dtype=float32, shape=tuples(batch_size, integers(min_value=1, max_value=10_000))),
    arrays(dtype=float32, shape=tuples(batch_size, hidden_state_size)),
    arrays(dtype=float32, shape=tuples(batch_size, hidden_state_size, integers(min_value=1, max_value=10_000))),
)
def test_code(input_array, initial_state, encoder_state):
    ...

Separately, I would also suggest reducing the maximum sizes considerably - running (many) more tests on smaller arrays is likely to catch more bugs in the same length of time. But check --hypothesis-show-statistics and profile before blindly applying performance advice!
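
For instance, something like the following (the bounds are illustrative, not a recommendation for your model, and test_model.py is a placeholder filename):

batch_size = shared(integers(min_value=1, max_value=8))
hidden_state_size = shared(integers(min_value=1, max_value=32))

pytest test_model.py --hypothesis-show-statistics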

Zac Hatfield-Dodds