I would like to optimize data generation speed for my unit tests. It seems strategies like from_regex and dictionaries take a long time to generate examples.

Below is a sample I wrote to benchmark example generation:

from hypothesis import given
from hypothesis.strategies import (
    booleans,
    composite,
    dictionaries,
    from_regex,
    integers,
    lists,
    one_of,
    text,
)

param_names = from_regex(r"[a-z][a-zA-Z0-9]*(_[a-zA-Z0-9]+)*", fullmatch=True)
param_values = one_of(booleans(), integers(), text(), lists(text()))


@composite
def composite_params_dicts(draw, min_size=0):
    """Provides a dictionary of parameters."""
    params = draw(
        dictionaries(keys=param_names, values=param_values, min_size=min_size)
    )

    return params


params_dicts = dictionaries(keys=param_names, values=param_values)


@given(params=params_dicts)
def test_standard(params):
    assert params is not None


@given(params=composite_params_dicts(min_size=1))
def test_composite(params):
    assert len(params) > 0


@given(integer=integers(min_value=1))
def test_integer(integer):
    assert integer > 0

The test_integer() test is used as a reference, since it relies on a simple strategy.

Because some long-running tests in one of my projects use regexes to generate parameter names and dictionaries to generate those parameters, I added two tests using those strategies.

test_composite() uses a composite strategy which takes an optional argument. test_standard() uses a similar strategy, except that it is not composite.

Below are the test results:

> pytest hypothesis-sandbox/test_dicts.py --hypothesis-show-statistics
============================ test session starts =============================
platform linux -- Python 3.7.3, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/damien/Sandbox/hypothesis/.hypothesis/examples')
rootdir: /home/damien/Sandbox/hypothesis
plugins: hypothesis-4.28.2
collected 3 items                                                                                                                                                       

hypothesis-sandbox/test_dicts.py ...                                    [100%]
=========================== Hypothesis Statistics ============================

hypothesis-sandbox/test_dicts.py::test_standard:

  - 100 passing examples, 0 failing examples, 1 invalid examples
  - Typical runtimes: 0-35 ms
  - Fraction of time spent in data generation: ~ 98%
  - Stopped because settings.max_examples=100
  - Events:
    * 2.97%, Retried draw from TupleStrategy((<hypothesis._strategies.CompositeStrategy object at 0x7f72108b9630>,
    one_of(booleans(), integers(), text(), lists(elements=text()))))
    .filter(lambda val: all(key(val) not in seen 
    for (key, seen) in zip(self.keys, seen_sets))) to satisfy filter

hypothesis-sandbox/test_dicts.py::test_composite:

  - 100 passing examples, 0 failing examples, 1 invalid examples
  - Typical runtimes: 0-47 ms
  - Fraction of time spent in data generation: ~ 98%
  - Stopped because settings.max_examples=100

hypothesis-sandbox/test_dicts.py::test_integer:

  - 100 passing examples, 0 failing examples, 0 invalid examples
  - Typical runtimes: < 1ms
  - Fraction of time spent in data generation: ~ 57%
  - Stopped because settings.max_examples=100

========================== 3 passed in 3.17 seconds ==========================

Are composite strategies slower?

How can I optimize a custom strategy?

marc_s
Damien Flament

1 Answer

Composite strategies are as fast as any other way of generating the same data, but people tend to use them for large and complex inputs (which are slower to generate than small and simple inputs).

Strategy optimisation tips reduce to "don't do slow things", as there's no way to go faster.

  • Minimise use of .filter(...) as retries are slower than no retries.
  • Cap sizes, especially of nested things.
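
To illustrate the first tip, here is a minimal sketch contrasting a filtered strategy with one that builds valid values directly. The names even_by_filter, even_by_map and test_even_by_map are hypothetical and only serve the illustration; they are not taken from the question.

```python
from hypothesis import given
from hypothesis.strategies import integers

# Rejection sampling: draws that fail the predicate are thrown away
# and retried, which costs extra generation time.
even_by_filter = integers().filter(lambda n: n % 2 == 0)

# Construction: every drawn integer is mapped to an even one,
# so no draw is ever rejected.
even_by_map = integers().map(lambda n: n * 2)


@given(n=even_by_map)
def test_even_by_map(n):
    assert n % 2 == 0
```

Both strategies only produce even integers; the mapped version simply never has to retry a draw.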

So for your example, it might be faster if you capped the size of the lists, but otherwise it's just slow (ish!) because you're generating a lot of data but not doing much with it.
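
As a rough sketch of what capping could look like for the strategies in the question: the limits MAX_ITEMS and MAX_TEXT are arbitrary assumptions chosen for illustration, and capped_text, capped_params_dicts and test_capped are hypothetical names. The regex for parameter names is left as in the question, so the length of the keys is not capped here.

```python
from hypothesis import given
from hypothesis.strategies import (
    booleans,
    dictionaries,
    from_regex,
    integers,
    lists,
    one_of,
    text,
)

# Assumed limits, picked only for this example.
MAX_ITEMS = 5
MAX_TEXT = 20

param_names = from_regex(r"[a-z][a-zA-Z0-9]*(_[a-zA-Z0-9]+)*", fullmatch=True)

# Cap the length of every generated string.
capped_text = text(max_size=MAX_TEXT)

# Cap the nested list as well as the strings inside it.
param_values = one_of(
    booleans(),
    integers(),
    capped_text,
    lists(capped_text, max_size=MAX_ITEMS),
)

# Cap the number of entries in the dictionary itself.
capped_params_dicts = dictionaries(
    keys=param_names, values=param_values, max_size=MAX_ITEMS
)


@given(params=capped_params_dicts)
def test_capped(params):
    assert len(params) <= MAX_ITEMS
```

Running this with --hypothesis-show-statistics next to the original tests should show whether the caps make a measurable difference on your machine.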

Zac Hatfield-Dodds