Writing property-based tests
Instead of comparing produced outputs with expected ones, we can use
properties that the function must satisfy as the test criteria.
Let us consider a function cat which takes two str as input and returns
their concatenation. Using example-based tests, we would feed the
function with different strings and compare the outputs, i.e.
import pytest

@pytest.mark.parametrize('a, b, expected', [('Hello', 'World', 'HelloWorld')])
def test_cat_example(a, b, expected):
    assert cat(a, b) == expected
With property-based tests, however, we can use the property that the concatenated string must consist of exactly the two inputs in the given order, and nothing else. This yields the following listing:
@pytest.mark.parametrize('a, b', [('Hello', 'World')])
def test_cat_property(a, b):
    out = cat(a, b)
    assert a == out[:len(a)]
    assert b == out[len(a):]
Although we used an example here, no expected output is needed, so in
principle we could use randomly generated strings. Because property-based
tests are designed to run against a large number of inputs, smart ways of
choosing inputs and finding errors have been developed. All of these are
implemented in the package Hypothesis,
which can be installed by calling
conda install -c conda-forge hypothesis
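or alternatively via pip:
pip install hypothesis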
The package Hypothesis contains two key components. The first is the module
strategies, which contains a range of functions that return a search
strategy: an object with methods that describe how to generate and simplify
certain kinds of values. The second is the @given decorator, which takes
a test function and turns it into a parametrized one that, when called,
runs the test function over a wide range of matching data from the selected
strategy.
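As an illustration, the cat property test from above can be rewritten with Hypothesis so that the inputs are generated rather than hard-coded; a sketch, assuming cat is importable from the module under test:
from hypothesis import given, strategies as st

@given(st.text(), st.text())
def test_cat_property_hypothesis(a, b):
    out = cat(a, b)
    # the same property as before, now checked on generated strings
    assert a == out[:len(a)]
    assert b == out[len(a):]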
For multiplication, we could use the property that the product must be divisible by both inputs (at least for non-zero inputs), i.e.
assert r % a == 0
assert r % b == 0
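Spelled out as a complete Hypothesis test, this property might look like the following sketch (restricted to positive integers here, since r % 0 is undefined):
from hypothesis import given, strategies as st

@given(st.integers(min_value=1), st.integers(min_value=1))
def test_mul_divisible(a, b):
    r = mul(a, b)
    # a correct product is divisible by both of its factors
    assert r % a == 0
    assert r % b == 0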
But since we have a reference method, we can combine the best of both worlds: the intuitive output comparison of example-based testing and the smart algorithms and large number of inputs of property-based testing. The test function can thus be written as
from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_mul_property(a, b):
    assert mul(a, b) == a * b
Since this test is very similar to the test with randomly generated
examples, we expect it to pass too. When pytest is called, however,
the test fails. Hypothesis gives us the following output:
Falsifying example: test_mul_property(
a=-1, b=1,
)
Of course! We did not have negative numbers in mind while implementing this
algorithm! But why did we not discover this problem with our last test?
To generate random inputs, we had used the function randrange,
which, as we called it, never returns negative numbers. This fact is,
however, easily overlooked. By using predefined, well-thought-out strategies,
we can minimize human errors when designing tests.
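To make this concrete, assuming the earlier test drew its inputs with a call like randrange(100): randrange(n) samples from 0, 1, ..., n-1, so negative inputs can never occur. A quick sketch:
from random import randrange

# randrange(100) draws from {0, 1, ..., 99}, so even the minimum of
# many samples is never negative
assert min(randrange(100) for _ in range(10_000)) >= 0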
After writing this problem down to fix it later, we can continue testing
by excluding negative numbers. This can be achieved by using the min_value
argument of integers():
@given(st.integers(min_value=0), st.integers(min_value=0))
def test_mul_property_non_neg(a, b):
    assert mul(a, b) == a * b
Using only non-negative integers, the test passes. We can now test
the div function by writing
@given(st.integers(), st.integers())
def test_div_property(a, b):
    assert div(a, b) == a // b
This time, Hypothesis tells us
E hypothesis.errors.MultipleFailures: Hypothesis found 2 distinct failures.
with
Falsifying example: test_div_property(
a=0, b=-1,
)
Falsifying example: test_div_property(
a=0, b=0,
)
From this, we can exclude negative dividends and non-positive divisors by writing
@given(st.integers(min_value=0), st.integers(min_value=1))
def test_div_property_no_zero(a, b):
    assert div(a, b) == a // b
After this modification, the test passes. Hypothesis provides a large number
of strategies
and ways to adapt them,
with which very flexible tests can be created.
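For instance, every strategy can be adapted with methods such as map and filter; the following sketch builds a strategy that generates only even integers:
from hypothesis import given, strategies as st

# adapt the basic integer strategy so that only even numbers are drawn
evens = st.integers().map(lambda n: 2 * n)

@given(evens)
def test_doubled_is_even(n):
    assert n % 2 == 0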
One might notice that the counterexamples reported by Hypothesis are all very
"simple". This is no coincidence, but deliberate. The process behind it is called
shrinking
and tries to produce the most human-readable counterexample.
To illustrate this point, we shall implement a bad multiplication routine
which breaks for a > 10 and b > 20:
def bad_mul(a, b):
    if a > 10 and b > 20:
        return 0
    else:
        return a * b
We then test this bad multiplication with
@given(st.integers(), st.integers())
def test_bad_mul(a, b):
    assert bad_mul(a, b) == a * b
Hypothesis reports
Falsifying example: test_bad_mul(
a=11, b=21,
)
which is the smallest example that breaks the equality.
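Once such a counterexample has been found, it can be pinned down so that it is always re-tested, using the @example decorator; a sketch:
from hypothesis import example, given, strategies as st

@given(st.integers(), st.integers())
@example(a=11, b=21)  # always replay the shrunk counterexample as a regression test
def test_bad_mul_pinned(a, b):
    assert bad_mul(a, b) == a * b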