How do you Unit Test Python DataFrames

Question

How do I unit test Python dataframes?

I have functions whose input and output are DataFrames; almost every function I have does this. If I want to unit test them, what is the best way to do it? It seems like a lot of effort to create a new DataFrame (with values populated) for every function.

Are there any materials you can refer me to? Should you write unit tests for these functions?

sechilds · Accepted Answer · 2021-06-06T11:55:04.900

52

While Pandas' test functions are primarily used for internal testing, NumPy includes a very useful set of testing functions that are documented here: NumPy Test Support.

These functions compare NumPy arrays, but you can get the array that underlies a Pandas DataFrame using the values property (or the to_numpy() method in pandas 0.24 and later). You can define a simple DataFrame and compare what your function returns to what you expect.
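
A minimal sketch of this kind of comparison. The function `add_total` and its columns are hypothetical stand-ins for a function under test, not something from the question:

```python
import numpy as np
import pandas as pd


def add_total(df):
    """Hypothetical function under test: adds a row-wise total column."""
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out


def test_add_total():
    input_df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
    expected = np.array([[1, 10, 11], [2, 20, 22]])
    # .values exposes the NumPy array underlying the DataFrame
    np.testing.assert_array_equal(add_total(input_df).values, expected)
```

Note that comparing the underlying arrays ignores column names and the index, so those need a separate check if they matter. For floating-point results, `np.testing.assert_allclose` is usually a better fit than an exact comparison.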

One technique you can use is to define one set of test data for a number of functions. That way, you can use Pytest Fixtures to define that DataFrame once, and use it in multiple tests.
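
As a rough sketch of that idea (the fixture name `sample_df` and its contents are made up for illustration):

```python
import pandas as pd
import pytest


@pytest.fixture
def sample_df():
    """Shared input data, rebuilt fresh for every test that requests it."""
    return pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})


def test_row_count(sample_df):
    # pytest injects the fixture's return value by matching the argument name
    assert len(sample_df) == 3


def test_column_sum(sample_df):
    assert sample_df["a"].sum() == 6
```

Because the fixture function runs again for each test by default, a test that mutates the DataFrame does not affect the others.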

In terms of resources, I found this article on Testing with NumPy and Pandas to be very useful. I also did a short presentation about data analysis testing at PyCon Canada 2016: Automate Your Data Analysis Testing.

edited Jun 06 '21 at 11:55

answered Jan 25 '17 at 17:10

sechilds


Just a quick update, the Pytest Fixtures link is broken. Maybe they moved the page. Here's their ["About Fixtures"](https://doc.pytest.org/en/latest/explanation/fixtures.html) page! – Ryan Streur Apr 29 '21 at 19:09

PyData London 2019, excellent vid on the topic. https://www.youtube.com/watch?v=WTj6T0QdHHM&t=4432s – Cam Aug 29 '21 at 20:55

score 33 · Answer 2 · answered Jun 25 '18 at 06:48


You can use the pandas testing functions:

They give you more flexibility to compare your result with the expected result in different ways.

For example:

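import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'a': [6, 7, 8, 9, 10]})

# check_names=False ignores the Series name ('a' vs. None)
expected_res = pd.Series([7, 9, 11, 13, 15])
pd.testing.assert_series_equal(df1['a'] + df2['a'], expected_res, check_names=False)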

For more details, refer to this link.
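
Since the question is about functions that return whole DataFrames, `pd.testing.assert_frame_equal` is probably the closer match. A minimal sketch, where `double_values` is a hypothetical function under test:

```python
import pandas as pd


def double_values(df):
    """Hypothetical function under test: doubles every value."""
    return df * 2


def test_double_values():
    input_df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
    expected = pd.DataFrame({"b": [1.0, 3.0], "a": [2, 4]})
    # check_like=True ignores the order of rows and columns; labels must still match
    pd.testing.assert_frame_equal(double_values(input_df), expected, check_like=True)
```

Other keyword arguments such as `check_dtype` and `check_names` let you relax the comparison further when exact dtypes or names are not part of the contract.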

answered Jun 25 '18 at 06:48

Mohamed Thasin ah



**This is the way.** The [accepted answer](https://stackoverflow.com/a/41857520/2809027) is both useless and obsolete, like usual. Pandas test functions published by the public `pandas.testing` subpackage absolutely *are* intended for external testing in downstream test suites. That's what the `testing` means. _\*facepalm\*_ – Cecil Curry Mar 30 '23 at 07:50

score 6 · Answer 3 · answered Feb 18 '21 at 05:53

If you are using pytest, PandasSnapshot will be useful.

# use with pytest
import pandas as pd
from snapshottest_ext.dataframe import PandasSnapshot

def test_format(snapshot):
    df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                      columns=['col 1', 'col 2'])
    snapshot.assert_match(PandasSnapshot(df))

One big drawback is that the snapshot is no longer human-readable. (Storing the content as CSV is more readable, but it is problematic.)

PS: I am the author of pytest snapshot extension.

score 4 · Answer 4 · edited Apr 06 '22 at 02:12


I don't think it's hard to create small DataFrames for unit testing?

import pandas as pd
from nose.tools import assert_dict_equal

input_df = pd.DataFrame.from_dict({
    'field_1': [some, values],
    'field_2': [other, values]
})
expected = {
    'result': [...]
}
assert_dict_equal(expected, my_func(input_df).to_dict(), "oops, there's a bug...")

edited Apr 06 '22 at 02:12

ggorlen


answered Jan 25 '17 at 13:29

rtkaleta


My result is a dataframe too. So i should create another dataframe? in this case i cant use assert_dict_equal? – CodeGeek123 Jan 25 '17 at 14:11

Yes, that's why I called `to_dict()` on the result from your function - so I get a `dict` that can be compared to `expected` with the `nose` method suggested. – rtkaleta Jan 25 '17 at 14:37
@rkaleta : yes it does.. However, my test fail with errors like AssertionError: {'ins[20 chars]on': ['TF000141124', 'TF000141124', 'TF00014[599 chars]0.0]} != {'ins[20 chars]on': {0: 'TF000141124', 1: 'TF000141124', 2:[716 chars]0.0}} Diff is 3078 characters long. Set self.maxDiff to None to see it. : oops, there's a bug... – CodeGeek123 Jan 26 '17 at 11:37
@CodeGeek123 The above code for `expected` was just an example - you have to modify it to your needs. Looks like you have a mismatch in the _expected_ vs. _actual_ DataFrame structure. It looks like the function under test returns a DataFrame that is 3 rows x 1 column? Then _expected_ should be more like `expected = {<column>: {<index>: <value>, <index>: <value>, ...}}` – rtkaleta Jan 26 '17 at 14:33
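
The mismatch in the comments above comes from the default shape returned by `to_dict()`. A small sketch of the two shapes, using a hypothetical `my_func` and made-up values, with plain `assert` in place of `nose`:

```python
import pandas as pd


def my_func(df):
    """Hypothetical stand-in for the function under test."""
    return pd.DataFrame({"result": df["field_1"] + df["field_2"]})


input_df = pd.DataFrame({"field_1": [1, 2, 3], "field_2": [10, 20, 30]})

# Default orient: {column: {index: value}}
assert my_func(input_df).to_dict() == {"result": {0: 11, 1: 22, 2: 33}}

# orient="list": {column: [values]}, which matches a dict of plain lists
assert my_func(input_df).to_dict(orient="list") == {"result": [11, 22, 33]}
```

With `orient="list"` the index is dropped from the comparison, which is convenient when only the values matter.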

score 4 · Answer 5 · answered Sep 25 '20 at 20:12

You could use snapshottest and do something like this:

def test_something_works(snapshot): # snapshot is a pytest fixture from snapshottest
    data_frame = calc_something_and_return_pandas_dataframe()
    snapshot.assert_match(data_frame.to_csv(index=False), 'some_module_level_unique_name_for_the_snapshot')

This will create a snapshots folder containing a file with the CSV output, which you can update with --snapshot-update when your code changes.

It works by comparing the CSV output of data_frame to what is saved on disk.

Might be worth mentioning that your snapshots should be checked in to source control.
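
To illustrate the general idea behind snapshot testing, here is a hand-rolled sketch. It is not how snapshottest is implemented, and `assert_matches_snapshot` is a hypothetical helper written only for this example:

```python
import tempfile
from pathlib import Path

import pandas as pd


def assert_matches_snapshot(df, path, update=False):
    """Hypothetical helper: compare df's CSV text to a stored snapshot file."""
    actual = df.to_csv(index=False).encode("utf-8")
    if update or not path.exists():
        # First run or explicit update: record the current output
        path.write_bytes(actual)
        return
    assert actual == path.read_bytes(), f"{path.name} differs from the stored snapshot"


with tempfile.TemporaryDirectory() as tmp:
    snapshot_path = Path(tmp) / "frame.csv"
    df = pd.DataFrame({"col 1": ["a", "c"], "col 2": ["b", "d"]})
    assert_matches_snapshot(df, snapshot_path)  # records the snapshot
    assert_matches_snapshot(df, snapshot_path)  # passes: output is unchanged
```

In a real project the snapshot file would live next to the tests and be committed, rather than in a temporary directory.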

score 3 · Answer 6 · answered Jan 25 '17 at 13:22


I would suggest writing the values as CSV in docstrings (or separate files if they're large) and parsing them using pd.read_csv(). You can parse the expected output from CSV too, and compare, or else use df.to_csv() to write a CSV out and diff it.
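
A minimal sketch of this approach, with the CSV held in module-level strings. The function `add_passed` and the data are invented for illustration:

```python
import io

import pandas as pd

INPUT_CSV = """name,score
alice,10
bob,20
"""

EXPECTED_CSV = """name,score,passed
alice,10,False
bob,20,True
"""


def add_passed(df):
    """Hypothetical function under test: flags scores of 15 or more."""
    return df.assign(passed=df["score"] >= 15)


def test_add_passed():
    # io.StringIO lets read_csv parse an in-memory string as if it were a file
    input_df = pd.read_csv(io.StringIO(INPUT_CSV))
    expected = pd.read_csv(io.StringIO(EXPECTED_CSV))
    pd.testing.assert_frame_equal(add_passed(input_df), expected)
```

Parsing both input and expected output through `pd.read_csv` keeps their inferred dtypes consistent, which avoids spurious dtype mismatches.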

answered Jan 25 '17 at 13:22

John Zwinck



This is a nice idea but dealing with `csv` files can be annoying if your DataFrame has data in strange encodings, or arrays that you need to `literal_eval`, etc. If her code is structured correctly the input/expected DataFrames should be fairly small and therefore easy to construct on the fly? – rtkaleta Jan 25 '17 at 13:32

score 2 · Answer 7 · answered Aug 13 '21 at 21:13

Pandas has built-in testing functions, but I don't find the output easy to parse, so I created an open source project called beavis with functions that output error messages that are easier for humans to read.

Here's an example of one of the built-in testing methods:

df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])

Here's the error message:


>   ???
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [index]: [0, 1, 2, 3]
E   [left]:  [1042, 2, 9, 6]
E   [right]: [5, 2, 7, 6]

It's not very easy to see which rows are mismatched because the output isn't aligned.

Here's how you can write the same test with beavis.

import beavis

beavis.assert_pd_column_equality(df, "col1", "col2")

This gives you an error message that is easier to read.

The built-in assert_frame_equal doesn't give a readable error message either. Here's how you can compare DataFrame equality with beavis.

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
beavis.assert_pd_equality(df1, df2)
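
If adding a dependency is not an option, pandas itself offers `DataFrame.compare` (available from pandas 1.1), which returns only the differing cells side by side. A small sketch reusing the two frames above:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df2 = pd.DataFrame({"col1": [5, 2], "col2": [3, 4]})

# Keeps only the rows and columns that differ, with "self" and "other" sub-columns
diff = df1.compare(df2)
print(diff)
```

This does not raise on its own, so in a test it would typically be used to build a failure message, for example by asserting that `diff` is empty.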

score 1 · Answer 8 · answered Nov 23 '20 at 16:31

The frame-fixtures Python package (of which I am an author) is designed to make it easy to "create a new dataframe (with values populated)" for unit or performance tests.

For example, if you want to test against a DataFrame of floats and strings with a numerical index, you can use a compact string declaration to generate a DataFrame.

>>> import frame_fixtures as ff
>>> ff.Fixture.to_frame('i(I,int)|v(float,str)|s(4,2)').to_pandas()
              0     1
 34715  1930.40  zaji
-3648  -1760.34  zJnC
 91301  1857.34  zDdR
 30205  1699.34  zuVU

>>> ff.Fixture.to_frame('i(I,int)|v(float,str)|s(8,3)').to_pandas()
               0     1        2
 34715   1930.40  zaji   694.30
-3648   -1760.34  zJnC   -72.96
 91301   1857.34  zDdR  1826.02
 30205   1699.34  zuVU   604.10
 54020    268.96  zKka  1080.40
 129017  3511.58  zJXD  2580.34
 35021   1175.36  zPAQ   700.42
 166924  2925.68  zyps  3338.48
