
If a function or method returns a Pandas DataFrame, how do you document the column names and column types? Is there a way to do this with Python's built-in type annotations, or do you just use docstrings?

If you do just use docstrings, how do you format them to be as succinct as possible?

asked by Keiron Stoddart, edited by Seanny123
    What's the end goal? Do you want sphinx to handle this in some way? Do you want linters to be able to do something with it? Do you want IDE autocomplete to know what columns are there? Or do you just want to add some human readability? – piRSquared Jul 26 '19 at 20:09
  • It's a good question - when I wrote this question I was thinking about pure human readability, although now that you mention it, it would be pretty cool if you were able to code the type annotation/docstring in such a way that linters would pick up on type incompatibilities. Thoughts? – Keiron Stoddart Jul 26 '19 at 20:19
  • What if your DF has 8000 columns or something - how'd you imagine that'd work? Or, if your function might, depending on some criteria, mutate the DF in such a way things could be different on each call? Documenting mutable structures is hard to start with - let alone this... sounds like writing a separate document with those conditions/expectations and referring to that in the doc string sounds more reasonable and just using type annotations to say "I return a DataFrame"... – Jon Clements Jul 26 '19 at 20:35
  • No idea, do you have any suggestions? – Keiron Stoddart Jul 26 '19 at 20:38
  • Just the second half of my comment :) – Jon Clements Jul 26 '19 at 20:39
  • I got you! Yeah, separate documentation referred to in the docstring seems like it would be the best solution in that case. To be honest, when I wrote this question I wasn't even thinking about DataFrames with large dimensionality. – Keiron Stoddart Jul 26 '19 at 20:42
  • I've generally found that mostly I care about 2 or 3 metadata-type columns and the rest I can just put into a catchall, but yeah; sometimes documenting a dataframe isn't really doable. – CJR Jul 26 '19 at 20:50
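The fallback suggested in the comments (a bare pd.DataFrame return annotation plus a pointer to separate documentation) might be sketched like this; the function name and schema-file path are hypothetical:

```python
import pandas as pd


def load_measurements(path: str) -> pd.DataFrame:
    """Load measurement data from a CSV file.

    Column names and dtypes are documented separately in
    docs/measurements_schema.md; the annotation only promises
    that a DataFrame comes back.
    """
    return pd.read_csv(path)
```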

3 Answers


Docstring format

I use the numpy docstring convention as a basis. If a function's input parameter or return parameter is a pandas dataframe with predetermined columns, then I add a reStructuredText-style table with column descriptions to the parameter description. As an example:

import numpy as np
import pandas as pd


def random_dataframe(no_rows):
    """Return dataframe with random data.

    Parameters
    ----------
    no_rows : int
        Desired number of data rows.

    Returns
    -------
    pd.DataFrame
        Dataframe with randomly selected values. Data columns are as follows:

        ==========  ==============================================================
        rand_int    randomly chosen whole numbers (as `int`)
        rand_float  randomly chosen numbers with decimal parts (as `float`)
        rand_color  randomly chosen colors (as `str`)
        rand_bird   randomly chosen birds (as `str`)
        ==========  ==============================================================

    """
    df = pd.DataFrame({
        "rand_int": np.random.randint(0, 100, no_rows),
        "rand_float": np.random.rand(no_rows),
        "rand_color": np.random.choice(['green', 'red', 'blue', 'yellow'], no_rows),
        "rand_bird": np.random.choice(['kiwi', 'duck', 'owl', 'parrot'], no_rows),
    })

    return df

Bonus: sphinx compatibility

The aforementioned docstring format is compatible with the sphinx autodoc documentation generator. This is how the docstring looks in HTML documentation automatically generated by sphinx (using the nature theme):

[screenshot: the docstring rendered in sphinx-generated HTML documentation]
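For reference, a minimal sphinx configuration that parses numpy-convention docstrings. This is a sketch, not from the answer above: it assumes the sphinx.ext.napoleon extension (bundled with sphinx) rather than the standalone numpydoc package.

```python
# conf.py (sketch) - enable autodoc plus napoleon so that
# numpy-style sections like "Parameters" and "Returns" are parsed.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]
napoleon_numpy_docstring = True    # parse numpy-convention docstrings
napoleon_google_docstring = False  # not using Google style here
html_theme = "nature"              # theme used for the screenshot above
```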

Xukrao
  • Very nice, the sphinx compatibility looks great. I should've thought to look at the numpy documentation. Thanks, Xukrao! One quick follow-up question, do you use any type annotation when returning a pd.DataFrame as well? – Keiron Stoddart Aug 02 '19 at 13:55
  • @KeironStoddart Currently I don't write type annotations in function signatures. But I use PyCharm, which is able to automatically infer types from numpy-convention docstrings (and then apply these types in code inspections and auto-completions). So normally the type annotations inside the docstring are all that I need. – Xukrao Aug 02 '19 at 23:32
  • Gotcha, as someone who also uses Pycharm this is pretty cool to hear. I'll have to play around. Thank you! – Keiron Stoddart Aug 05 '19 at 01:35
  • @KeironStoddart Be sure to set PyCharm's Docstring format option (https://www.jetbrains.com/help/pycharm/settings-tools-python-integrated-tools.html) to "NumPy" (or another docstring format of your preference). – Xukrao Aug 05 '19 at 08:17
  • @Xukrao Does this also work with Python's type annotations and type checking? If I try to add a column that is not in the predetermined column list, will I get a warning from PyCharm/IntelliJ or some other type-checking mechanism? – qkhhly Aug 21 '20 at 20:39

I have tried @Xukrao's method; having a summary table is really nice.

Inspired by another Stack Overflow question, I find the csv-table directive more convenient to modify: there is no need to worry about alignment or the rows of "=". For example:

intra_edges (DataFrame): correspondence between intra-edges in
    planar graph and in multilayer graph.

    .. csv-table::
        :header: name, dtype, definition

        source_original (index), object, target in planar graph
        target_original (index), object, target in planar graph
        source, object, current source bus
        target, object, current target bus

inter_edges (DataFrame): correspondence between inter-nodes in
    planar graph and inter-edges in multilayer graph.

    ======  =======  ============================  ==========
    name    dtype    definition                    is_index
    ======  =======  ============================  ==========
    node    object   name in planar graph          True
    upper   int64    integer index of upper layer  False
    lower   int64    integer index of lower layer  False
    source  object   source node in supra graph    False
    target  object   target node in supra graph    False
    ======  =======  ============================  ==========
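Put together as a runnable sketch (the function and its data are made up for illustration, not taken from the snippets above), a csv-table sits inside a numpy-style docstring like this:

```python
import pandas as pd


def random_edges():
    """Return a small edge table (hypothetical example).

    Returns
    -------
    pd.DataFrame
        Edge correspondence data. Data columns are as follows:

        .. csv-table::
            :header: name, dtype, definition

            node, object, name in planar graph
            upper, int64, integer index of upper layer
            lower, int64, integer index of lower layer
    """
    return pd.DataFrame({
        "node": ["a", "b"],
        "upper": [1, 2],
        "lower": [0, 1],
    })
```

Because the directive is plain CSV, adding or renaming a column is a one-line edit with no realignment.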
Edward

I do this for dataframes in docstrings where it's reasonable. Sometimes it's not reasonable.

:param dataframe: pd.DataFrame [M x (3+N)]
    'id': int
        ID column
    'value': int
        Number of things
    'color': str
        Color of things
    Remaining columns are properties; all should be float64s

There's probably a better way to do this, but I haven't found it.
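A runnable sketch of this format (the function, its columns, and the optional runtime guard are illustrative assumptions, not part of the answer above):

```python
import pandas as pd


def count_things(dataframe):
    """Sum the 'value' column of a dataframe of things.

    :param dataframe: pd.DataFrame [M x 3]
        'id': int
            ID column
        'value': int
            Number of things
        'color': str
            Color of things
    :return: int, total number of things
    """
    # Optional runtime guard that mirrors the documented columns.
    missing = {"id", "value", "color"} - set(dataframe.columns)
    if missing:
        raise ValueError(f"missing documented columns: {missing}")
    return int(dataframe["value"].sum())
```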

CJR
  • Thanks CJR, as far as a docstring implementation goes this looks pretty reasonable. I like the inclusion of dimensionality, I hadn't thought of that. – Keiron Stoddart Jul 26 '19 at 20:17