How do you specify a Pandas DataFrame schema/structure in a docstring?

Question

I'd like to describe the DataFrame structure my Python function expects, but a verbal description like:

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        column 1 is called 'thing1' and it is of dtype 'i4'
    """
    ...

feels error prone. Is there a conventional way to describe it? I don't see anything in the Pandas docstring documentation.

Maybe this helps: https://pandas.pydata.org/pandas-docs/stable/development/contributing_docstring.html — Primoz, Aug 01 '19 at 08:37
This link is good general advice, but doesn't appear to describe how to specify DataFrame schema in docstrings. — jkmacc, Feb 06 '20 at 19:43

score 11 · Answer 1 · answered Apr 05 '20 at 10:55

Since there is no official standard, my answer is inevitably opinionated.

ANSWER

I suggest using a description based on the repr() of a Series because each dataframe can be described as a collection of series. It should also be based on the pandas docstring guide for developers.

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        Index:
            RangeIndex
        Columns:
            Name: Date, dtype: datetime64[ns]
            Name: Integer, dtype: int64
            Name: Float, dtype: float64
            Name: Object, dtype: object

    """

Example dataframe:

import pandas as pd

data = [[pd.Timestamp(2020, 1, 1), 1, 1.1, "A"],
        [pd.Timestamp(2020, 1, 2), 2, 2.2, "B"]]
input = pd.DataFrame.from_records(data=data, columns=['Date', 'Integer', 'Float', 'Object'])

Output:

    Date        Integer     Float   Object
0   2020-01-01  1           1.1     A
1   2020-01-02  2           2.2     B

GENERAL DEFINITION

<dataframe name>: pandas.DataFrame
    Index:
        <__repr__ of Index>
            <Optional: Description of index data>
    Columns:
        <last line of __repr__ of pd.Series object of first column>
            <Optional: Description of column data>
        ...
        <last line of __repr__ of pd.Series object of last column>
            <Optional: Description of column data>
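
As a minimal sketch of the definition above, the lines can be pulled straight from the reprs. The helper name schema_from_repr and the small example frame are hypothetical, not part of pandas. It assumes unique column names and short columns with plain dtypes, since the last repr line also carries a Length field for truncated Series and differs for categorical data.

```python
import pandas as pd


def schema_from_repr(df: pd.DataFrame, indent: str = "    ") -> str:
    """Build the Index/Columns block from the reprs of the index and columns.

    Hypothetical helper for illustration; assumes unique column names.
    """
    lines = ["Index:", f"{indent}{df.index!r}", "Columns:"]
    for col in df.columns:
        # For a short, named Series the final repr line holds name and dtype.
        lines.append(indent + repr(df[col]).splitlines()[-1])
    return "\n".join(lines)


# Small illustrative frame with explicit dtypes (an assumption, not the
# example frame from the answer).
example = pd.DataFrame(
    {
        "Integer": pd.Series([1, 2], dtype="int64"),
        "Float": pd.Series([1.1, 2.2], dtype="float64"),
    }
)
schema_text = schema_from_repr(example)
print(schema_text)
```

The printed text can be pasted under the parameter entry in the docstring, with optional descriptions added by hand beneath each line.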

EXPLANATION

There is a detailed discussion of how table data can be standardized. From this discussion, standards such as ISO/IEC 11179, the JSON Table Schema and the W3C Tabular Data Model emerged. However, they are not a good fit for describing a dataframe in a docstring. For example, they require you to consider relationships with other tables, which matters for database applications but not for Pandas dataframes.

My proposed repr-based approach has several advantages:

It respects the opinion of the core developers of Pandas. The repr is what they chose to show about the object.
It is efficient. Let's face it, documentation is difficult. Automation is very simple with this approach. An example can be found below.
It is evolving. If the repr ever changes, the docstring also changes.
It is expandable. If you'd like to include additional metadata, the dataframe object has many more attributes that you can include.

Example of an automatically generated docstring with additional metadata:

# Work on a copy so the example dataframe stays unchanged
df = input.copy()
df = df.set_index('Date')

# Describe the index first, then one line per column
docstring = 'Index:\n'
docstring += f'    {df.index}\n'
docstring += 'Columns:\n'
for col in df.columns:
    docstring += f'    Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}\n'

Output:

Index:
    DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', name='Date', freq=None)
Columns:
    Name: Integer, dtype=int64, nullable: False
    Name: Float, dtype=float64, nullable: False
    Name: Object, dtype=object, nullable: False
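
One possible way to reuse this is to generate the text from a template dataframe and append it to the function's docstring, so that help() always reflects the template. This is only a sketch: describe_columns, document_frame and the template frame are hypothetical names, not pandas API.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame, indent: str = "    ") -> str:
    """Return one 'Name / dtype / nullable' line per column (hypothetical helper)."""
    lines = ["Columns:"]
    for col in df.columns:
        s = df[col]
        lines.append(f"{indent}Name: {s.name}, dtype={s.dtype}, nullable: {s.hasnans}")
    return "\n".join(lines)


def document_frame(example: pd.DataFrame):
    """Hypothetical decorator: append the example's column schema to a docstring."""
    def decorator(func):
        # __doc__ may be None (no docstring, or Python run with -OO).
        func.__doc__ = (func.__doc__ or "") + "\n" + describe_columns(example) + "\n"
        return func
    return decorator


# Template frame with explicit dtypes; an assumption for illustration only.
template = pd.DataFrame(
    {
        "Integer": pd.Series([1, 2], dtype="int64"),
        "Float": pd.Series([1.1, 2.2], dtype="float64"),
    }
)


@document_frame(template)
def myfun(frame):
    """Does a thing."""
    return len(frame)


print(myfun.__doc__)
```

Note that hasnans only reports whether the template currently contains missing values, so the "nullable" label describes the example data rather than a constraint on the column.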

Seems reasonable and it looks nice. It would be great to see something from the PyData library/maintainers. — Steven C. Howell, Aug 24 '20 at 22:41
