I'd like to describe the DataFrame structure my Python function expects, and a verbal description like:

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        column 1 is called 'thing1' and it is of dtype 'i4'
    """
    ...

feels error-prone. Is there a conventional way to describe it? I don't see anything in the pandas docstring documentation.

jkmacc
  • Maybe this helps: https://pandas.pydata.org/pandas-docs/stable/development/contributing_docstring.html – Primoz Aug 01 '19 at 08:37
  • This link is good general advice, but doesn't appear to describe how to specify DataFrame schema in docstrings. – jkmacc Feb 06 '20 at 19:43

1 Answer

Since there is no official standard, my answer is inevitably opinionated.


ANSWER

I suggest using a description based on the repr() of a Series because each dataframe can be described as a collection of series. It should also be based on the pandas docstring guide for developers.

def myfun(input):
    """ Does a thing.
    Parameters
    ----------
    input : pandas.DataFrame
        Index:
            RangeIndex
        Columns:
            Name: Date, dtype: datetime64[ns]
            Name: Integer, dtype: int64
            Name: Float, dtype: float64
            Name: Object, dtype: object

    """

Example dataframe:

import pandas as pd

data = [[pd.Timestamp(2020, 1, 1), 1, 1.1, "A"],
        [pd.Timestamp(2020, 1, 2), 2, 2.2, "B"]]
input = pd.DataFrame.from_records(data=data, columns=['Date', 'Integer', 'Float', 'Object'])

Output:

    Date        Integer     Float   Object
0   2020-01-01  1           1.1     A
1   2020-01-02  2           2.2     B

GENERAL DEFINITION

<dataframe name>: pandas.DataFrame
    Index:
        <__repr__ of Index>
            <Optional: Description of index data>
    Columns:
        <last line of __repr__ of pd.Series object of first column>
            <Optional: Description of column data>
        ...
        <last line of __repr__ of pd.Series object of last column>
            <Optional: Description of column data>
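
As a rough sketch of how this template could be filled in mechanically: the helper below takes the repr of the index and the last line of each column's Series repr. The name schema_lines and its indent parameter are made up for illustration and are not part of pandas. It also assumes the index repr fits on one line, which holds for a short RangeIndex but not necessarily for long or multi-level indexes.

```python
import pandas as pd


def schema_lines(df: pd.DataFrame, indent: str = "    ") -> str:
    """Build an Index/Columns block from the reprs of df's index and columns.

    Hypothetical helper for illustration; not a pandas API.
    """
    lines = ["Index:", f"{indent}{df.index!r}", "Columns:"]
    for _, column in df.items():
        # The last line of a Series repr is its footer,
        # e.g. "Name: Integer, dtype: int64".
        footer = repr(column).splitlines()[-1]
        lines.append(f"{indent}{footer}")
    return "\n".join(lines)


# dtypes are given explicitly so the result does not depend on platform defaults
example = pd.DataFrame(
    {
        "Integer": pd.Series([1, 2], dtype="int64"),
        "Float": pd.Series([1.1, 2.2], dtype="float64"),
    }
)
print(schema_lines(example))
```

For this two-column frame the printed block has "Index:" followed by RangeIndex(start=0, stop=2, step=1), then "Columns:" followed by "Name: Integer, dtype: int64" and "Name: Float, dtype: float64". The optional description lines from the template still have to be written by hand.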

EXPLANATION

There is a detailed discussion of how table data can be standardized. From this discussion, standards such as ISO/IEC 11179, the JSON Table Schema and the W3C Tabular Data Model emerged. However, they are not a perfect fit for describing a dataframe in a docstring. For example, they expect you to consider relationships with other tables, which is important for database applications, but not for Pandas dataframes.
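
For comparison, pandas can emit a JSON Table Schema description of a dataframe through pandas.io.json.build_table_schema. The snippet below is only meant to show what that format looks like next to the repr-based one. The example frame is made up, and index=False and version=False are passed to keep the result small.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Made-up example frame with explicit dtypes
frame = pd.DataFrame(
    {
        "Integer": pd.Series([1, 2], dtype="int64"),
        "Float": pd.Series([1.1, 2.2], dtype="float64"),
    }
)

# Describe only the columns: skip the index and the pandas version entry
schema = build_table_schema(frame, index=False, version=False)

for field in schema["fields"]:
    # Each field is a dict holding at least a column name and a Table Schema type
    print(field["name"], field["type"])
```

The Table Schema types are more generic than pandas dtypes (int64 becomes "integer", float64 becomes "number"), so some of the detail a docstring reader may want is lost.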

My proposed repr-based approach has several advantages:

  • It respects the opinion of the core developers of Pandas. The repr is what they decided we should see about the object.
  • It is efficient. Let's face it, documentation is difficult. Automation is very simple with this approach. An example can be found below.
  • It is evolving. If the repr ever changes, the docstring also changes.
  • It is expandable. If you would like to include additional metadata, the dataframe object has many more attributes that you can include.

Example of an automatically generated docstring with additional metadata:

# Work on a copy of the example dataframe and promote 'Date' to the index
df = input.copy()
df = df.set_index('Date')

docstring = 'Index:\n'
docstring += f'    {df.index}\n'
docstring += 'Columns:\n'
for col in df.columns:
    # One line per column: name, dtype and whether it contains missing values
    docstring += f'    Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}\n'

print(docstring)

Output:

Index:
    DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', name='Date', freq=None)
Columns:
    Name: Integer, dtype=int64, nullable: False
    Name: Float, dtype=float64, nullable: False
    Name: Object, dtype=object, nullable: False
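
One possible way to get such generated text into a real docstring is a small decorator that appends it to the function's __doc__. This is only a sketch: frame_schema and with_frame_schema are hypothetical names, not pandas functionality, and the approach assumes you have a representative example dataframe available at import time.

```python
import pandas as pd


def frame_schema(df: pd.DataFrame) -> str:
    """Hypothetical helper: same text generation as in the answer's example."""
    text = "Index:\n"
    text += f"    {df.index}\n"
    text += "Columns:\n"
    for name, col in df.items():
        text += f"    Name: {name}, dtype={col.dtype}, nullable: {col.hasnans}\n"
    return text


def with_frame_schema(example: pd.DataFrame):
    """Hypothetical decorator: append the schema of `example` to the docstring."""
    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + "\n" + frame_schema(example)
        return func
    return decorator


# Made-up example frame with explicit dtypes
example_frame = pd.DataFrame(
    {
        "Integer": pd.Series([1, 2], dtype="int64"),
        "Float": pd.Series([1.1, 2.2], dtype="float64"),
    }
)


@with_frame_schema(example_frame)
def myfun(input):
    """Does a thing."""
    return input


print(myfun.__doc__)
```

The appended text is not indented to match numpydoc sections, so treat it as a starting point rather than finished documentation.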
above_c_level