Since there is no official standard, my answer is inevitably opinionated.
ANSWER
I suggest using a description based on the repr() of a Series, because every dataframe can be described as a collection of series. The layout should also follow the pandas docstring guide for developers.
def myfun(input):
    """Does a thing.

    Parameters
    ----------
    input : pandas.DataFrame
        Index:
            RangeIndex
        Columns:
            Name: Date, dtype: datetime64[ns]
            Name: Integer, dtype: int64
            Name: Float, dtype: float64
            Name: Object, dtype: object
    """
Example dataframe:
import pandas as pd

data = [[pd.Timestamp(2020, 1, 1), 1, 1.1, "A"],
        [pd.Timestamp(2020, 1, 2), 2, 2.2, "B"]]
input = pd.DataFrame.from_records(data=data, columns=['Date', 'Integer', 'Float', 'Object'])
Output:
        Date  Integer  Float Object
0 2020-01-01        1    1.1      A
1 2020-01-02        2    2.2      B
GENERAL DEFINITION
<dataframe name>: pandas.DataFrame
    Index:
        <__repr__ of Index>
        <Optional: Description of index data>
    Columns:
        <last line of __repr__ of pd.Series object of first column>
        <Optional: Description of column data>
        ...
        <last line of __repr__ of pd.Series object of last column>
        <Optional: Description of column data>
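The general definition can be filled in mechanically. Here is a minimal sketch, assuming a helper called describe_frame (a hypothetical name, not part of pandas) that takes the repr of the index and the last line of each column's Series repr:

```python
import pandas as pd


def describe_frame(df: pd.DataFrame, name: str = "input") -> str:
    """Build a docstring fragment following the general definition (hypothetical helper)."""
    lines = [
        f"{name} : pandas.DataFrame",
        "    Index:",
        f"        {df.index!r}",
        "    Columns:",
    ]
    for position in range(df.shape[1]):
        # Positional access also copes with duplicate column names.
        column = df.iloc[:, position]
        # The last line of a Series repr carries its name and dtype.
        lines.append(f"        {repr(column).splitlines()[-1]}")
    return "\n".join(lines)


# Small sample frame, made up for this sketch.
frame = pd.DataFrame({"Integer": [1, 2], "Float": [1.1, 2.2]})
print(describe_frame(frame))
```

Note that for long columns pandas truncates the repr, and the last line then also reports the length, so a small sample frame keeps the description compact. The optional description lines still have to be written by hand.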
EXPLANATION
There has been detailed discussion of how tabular data can be standardized, and standards such as ISO/IEC 11179, the JSON Table Schema, and the W3C Tabular Data Model emerged from it. However, none of them is a good fit for describing a dataframe in a docstring. For example, they require you to consider relationships with other tables, which matters for database applications but not for pandas dataframes.
My proposed repr-based approach has several advantages:
- It respects the opinion of the core developers of pandas. The repr is what they decided we should see about the object.
- It is efficient. Let's face it, documentation is difficult. Automation is very simple with this approach. An example can be found below.
- It evolves with pandas. If the repr ever changes, a regenerated docstring changes with it.
- It is expandable. If you would like to include additional metadata, the dataframe object has many more attributes that you can include.
Example of an automatically generated docstring with additional metadata:
df = input.copy()
df = df.set_index('Date')

# Build the description line by line: first the index, then one line per column.
docstring = 'Index:\n'
docstring = docstring + f' {df.index}\n'
docstring = docstring + 'Columns:\n'
for col in df.columns:
    docstring = docstring + f' Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}\n'
print(docstring)
Output:
Index:
 DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', name='Date', freq=None)
Columns:
 Name: Integer, dtype=int64, nullable: False
 Name: Float, dtype=float64, nullable: False
 Name: Object, dtype=object, nullable: False
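One way to put a generated description to use is to splice it into a function's __doc__ when the module is imported. The sketch below is only an illustration under stated assumptions: describe_columns, SAMPLE, and the {input_schema} placeholder are hypothetical names, and the sample frame stands in for whatever data the real function receives.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame, indent: str = " " * 12) -> str:
    """Return one 'Name: ..., dtype=..., nullable: ...' line per column (hypothetical helper)."""
    lines = [
        f"{indent}Name: {df[col].name}, dtype={df[col].dtype}, nullable: {df[col].hasnans}"
        for col in df.columns
    ]
    return "\n".join(lines)


# Small sample frame that stands in for the real input (an assumption of this sketch).
SAMPLE = pd.DataFrame({"Integer": [1, 2], "Float": [1.1, 2.2]})


def myfun(input):
    return input


# Fill the placeholder once, when the module is imported.
myfun.__doc__ = """Does a thing.

    Parameters
    ----------
    input : pandas.DataFrame
        Columns:
{input_schema}
    """.format(input_schema=describe_columns(SAMPLE))

print(myfun.__doc__)
```

The trade-off is that the docstring then reflects the sample frame rather than the data actually passed in, so the sample has to be kept representative.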