5

I am writing a function that returns a Pandas DataFrame object. I would like to have some kind of type hint for the columns this DataFrame contains, beyond a mere specification in the documentation, as I feel this will make it much easier for the end user to read the data.

Is there a way to type hint DataFrame contents that tools like Visual Studio Code and PyCharm would support, both when editing Python files and when editing Jupyter Notebooks?

An example function:


import pandas as pd


def generate_data(bunch, of, inputs) -> pd.DataFrame:
    """Massages the input to a nice and easy DataFrame.

    :return:
        DataFrame with columns a (int), b (float), c (string), d (US dollars as float)
    """
BigBen

Mikko Ohtamaa
    Use hinterland extension: https://towardsdatascience.com/12-jupyter-notebook-extensions-that-will-make-your-life-easier-e0aae0bd181 – Gedas Miksenas Apr 20 '23 at 13:07

4 Answers

3

As far as I am aware, there is no way to do this with just core Python and pandas.

I would recommend using pandera. It has a broader scope, but validating DataFrame column types is one of its capabilities.

pandera can also be used in conjunction with pydantic, for which dedicated VS Code (via Pylance) and PyCharm plugins are in turn available. A minimal sketch combining the two is shown below.
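Here is a minimal sketch for the OP's function, based on pandera's documented pydantic integration. The schema name OutputSchema and the wrapper model Result are my own, and compatibility depends on your pandera and pydantic versions:

import pandas as pd
import pandera as pa
import pydantic
from pandera.typing import DataFrame

class OutputSchema(pa.DataFrameModel):
    a: int
    b: float
    c: str
    d: float  # US dollars as float

def generate_data(bunch, of, inputs) -> DataFrame[OutputSchema]:
    df = pd.DataFrame({"a": [1], "b": [1.0], "c": ["x"], "d": [9.99]})
    # Validated at runtime; raises a SchemaError on mismatch
    return DataFrame[OutputSchema](df)

class Result(pydantic.BaseModel):
    # pandera re-validates the frame when the pydantic model is built
    data: DataFrame[OutputSchema]

result = Result(data=generate_data("bunch", "of", "inputs"))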

Arne
    For a 500 point bounty I think it would be worth supplying examples that can satisfy the function OP has declared. – flakes Apr 21 '23 at 04:40
2

The most powerful project for strong typing of pandas DataFrames as of now (Apr 2023) is pandera. Unfortunately, what it offers is still quite limited and far from what we might have wanted.

Here is an example of how you can use pandera in your case:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame

class MySchema(pa.DataFrameModel):
    a: int
    b: float
    c: str = pa.Field(nullable=True)  # For example, allow None values
    d: float    # US dollars

class OtherSchema(pa.DataFrameModel):
    year: int = pa.Field(ge=1900, le=2050)


def generate_data() -> DataFrame[MySchema]:
    df = pd.DataFrame({
        "a": [1, 2, 3],
        "b": [10.0, 20.0, 30.0],
        "c": ["A", "B", "C"],
        "d": [0.1, 0.2, 0.3],
    })

    # Runtime verification here, throws on schema mismatch
    strongly_typed_df = DataFrame[MySchema](df)
    return strongly_typed_df

def transform(input: DataFrame[MySchema]) -> DataFrame[OtherSchema]:
    # This demonstrates that you can use strongly
    # typed column names from the schema
    df = input.filter(items=[MySchema.a]).rename(
            columns={MySchema.a: OtherSchema.year}
    )

    return DataFrame[OtherSchema](df) # This will throw on range validation!


df1 = generate_data()
df2 = transform(df1)  # raises at runtime: the year values fail the range check
transform(df2)        # mypy flags this line statically - incompatible type!

You can see mypy producing a static type-check error on the last line:

[Screenshot: mypy reports an incompatible-type error on the last line]

Discussion of advantages and limitations

With pandera we get –

  1. Clear and readable (dataclass style) DataFrame schema definitions and ability to use them as type hints.
  2. Run-time schema verification. A schema can define even more constraints than just types (see year in the example above and the pandera docs for more).
  3. Experimental support for static type checking by mypy.

What we are still missing –

  1. Full static type checking for column level verification.
  2. Any IDE support for column name auto-completion.
  3. Inline syntax for schema declaration; each schema has to be explicitly defined as a separate class before use.

More examples

Pandera docs - https://pandera.readthedocs.io/en/stable/dataframe_models.html

Similar question - Type hints for a pandas DataFrame with mixed dtypes

Other typing projects

pandas-stubs is an active project providing type declarations for the pandas public API that are richer than the type stubs included in pandas itself. However, it doesn't provide any facilities for column-level schemas.
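For illustration, here is roughly what the stubs buy you (a sketch; the exact diagnostics depend on the stub and mypy versions):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# With pandas-stubs installed, mypy knows head() takes an int and
# returns a DataFrame, so a call like df.head("two") is flagged
# statically. Column names and dtypes remain invisible to the checker.
first_rows: pd.DataFrame = df.head(n=2)
print(first_rows)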

There are quite a few outdated libraries related to this and to pandas typing in general: dataenforce, data-science-types, python-type-stubs.

pandera provides two different APIs that seem to be equally powerful: an object-based API and a class-based API. I demonstrate the latter here; a rough object-based equivalent is sketched below.
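For comparison, here is MySchema from above translated to the object-based API (my own translation; see the pandera docs for the authoritative syntax):

import pandas as pd
import pandera as pa

# Object-based equivalent of the class-based MySchema above
my_schema = pa.DataFrameSchema({
    "a": pa.Column(int),
    "b": pa.Column(float),
    "c": pa.Column(str, nullable=True),
    "d": pa.Column(float),  # US dollars
})

df = pd.DataFrame({"a": [1], "b": [10.0], "c": ["A"], "d": [0.1]})
validated = my_schema.validate(df)  # raises a SchemaError on mismatch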

vvv444
1

Arne is right, Python's type hinting does not have any native, out-of-the-box support for specifying column types in a Pandas DataFrame.

You could, for example, pair the docstring with a custom type:

from typing import NamedTuple
import pandas as pd

class MyDataFrame(NamedTuple):
    a: int
    b: float
    c: str
    d: float  # US dollars as float

def generate_data(bunch, of, inputs) -> pd.DataFrame:
    """Massages the input to a nice and easy DataFrame.

    :return:
        DataFrame with columns a(int), b(float), c(string), d(us dollars as float)
    """
    # Your implementation here
    pass

This is a sample approach you could take. It defines a custom NamedTuple called MyDataFrame. Of course, it's not strictly type-hinting the DataFrame, and IDEs and type-checking tools won't enforce it, but it provides a hint to the user about the expected structure of the output DataFrame. One way to make the NamedTuple do real work is sketched below.
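As a small extension (my own sketch, not part of the approach above), the same NamedTuple can also be used to construct the frame, so the hint and the data cannot drift apart:

from typing import NamedTuple

import pandas as pd

class MyDataFrame(NamedTuple):
    a: int
    b: float
    c: str
    d: float  # US dollars as float

# pandas infers the column names from the NamedTuple fields
rows = [
    MyDataFrame(a=1, b=10.0, c="A", d=0.99),
    MyDataFrame(a=2, b=20.0, c="B", d=2.49),
]
df = pd.DataFrame(rows)
print(df.dtypes)  # a: int64, b: float64, c: object, d: float64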

An alternative approach is to use a custom type alias and a docstring:

from typing import Any
import pandas as pd

# Define a type alias for better documentation
DataFrameWithColumns = pd.DataFrame

def generate_data(bunch: Any, of: Any, inputs: Any) -> DataFrameWithColumns:
    """
    Massages the input to a nice and easy DataFrame.

    :param bunch: Description of the input parameter 'bunch'
    :param of: Description of the input parameter 'of'
    :param inputs: Description of the input parameter 'inputs'

    :return: DataFrame with columns:
        a (int): Description of column 'a'
        b (float): Description of column 'b'
        c (str): Description of column 'c'
        d (float): US dollars as float, description of column 'd'
    """
    # Your implementation here
    pass

Here, you define a custom type alias for pd.DataFrame to represent the expected output DataFrame, which could be helpful to end users. A variant that attaches the metadata to the alias itself is sketched below.
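On Python 3.9+ you can also carry the column description in the type itself with typing.Annotated (my own sketch; type checkers still see a plain pd.DataFrame, but the metadata can be read back via typing.get_type_hints(..., include_extras=True)):

from typing import Annotated

import pandas as pd

# Type checkers treat this as pd.DataFrame; the string is metadata only.
DataFrameABCD = Annotated[
    pd.DataFrame,
    "columns: a (int), b (float), c (str), d (float, US dollars)",
]

def generate_data(bunch, of, inputs) -> DataFrameABCD:
    """Massages the input to a nice and easy DataFrame."""
    ...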

Deepak Thomas
1

I'm not sure I fully understand what you expect. Isn't df.info() sufficient to help users?

>>> df.info()
<class '__main__.MyDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      float64
 2   c       3 non-null      object 
 3   d       3 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 224.0+ bytes

If not, you can subclass DataFrame and override methods like info and __repr__. You can store additional information in the attrs dictionary and use it in these methods. Here is an example:

import pandas as pd

class MyDataFrame(pd.DataFrame):

    def info(self):
        super().info()
        s = '\nMore information as footer:\n'
        s += self.attrs.get('more_info', '')
        print(s)

    def __repr__(self):
        s = 'More information as header:\n'
        s += f"{self.attrs.get('more_info', '')}\n\n"
        s += super().__repr__()
        return s

    @property
    def _constructor(self):
        # Ensure that slices and copies stay MyDataFrame instances
        return MyDataFrame

def generate_data(bunch, of, inputs) -> pd.DataFrame:
    df = MyDataFrame({'a': [0, 1, 2], 'b': [1.0, 1.1, 1.2],
                      'c': ['A', 'B', 'C'], 'd': [0.99, 2.49, 3.99]})
    df.attrs = {
        'more_info': 'Additional information here'
    }
    return df

df = generate_data('nothing', 'to', 'do')

Usage:

>>> df
More information as header:  # <- HERE
Additional information here  # <- HERE

   a    b  c     d
0  0  1.0  A  0.99
1  1  1.1  B  2.49
2  2  1.2  C  3.99
>>> df.info()
<class '__main__.MyDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      float64
 2   c       3 non-null      object 
 3   d       3 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 224.0+ bytes

More information as footer:  # <- HERE
Additional information here  # <- HERE
>>> df[['a', 'b']]
More information as header:
Additional information here

   a    b
0  0  1.0
1  1  1.1
2  2  1.2

I just used a simple string, but you can have a more complex attrs structure and a special function to display this dict (checking which columns exist and avoiding the display of useless information); a sketch of that idea follows. I hope this helps.
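A minimal sketch of that idea (the attrs key column_info and the method describe_columns are my own names, not pandas API; how far attrs propagates through operations depends on your pandas version):

import pandas as pd

class DescribedDataFrame(pd.DataFrame):

    @property
    def _constructor(self):
        return DescribedDataFrame

    def describe_columns(self) -> str:
        # attrs['column_info'] maps column names to descriptions;
        # only columns actually present are shown, so the output
        # stays accurate after slicing.
        info = self.attrs.get('column_info', {})
        return '\n'.join(
            f'{col}: {info[col]}' for col in self.columns if col in info
        )

df = DescribedDataFrame({'a': [0, 1], 'd': [0.99, 2.49]})
df.attrs['column_info'] = {'a': 'identifier (int)',
                           'd': 'price in US dollars (float)'}
print(df.describe_columns())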

Corralien