
Let's take a simple function that takes a str and returns a dataframe:

import pandas as pd
def csv_to_df(path):
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

What is the recommended pythonic way of adding type hints to this function?

If I ask Python for the type of a DataFrame it returns pandas.core.frame.DataFrame. The following won't work though, as it'll tell me that pandas is not defined.

 def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
     return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
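The NameError shows up as soon as the `def` statement runs, because annotations are evaluated at function-definition time (unless `from __future__ import annotations` is in effect), so the bare name `pandas` must already be bound. A minimal sketch, reproducible even without pandas installed:

```python
# Annotations are evaluated when the def statement executes, so an
# unbound name in the return annotation fails immediately:
try:
    def csv_to_df(path: str) -> pandas.core.frame.DataFrame:  # noqa: F821
        pass
except NameError as e:
    print(e)  # name 'pandas' is not defined
```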
Martijn Pieters
Daniel
    But you're using the `pd` alias, and you can probably define custom types. – Moses Koledoye May 10 '17 at 11:15
  • @MosesKoledoye if I try pd.core.frame.DataFrame I'll get an `AttributeError` instead of a `NameError`. – Daniel May 10 '17 at 11:16
  • I am not an authority on "pythonicity" but I would recommend doc-strings (using `''' this function takes an inputType and returns an outputType '''`); this is also what will be shown if someone calls `help(yourFunction)` on your function. – Chris May 10 '17 at 11:22
  • The library `dataenforce` allows checking data types inside the data frame: https://github.com/CedricFR/dataenforce – 00schneider Apr 21 '20 at 13:49
  • Related on r/learnpython: [How to specify pandas type-hint with columns](https://www.reddit.com/r/learnpython/comments/103xp7l/how_to_specify_pandas_typehint_with_columns/) – starball Apr 15 '23 at 02:23

6 Answers


Why not just use pd.DataFrame?

import pandas as pd
def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Result is the same:

> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
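Since the hint here is the class object itself, this can also be verified at runtime; a quick sketch (assuming pandas is installed):

```python
import typing

import pandas as pd

def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

# The annotation resolves to the actual DataFrame class, not a string
# or a proxy object:
hints = typing.get_type_hints(csv_to_df)
print(hints['return'] is pd.DataFrame)  # True
```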
Edgar Ortega
Georgy
  • It also won't allow specifying dtypes for specific columns, which could be extremely useful – Philipp_Kats Sep 10 '19 at 18:46
    @Philipp_Kats Currently there is no way to specify dtypes for DataFrame columns in type hints, and [I haven't seen](https://github.com/pandas-dev/pandas/issues/25601) any work done in this direction (correct me if I'm wrong). Linking a related question on type hints with NumPy and dtypes: [*Type hint for NumPy ndarray dtype?*](https://stackoverflow.com/q/54503964). You will see that it's also [not implemented there yet](https://github.com/numpy/numpy-stubs/issues/7). – Georgy Sep 10 '19 at 19:59
    This gives an error in mypy `error: No library stub file for module 'pandas'` – user2304916 Nov 08 '19 at 00:06
  • @user2304916 See [Unable to suppress `No library stub file for module...` error](https://github.com/python/mypy/issues/3905). – Georgy Nov 08 '19 at 08:57
  • `pd.DataFrame` doesn't tell much unfortunately. The underlying df could have literally any shape and you wouldn't know. – async Jul 10 '22 at 09:22
  • Learning about the shape and column names / types of a dataframe is different than knowing that the type of the object is a dataframe. Consider the difference between making sure a variable is an int, and making sure that int is greater than 3. It's the same distinction. – Nesha25 Apr 17 '23 at 14:53
    @Nesha25 It is also similar to the difference between a `list[int]` and `list[str]`. Without the type parameter telling you what's "inside" the list, you don't really know what you can legally do with the contents. The same applies to dataframes. Additionally, your example of an "int greater than 3" is unusual indeed, but such "value constraints" are not so unusual in type systems - consider for example a "a non-null pointer", "non-zero divisor" or "object with validated email address". Such types are used in many places. – jonaslb May 01 '23 at 15:21

I'm currently doing the following:

import pandas as pd
from typing import TypeVar

PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')

def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Which gives:

> help(csv_to_df)
Help on function csv_to_df in module __main__:

csv_to_df(path:str) -> ~pandas.core.frame.DataFrame

Don't know how pythonic that is, but it's understandable enough as a type hint, I find.

Daniel
    @Azat Ibrakov would you mind elaborating on your comment? Sometimes I'm not sure what is and isn't 'pythonic'. – Tom Roth Apr 19 '18 at 05:34
    I see people downvoting this answer. For context, this was the solution I found for my own question, and for all intents and purposes it works just fine. The more pythonic solution above, which I accepted as correct answer (but does have its own perks, see comments), was only provided 8 months afterwards. – Daniel Nov 12 '19 at 17:47
    It's not pythonic since it is less clear and harder to maintain than the accepted answer for this question. Since the type path here is not verified by the compiler it won't raise errors if it's wrong. This could happen from a typo in your `TypeVar` arg or change to the module itself. – Alex Apr 17 '20 at 16:27
    I receive a warning when I use this: `The argument to 'TypeVar()' must be a string equal to the variable name to which it is assigned ` – Victor M Perez Feb 04 '21 at 10:35
  • @Azat Ibrakov These "pythonic" and "not pythonic" arguments are like a mantra for many "Pythonists". I think we should stop arguments in this style. I had never heard this type of argumentation from e.g. a Java developer. In my opinion, there is nothing wrong with this solution. – uetoyo Aug 16 '21 at 07:18
    This is not the correct use of a type variable. A `TypeVar` exists to link two types together ([mypy docs](https://mypy.readthedocs.io/en/stable/generics.html)). You probably meant a type _alias_: `PandasDataFrame = pandas.core.frame.DataFrame` – decorator-factory May 31 '22 at 08:39
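As the last comment points out, what was probably intended here is a type alias rather than a `TypeVar`; a minimal sketch of that alternative (just a second name for the same class, so `help()` and static checkers both see the real type):

```python
import pandas as pd

# A type alias, not a TypeVar: PandasDataFrame is literally the same
# object as pd.DataFrame, so no string lookup or misuse is involved.
PandasDataFrame = pd.DataFrame

def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

print(PandasDataFrame is pd.DataFrame)  # True
```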

There is now a pip package that can help with this: https://github.com/CedricFR/dataenforce

You can install it with pip install dataenforce and use very pythonic type hints like:

from dataenforce import Dataset

def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    pass
luksfarris

Check out the answer given here which explains the usage of the package data-science-types.

pip install data-science-types

Demo

# program.py

import pandas as pd

df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment

Run mypy on it as usual:

$ mypy program.py

kevin_theinfinityfund
    Unfortunately, this is buried at bottom. **In 2021 this is the best answer.** Note too the comment by Daniel Malachov following the linked answer (https://stackoverflow.com/a/63446142/8419574). – user3897315 Nov 17 '21 at 01:48
    @user3897315 - I disagree that this is the best answer in 2021. If you visit [data-science-types on GitHub](https://github.com/predictive-analytics-lab/data-science-types) you'll find the repository has been archived, and the README updated (on Feb 16 2021) with the following note: "⚠️ **this project has mostly stopped development** ⚠️ The pandas team and the numpy team are both in the process of integrating type stubs into their codebases, and we don't see the point of competing with them." – blthayer Dec 03 '21 at 22:45
    I agree, but following that I don't see a timeline when pandas or numpy will have these pushed or ETA in their roadmap. – kevin_theinfinityfund Dec 08 '21 at 04:18

Take a look at pandera.

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust. Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings.


The advantage of pandera is that you can also specify dtypes of individual DataFrame columns. The following example uses pandera to enforce, at runtime, a DataFrame containing a single column of integers:

import pandas as pd
import pandera
from pandera.typing import DataFrame, Series

class Integers(pandera.SchemaModel):
    number: Series[int]

@pandera.check_types
def my_fn(a: DataFrame[Integers]) -> None:
    pass

# This works
df = pd.DataFrame({"number": [2002, 2003]})
my_fn(df)

# Raises an exception
df = pd.DataFrame({"number": [2002.0, 2003]})
my_fn(df)

# Raises an exception
df = pd.DataFrame({"number": ['2002', 2003]})
my_fn(df)
user64150
Dvir Berebi

This is straying from the original question, but building off of @dangom's answer using `TypeVar` and @Georgy's comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple workaround like this to specify datatypes in a DataFrame:

import pandas as pd
from typing import TypeVar

DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")

def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
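On Python 3.9+, one way to carry the same column-dtype note without misusing `TypeVar` is `typing.Annotated`, which wraps the real `pd.DataFrame` type with free-form metadata; a sketch (the metadata string is purely informational and not checked by anything):

```python
from typing import Annotated, get_args

import pandas as pd

# Annotated keeps the real runtime type (pd.DataFrame) while attaching
# metadata that tools or readers can inspect; checkers treat the hint
# as plain pd.DataFrame.
DataFrameStr = Annotated[pd.DataFrame, "all columns are str"]

def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

print(get_args(DataFrameStr)[0] is pd.DataFrame)  # True
```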
Keith