Pandas .loc and PEP8

Question

I've tried to search this a number of times but I don't see it answered so here goes...

I often use pandas to clean up a dataframe and conform it to my needs. With this comes a lot of .loc accessing to query it and return values. Depending on what I am doing (and the column name lengths), this can get pretty lengthy. Given that PEP 8 limits lines to 79 characters, are there any best practices? Some examples below (these are simplified and for explanatory purposes):

missing_address_df = address_df.loc[address_df['address'].notnull()].copy()

or multiple query points:

nc_drive_df = address_df.loc[(address_df['address'].str.contains('drive')) & (address_df['state'] == 'NC')]

You can split chained method calls onto multiple lines - I'm pretty sure there are examples in PEP8 itself. — MattDMo, Oct 24 '20 at 19:14
@MattDMo yes, I've looked some at chain method calls. However, I am referring more to the case when the .loc line is so long, that itself spills far beyond 79 characters. ex - address_to_be_df['address name column'] = address_to_be_df['address name columns'].astype('int'). In this example, I am not really method chaining. I am wondering if there are more visually appealing ways to break out the variable selection. — Tom Watson, Oct 24 '20 at 19:23

ti7 · Accepted Answer · 2023-07-24T16:06:51.927

I'd advise two things

Ignore PEP 8's 79-character limit, but try to keep lines to 120 or 150 characters
Keeping some line length requirement makes sense to aid readability, but if you're trying to keep to 80 chars in (for example) a class method, it will lead to worse and less-readable code

PEP 8 actually has a section on this, A Foolish Consistency is the Hobgoblin of Little Minds, which describes cases where you should deviate from its other advice, for example:
1. When applying the guideline would make the code less readable, even for someone who is used to reading code that follows this PEP

Split the .loc contents onto multiple lines

nc_drive_df = address_df.loc[
    (address_df['address'].str.contains('drive')) & \
    (address_df['state'] == 'NC')
]
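
Another way to keep each line short is to give every condition a name first and combine the names inside .loc. This is only a minimal sketch: the sample rows and the mask names is_drive and in_nc are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's address_df
address_df = pd.DataFrame({
    "address": ["12 oak drive", "9 elm street", "4 pine drive"],
    "state": ["NC", "NC", "VA"],
})

# Name each condition so every line stays short and self-describing
is_drive = address_df["address"].str.contains("drive")
in_nc = address_df["state"] == "NC"

# Combine the named masks; .copy() returns an independent DataFrame
nc_drive_df = address_df.loc[is_drive & in_nc].copy()
print(nc_drive_df)
```

Whether named masks read better than an inline multi-line expression is a judgment call; they tend to help most when the same condition is reused in several selections.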

It's hard to be objective about when code "looks bad" despite being valid syntax, but you will recognize it when you run into it. In practice, PEP 8 linters and cyclomatic complexity checkers give you measurable grounds on which to propose, defend, or push back on a code style.

If you have a great many boolean conditions, you should (and often must) group them with parentheses to make their order of evaluation explicit

nc_drive_df = address_df.loc[
    (
        (address_df['address'].str.contains('drive')) & \
        (address_df['state'] == 'NC')
    ) | (
        address_df['zip'] == "00000"
    )
]
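
To see the grouping at work, here is a small runnable sketch. The three sample rows are invented for illustration; the column names follow the example above. Each comparison gets its own parentheses because & and | bind more tightly than == in Python, and the outer group spells out that the "and" is evaluated before the "or".

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the example above
address_df = pd.DataFrame({
    "address": ["12 oak drive", "9 elm street", "4 pine drive"],
    "state": ["NC", "NC", "VA"],
    "zip": ["27601", "00000", "22301"],
})

# (drive AND NC) OR zip is "00000"; no backslashes are needed inside brackets
mask = (
    (address_df["address"].str.contains("drive")) &
    (address_df["state"] == "NC")
) | (
    address_df["zip"] == "00000"
)

selected = address_df.loc[mask]
print(selected)
```

With this data the first row matches the drive-and-NC group, the second matches only the zip condition, and the third matches neither.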

This is somewhat in conflict with PEP 8's suggestion to break before binary operators, so that each operator starts a line. I depart from that when forming a boolean array: the member Series must be the same shape to get a good result, and when working with several DataFrames it's generally easier to observe and reason about them when they're visually aligned.

Finally, when doing scientific Python you should often try several approaches against both a partial and the full dataset, where possible, to draw sound performance conclusions. Treat readability as secondary to that, and value excellent comments that describe and link to your research over any particular style.
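
As a rough sketch of that kind of comparison, the snippet below times two variants of the same selection on a sample and on the full frame using the standard-library timeit module. The data, the function names with_regex and without_regex, and the choice of comparing regex against plain substring matching are all assumptions made for illustration; substitute whatever alternatives you are actually weighing.

```python
import timeit

import pandas as pd

# Hypothetical data: 30,000 rows built by repeating three sample rows
address_df = pd.DataFrame({
    "address": ["12 oak drive", "9 elm street", "4 pine drive"] * 10_000,
    "state": ["NC", "NC", "VA"] * 10_000,
})


def with_regex(df):
    # Default behaviour: the pattern is treated as a regular expression
    return df.loc[
        (df["address"].str.contains("drive")) &
        (df["state"] == "NC")
    ]


def without_regex(df):
    # regex=False asks for a plain substring match instead
    return df.loc[
        (df["address"].str.contains("drive", regex=False)) &
        (df["state"] == "NC")
    ]


# A partial sample and the full frame
sample_df = address_df.sample(frac=0.1, random_state=0)

for label, df in [("sample", sample_df), ("full", address_df)]:
    for func in (with_regex, without_regex):
        seconds = timeit.timeit(lambda: func(df), number=20)
        print(f"{label:>6} {func.__name__:<14} {seconds:.4f}s")
```

Timings vary by machine, pandas version, and data, so treat the printed numbers as something to measure locally, not as a general result.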

True - some linters request it, while it can also interrupt some searching (invalidates `\s*`)! — ti7, Oct 24 '20 at 19:48
Any linter that requires a `\` in this situation is a bad linter. — Marius Gedminas, Oct 31 '20 at 18:27
Also, keep the operator '&' to the left of whatever it is operating on — Ricardo Udenze, Mar 05 '21 at 16:30
@RicardoUdenze that's absolutely a personal style choice and not one I would recommend - if there's a variety of operators, explicitly [provide parentheses to clarify the desired order of operations](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing) — ti7, Mar 05 '21 at 16:40
