I am taking a Data Science course about data analysis in Python. At one point in the course the professor says:

You can chain operations together. For instance, we could have rewritten the query for all Store 1 costs as df.loc['Store 1']['Cost']. This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data though, this is an important distinction and can be a source of error.

Later on, he describes chain indexing as:

Generally bad, pandas could return a copy or a view depending upon NumPy

So, he suggests using multi-axis indexing (df.loc['a', '1']).
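To make the two spellings concrete, here is a minimal sketch. The frame below is made up for illustration (it is not the course's actual data); it only shows that, for plain selection, chained indexing and multi-axis indexing return the same values.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame(
    {
        "Item": ["Dog Food", "Kitty Litter", "Bird Seed"],
        "Cost": [22.5, 2.5, 5.0],
    },
    index=["Store 1", "Store 1", "Store 2"],
)

# Chained indexing: select the rows first, then index the result again.
chained = df.loc["Store 1"]["Cost"]

# Multi-axis indexing: rows and column in a single .loc call.
multi_axis = df.loc["Store 1", "Cost"]

# For reading data, both approaches give the same Series.
print(chained.equals(multi_axis))
```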

I'm wondering whether it is always advisable to steer clear of chained indexing, or are there specific use cases where it shines?

Also, if it is true that it can return either a copy or a view (depending upon NumPy), what exactly does it depend on, and can I influence it to get the desired outcome?

I've found this answer that states:

When you use df['1']['a'], you are first accessing the series object s = df['1'], and then accessing the series element s['a'], resulting in two __getitem__ calls, both of which are heavily overloaded (handle a lot of scenarios, like slicing, boolean mask indexing, and so on).
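As I understand the quoted description, the chained form can be unrolled into two separate steps. The sketch below uses a made-up frame whose labels mirror the `'1'` and `'a'` in the quote; it is only meant to illustrate the two `__getitem__` calls, not to measure their cost.

```python
import pandas as pd

# Hypothetical frame with a column labelled "1" and a row labelled "a".
df = pd.DataFrame({"1": [10, 20]}, index=["a", "b"])

# df["1"]["a"] is equivalent to these two steps:
s = df["1"]      # first __getitem__: DataFrame -> Series
value = s["a"]   # second __getitem__: Series -> scalar

# The multi-axis form does the lookup in one call.
same = df.loc["a", "1"]

print(value == same)
```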

...which makes it seem chain indexing is always bad. Thoughts?

Glorius
  • No. Do not use chained indexing. IMHO, chained indexing is always bad. – Scott Boston Jun 26 '19 at 13:40
  • @ScottBoston Well, if you're right that's really disappointing :D Thank you for your insight. – Glorius Jun 26 '19 at 13:44
  • This is an opinion and I offer it as advice. This question should be closed, though. There is no convenient hard-and-fast rule to determine whether a resulting array will be a view or not. As a matter of fact, even using `loc` across both axes doesn't guarantee anything: `df = pd.DataFrame(1, [1, 2], ['A', 'B']); d = df.iloc[:, :1]; d.loc[:] = 2`. If you are terribly worried about views, use methods that produce copies. This comment isn't an appropriate place to describe all such methods. That said, you can use chained indexing when you need the values.... to be cont – piRSquared Jun 26 '19 at 13:47
  • To argue about the overhead of a `__getitem__` call when there are SOOO many other things going on under the hood seems ridiculous. `df['col_name']` is super cheap. `df['col_name']['col_element']` should be fine. Is it "best" practice? Maybe not. But seriously, it doesn't matter. Chained indexing comes in handy when you need to first slice by labels and then by position: `df['col_name'].iloc[3:10]`. Also, you shouldn't fear using it if you are looking to **USE** the values. If you are looking to **ASSIGN** to the positions, that's when you need to be careful. That's it for now (-: – piRSquared Jun 26 '19 at 13:50
