I am taking a Data Science course about data analysis in Python. At one point in the course the professor says:
You can chain operations together. For instance, we could have rewritten the query for all Store 1 costs as df.loc['Store 1']['Cost']. This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data though, this is an important distinction and can be a source of error.
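To make the quoted example concrete, here is a minimal sketch of the two access styles. The DataFrame contents are made up for illustration (the course's actual data may differ), and the behaviour of the chained assignment depends on the pandas version, so treat the comments as my understanding rather than a guarantee.

```python
import warnings

import pandas as pd

# Hypothetical purchase data; index labels repeat on purpose.
df = pd.DataFrame(
    {"Name": ["Chris", "Kevyn", "Vinod"], "Cost": [22.5, 2.5, 5.0]},
    index=["Store 1", "Store 1", "Store 2"],
)

# Reading: both forms select the same values.
chained = df.loc["Store 1"]["Cost"]   # two indexing steps
single = df.loc["Store 1", "Cost"]    # one multi-axis indexing step
print(chained.equals(single))

# Writing through a chain: the first step may hand back a copy, so the
# assignment can land on a temporary object instead of df. pandas usually
# warns about this; the warning is silenced here only to keep output short.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.loc["Store 1"]["Cost"] = 0.0
after_chained = df["Cost"].tolist()
print(after_chained)

# Writing in a single step: the assignment targets df itself.
df.loc["Store 1", "Cost"] = 0.0
after_single = df["Cost"].tolist()
print(after_single)
```

Comparing the two printed lists shows whether the chained write actually reached `df` on a given installation.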
Later on, he describes chain indexing as:
Generally bad, pandas could return a copy or a view depending upon NumPy
So, he suggests using multi-axis indexing (df.loc['a', '1']).
I'm wondering whether it is always advisable to steer clear of chain indexing, or whether there are specific use cases where it shines.
Also, if it is true that it can return either a copy or a view (depending upon NumPy), what exactly does that depend on, and can I influence it to get the desired outcome?
I've found this answer that states:
When you use df['1']['a'], you are first accessing the series object s = df['1'], and then accessing the series element s['a'], resulting in two __getitem__ calls, both of which are heavily overloaded (handle a lot of scenarios, like slicing, boolean mask indexing, and so on).
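As I understand that description, the chained form is equivalent to spelling out the intermediate Series by hand. A small sketch, using a made-up DataFrame whose labels match the quoted snippet:

```python
import pandas as pd

# Hypothetical frame with string column labels "1" and "2" and rows "a", "b".
df = pd.DataFrame({"1": [10, 20], "2": [30, 40]}, index=["a", "b"])

# Chained form, unrolled: two separate __getitem__ calls.
s = df["1"]               # first call: DataFrame -> Series
value_chained = s["a"]    # second call: Series -> scalar

# Single-step forms: row and column labels are passed together.
value_loc = df.loc["a", "1"]
value_at = df.at["a", "1"]   # scalar-only accessor

print(value_chained, value_loc, value_at)
```

For reading a single value all three give the same result; the difference is in how many indexing operations pandas has to dispatch and in what happens when the expression appears on the left-hand side of an assignment.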
This makes it seem that chain indexing is always bad. Thoughts?