In pandas you can replace the default integer-based index with an index made up of any number of columns using set_index().

What confuses me, though, is when you would want to do this. Regardless of whether the series is a column or part of the index, you can filter values in the series using boolean indexing for columns, or xs() for rows. You can sort on the columns or index using either sort_values() or sort_index().
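To make the comparison concrete, here is a minimal sketch of the two equivalent routes. The DataFrame and its column names (`city`, `year`, `sales`) are made up purely for illustration.

```python
import pandas as pd

# Hypothetical example data; the column names are invented for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Cairo"],
    "year": [2016, 2016, 2017, 2017],
    "sales": [10, 20, 30, 40],
})

# "city" kept as a column: filter with a boolean mask, sort with sort_values().
by_column = df[df["city"] == "Oslo"].sort_values("year")

# "city" moved into the index: select with xs(), sort with sort_index().
indexed = df.set_index("city")
by_index = indexed.xs("Oslo")
sorted_by_index = indexed.sort_index()
```

Both routes pick out the same two Oslo rows, which is exactly why the benefit of the index is not obvious at first glance.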

The only real difference I've encountered is that indexes have issues when there are duplicate values, so it seems that using an index is more restrictive, if anything.

Why, then, would I want to convert my columns into an index in pandas?

Migwell

2 Answers

In my opinion custom indexes are good for quickly selecting data.

They're also useful for aligning data: for mapping, for arithmetic operations where the index is used for alignment, for joining data, and for getting the minimal or maximal rows per group.
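A rough sketch of three of these uses, assuming small made-up tables (`prices`, `stock`, `orders`, `scores` are hypothetical names, not from any real dataset):

```python
import pandas as pd

# Hypothetical lookup tables, both indexed by "item".
prices = pd.DataFrame({"item": ["a", "b", "c"], "price": [1.0, 2.0, 3.0]}).set_index("item")
stock = pd.DataFrame({"item": ["c", "a"], "qty": [5, 7]}).set_index("item")

# Joining: DataFrame.join() matches rows on the index (left join by default),
# so "b" has no match in stock and gets NaN for qty.
joined = prices.join(stock)

# Mapping: Series.map() with a Series looks each value up in that Series' index.
orders = pd.Series(["a", "c", "a"])
order_prices = orders.map(prices["price"])

# Max row per group: idxmax() returns index labels, which .loc can then select.
scores = pd.DataFrame(
    {"group": ["x", "x", "y"], "score": [1, 5, 3]},
    index=["r1", "r2", "r3"],
)
best = scores.loc[scores.groupby("group")["score"].idxmax()]
```

In each case the index labels, not the row positions, decide which values are paired up.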

A DatetimeIndex is nice for partial string indexing and for resampling.
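For instance, a minimal sketch with a made-up daily series (the dates and values are arbitrary):

```python
import pandas as pd

# Hypothetical daily series covering 2017-01-01 through 2017-03-31.
idx = pd.date_range("2017-01-01", periods=90, freq="D")
ts = pd.Series(range(90), index=idx)

# Partial string indexing: a year-month string selects the whole month.
feb = ts.loc["2017-02"]

# Resampling: aggregate the daily values into month-start bins.
monthly = ts.resample("MS").sum()
```

Here `feb` holds the 28 February rows and `monthly` has one row per calendar month.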

But you are right, a duplicate index is problematic, especially for reindexing.
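A small sketch of the kind of trouble a duplicate index causes, using an invented three-element Series:

```python
import pandas as pd

# Hypothetical Series whose index contains the label "a" twice.
s = pd.Series([1, 2, 3], index=["a", "a", "b"])

# Reindexing is ambiguous when labels repeat, so pandas raises ValueError.
try:
    s.reindex(["a", "b", "c"])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# A label lookup no longer returns a single value, but every matching row.
dup_rows = s.loc["a"]

# is_unique is a quick way to check for this before relying on the index.
has_unique_index = s.index.is_unique
```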

Docs:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
  • Enables automatic and explicit data alignment
  • Allows intuitive getting and setting of subsets of the data set

You can also check Modern pandas - Indexes.

jezrael
  • Good call, citing the index documentation. I'd never even noticed that paragraph – Migwell Jun 20 '17 at 07:15
  • Thank you. I tried to add all the ways that are sometimes used, but I believe there are more situations where setting the index is necessary. – jezrael Jun 20 '17 at 07:18

As of 0.20.2, some methods, such as .unstack(), only work with indices.
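As a quick illustration, here is a sketch of reshaping from long to wide with .unstack(), assuming a made-up table with `city`, `year` and `sales` columns:

```python
import pandas as pd

# Hypothetical long-format data; the column names are invented for illustration.
long_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2016, 2017, 2016, 2017],
    "sales": [10, 30, 20, 40],
})

# unstack() pivots an index level into columns, so the columns to pivot on
# have to be moved into the index first.
wide = long_df.set_index(["city", "year"])["sales"].unstack("year")
```

The result has one row per city and one column per year.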

Custom indices, especially time-based ones, can be particularly convenient. Resampling and aggregating over arbitrary time intervals (the latter is done using .groupby() with pd.TimeGrouper(), which later pandas versions replace with pd.Grouper(freq=...)) both operate on a DatetimeIndex by default. With one in place, you can also call the .plot() method on a column, e.g. df['column'].plot(), and immediately get a time series plot.

The most useful feature, though, is alignment: for example, suppose you had two sets of data that you want to add; they're labeled consistently, but sorted in a different order. If you set their labels to be the index of their dataframes, you can simply add the dataframes together and not worry about the ordering of the data.
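A minimal sketch of that scenario, with two invented frames whose `label` column is in a different order in each:

```python
import pandas as pd

# Hypothetical data: same labels, different row order.
a = pd.DataFrame({"label": ["x", "y", "z"], "value": [1, 2, 3]}).set_index("label")
b = pd.DataFrame({"label": ["z", "x", "y"], "value": [30, 10, 20]}).set_index("label")

# Adding the frames pairs rows by index label, not by position.
total = a + b

# For contrast: adding the raw arrays pairs rows by position, which mixes labels.
positional = a["value"].to_numpy() + b["value"].to_numpy()
```

With the labels in the index, `total` gives 11, 22 and 33 for x, y and z, whereas the positional sum adds x to z, y to x, and so on.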

Ken Wei