
Coming from an R background, I find the (very heavy) use of Index objects in pandas a little disconcerting. For example, if train is a pandas DataFrame, is there some special reason why train.columns should return an Index rather than a list? What additional purpose is served by it being an Index object? According to the definition of pandas.Index, it is the basic object storing axis labels for all pandas objects. While train.index.values does return the row labels (axis=0), how can I get the column labels or column names from a pandas.Index? Unlike in an earlier question, here I have a specific example in mind.

Ashok K Harnal

2 Answers


A pd.Index is an array-like container of the column names, so it doesn't quite make sense to ask how to get the labels from the index: the index is the labels.

That said, you can always get the underlying NumPy array with df.columns.values, or convert to a Python list with tolist() as @Mitch showed.
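As a minimal sketch of those conversions (the three-column frame here is hypothetical and only for illustration), the same labels can be pulled out as an Index, a NumPy array, or a plain list:

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for illustration.
df = pd.DataFrame(np.zeros((2, 3)), columns=["a", "b", "c"])

labels_index = df.columns           # the pandas Index itself
labels_array = df.columns.values    # underlying NumPy array of labels
labels_list = df.columns.tolist()   # plain Python list
labels_list_alt = list(df)          # iterating a DataFrame yields its column labels
```

Which form to use is mostly a matter of what the downstream code expects; the Index itself already supports iteration, slicing, and membership tests.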

As for why an Index is used rather than a bare array: an Index provides extra functionality and performance used throughout pandas, the core of which is hash-table-based lookup.

For example, consider the following frame and its columns.

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 10),
                  columns=list('abcdefghkm'))

cols = df.columns

cols
Out[16]: Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'm'], dtype='object')

Now say you want to select column 'h' out of the frame. With a list or array version of the columns, you would have to loop over the columns to find the position of 'h', which is O(n) in the number of columns, something like this:

# Linear scan: compare each label in turn until 'h' is found.
for i, col in enumerate(cols):
    if col == 'h':
        found_loc = i
        break

found_loc
Out[18]: 7

df.values[:, found_loc]
Out[19]: 
array([-0.62916208,  2.04403495,  0.29498066,  1.07939374, -1.49619915,
       -0.54592646, -1.04382192, -0.45934113, -1.02935858,  1.62439231])

df['h']
Out[20]: 
0   -0.629162
1    2.044035
2    0.294981
3    1.079394
4   -1.496199
5   -0.545926
6   -1.043822
7   -0.459341
8   -1.029359
9    1.624392
Name: h, dtype: float64

With the Index, pandas constructs a hash table of the column labels, so finding the location of 'h' is an O(1) operation on average. That is generally much faster than a linear scan, especially when the frame has many columns.

df.columns.get_loc('h')
Out[21]: 7
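
To make the comparison concrete, here is a small sketch, assuming a hypothetical Index of 100,000 generated labels (the names col_0 ... col_99999 are made up for illustration). Both approaches return the same position; only the amount of work differs:

```python
import pandas as pd

# Hypothetical wide set of column labels.
cols = pd.Index([f"col_{i}" for i in range(100_000)])
cols_list = cols.tolist()

target = "col_99999"

pos_scan = cols_list.index(target)  # walks the list until it finds a match
pos_hash = cols.get_loc(target)     # looks the label up via the Index's hash table

has_target = target in cols         # membership test on the Index
```

Timing the two lookups (for instance with timeit) should show the gap growing with the number of labels, though the exact numbers depend on the machine and the pandas version.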

This example selects only a single column, but as @ayhan notes in the comments, the same hash table structure also speeds up many other operations, such as merging, alignment, filtering, and grouping.
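
Alignment is probably the easiest of these to see in a few lines. The sketch below uses two hypothetical Series whose labels are in a different order; arithmetic matches values by label rather than by position, and Index objects also support set-like operations:

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "a", "b"])

# Addition matches values by label, not by position.
total = s1 + s2

# Index objects also support set-like operations on labels.
shared = pd.Index(["a", "b", "c"]).intersection(pd.Index(["b", "c", "d"]))
```

Here total["a"] is 21 (1 + 20), not 11, because 'a' in s1 is paired with 'a' in s2.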

chrisb

From the documentation for pandas.Index:

Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects

Having a regular list as an index for a DataFrame could cause issues with unorderable or unhashable objects, evidently. Since the Index is backed by a hash table, the same principles apply as to why lists can't be dictionary keys in regular Python.
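
A minimal sketch of that principle, using plain Python first and then a small hypothetical Index: a mutable list cannot be hashed, a tuple can, and an Index refuses in-place modification so that its hash table cannot go stale.

```python
import pandas as pd

# A list is mutable and therefore unhashable, so it cannot be a dict key.
try:
    {["a", "b"]: 1}
    list_key_ok = True
except TypeError:
    list_key_ok = False

# A tuple is immutable and hashable, so it works as a key.
tuple_key_ok = {("a", "b"): 1}[("a", "b")] == 1

# An Index is immutable too: assigning to an element raises TypeError.
idx = pd.Index(["a", "b", "c"])
try:
    idx[0] = "z"
    index_mutable = True
except TypeError:
    index_mutable = False
```

After running this, list_key_ok and index_mutable are both False, while tuple_key_ok is True.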

At the same time, because the Index object is explicit, we can use different types of labels as an Index (unlike NumPy's implicit integer positions, for instance) and still perform fast lookups.
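
As a rough illustration (the Series and their values below are hypothetical), the same lookup syntax works whether the labels are strings or timestamps:

```python
import pandas as pd

# Hypothetical Series labelled by strings and by dates.
by_name = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
by_date = pd.Series([1.0, 2.0, 3.0],
                    index=pd.date_range("2017-09-14", periods=3))

val_name = by_name.loc["y"]                         # lookup by string label
val_date = by_date.loc[pd.Timestamp("2017-09-15")]  # lookup by timestamp label

# pandas picks a specialised Index subclass for the date labels.
date_index_type = type(by_date.index).__name__
```

Both lookups return 2.0, and date_index_type is 'DatetimeIndex'.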

If you want to retrieve a list of column names, the Index object has a tolist method.

>>> df.columns.tolist()
['a', 'b', 'c']
miradulo
  • I would be grateful if you could expand on the statement 'Having a regular list as an index for a DataFrame could cause issues with unorderable or unhashable objects, evidently.' (Perhaps with an example.) Thanks. – Ashok K Harnal Sep 14 '17 at 14:14
  • @user3282777 An index is like a mapping to the DataFrame columns, sort of like a Python dict. So the same principles apply as for why you can't have mutable types as dict keys in regular Python, which the [Python wiki](https://wiki.python.org/moin/DictionaryKeys) has a useful bit on. – miradulo Sep 14 '17 at 14:19