0

This is just to share a very basic concept for beginners of multiindex dataframes.

I noticed empty items in the index column of a 2-index (multiindex) df. Though this must be the basics of multiindex dataframes, I was not familiar with it and had forgotten about it. I did not quickly notice the possible sense of this because I had very large numbers as index values where you do not even start to check out their significance. Sorting with df.sort_index(inplace=True) did not help to get rid of empty items either. It seemed to me at first sight that the dataset itself had partly empty lines for the first index. Searching for "empty items of a multiindex" did not help either. That is why I want to share this very simple problem with other beginners of multiindex dataframes.

Here are the "empty items" in index column 'A_idx':

A_idx B_idx
12344 12345   0.289163 -0.464633 -0.060487
      12345   0.224442  0.177609  2.156436
12346 12346  -0.262329 -0.248384  0.925580
12347 12347   0.051350  0.452014  0.206809
      12348   2.757255 -0.739196  0.183735
      12349  -0.064909 -0.963130  1.364771
12350 12351  -1.330857  1.881588 -0.262170
questionto42
  • 7,175
  • 4
  • 57
  • 90
  • This is just a display feature to make it visually easier to separate groups. If you sort by one of your columns you'll quickly see that the "empty" indices will fill themselves in as long as they are no longer consecutively repeating – fantabolous Apr 27 '23 at 06:51
  • @fantabolous yes, right, just as shared in the answer. – questionto42 Apr 27 '23 at 15:38

1 Answers1

2

This is just to share a very basic concept for beginners of multiindex dataframes.

The "empty" items are part of the multiindex view and only appear when you output a df, it is helping you to understand the hierarchy. If you output the isolated Multiindex class, no item will be empty. Thus, the index items are never really empty, and the "empty" fields only appear for df ouputs:

  • If the "A_idx" index is assigned to more than one "B_idx" index value, the "A_idx" index is not repeated, because it is the parent.
  • If "A_idx" index points to more than one value row while "B_idx" index is repeating, B_idx is still repeated, because it is the child.

If you take the df.head(10) and find out that the "empty" index item is in line 1, you can also check this quickly in your df using df.iloc[1].reset_index(). You will see that the index item is not empty.

In the following, "first" and "second" are index names seemingly with equal rights to be either parent as they are on the same output line, but in reality the hierarchy goes from left to right.

first second
bar   one     0.289163 -0.464633 -0.060487
      two     0.224442  0.177609  2.156436
baz   one    -0.262329 -0.248384  0.925580
foo   one     0.051350  0.452014  0.206809
      two     2.757255 -0.739196  0.183735
      three  -0.064909 -0.963130  1.364771
qux   one    -1.330857  1.881588 -0.262170

Thanks for the example go to Access last elements of inner multiindex level in pandas dataframe.

This actually means:

first second
bar   one     0.289163 -0.464633 -0.060487
bar   two     0.224442  0.177609  2.156436
baz   one    -0.262329 -0.248384  0.925580
foo   one     0.051350  0.452014  0.206809
foo   two     2.757255 -0.739196  0.183735
foo   three  -0.064909 -0.963130  1.364771
qux   one    -1.330857  1.881588 -0.262170

####

Example of how to create a hierarchy accordingly.

The order of the column list that is passed to the set_index() creates the hierarchy in the same order.

You can check this out in a small example I borrowed from pandas multiindex reindex by rows, with df2 covering a switch of the two indices. Only df shows the secret "empty items", see df vs. df2 output:

df = pd.DataFrame({'month': [1, 4, 7, 10],
                'year': [2012, 2012, 2013, 2013],
                'sale': [55, 40, 84, 31]})
df2 = df.copy()

df=df.set_index(['year','month'])
df2=df2.set_index(['month','year'])

df:

            sale
year month      
2012 1        55
     4        40
2013 7        84
     10       31

df2:

   month  year  sale
0      1  2012    55
1      4  2012    40
2      7  2013    84
3     10  2013    31

df.index

Output:

MultiIndex([(2012,  1),
            (2012,  4),
            (2013,  7),
            (2013, 10)],
           names=['year', 'month'])

Or:

df2.index

Output:

MultiIndex([( 1, 2012),
            ( 4, 2012),
            ( 7, 2013),
            (10, 2013)],
           names=['month', 'year'])

Have a look at the levels in the df:

df.index.levels[0]

Int64Index([2012, 2013], dtype='int64', name='year')

df.index.levels[1]

Int64Index([1, 4, 7, 10], dtype='int64', name='month')

df2.index.levels[0]

Int64Index([1, 4, 7, 10], dtype='int64', name='month')

df2.index.levels[1]

Int64Index([2012, 2013], dtype='int64', name='year')

If you want to check or clarify the different levels of the hierarchy in the output view, choose one row and reset the index:

df.iloc[1].reset_index()

Output:

  index 2012
           4
0  sale   40

Or:

df2.iloc[1].reset_index()

Output:

  index    4
        2012
0  sale   40
questionto42
  • 7,175
  • 4
  • 57
  • 90
  • The same seemingly "empty" (but simply repeated) items can appear after groupby statement, see for example https://stackoverflow.com/questions/45235992/what-are-levels-in-a-pandas-dataframe. – questionto42 Aug 28 '20 at 12:53