
I would like to create an empty DataFrame with a MultiIndex before assigning rows to it. I have already found that empty DataFrames don't like having a MultiIndex assigned on the fly, so I'm setting the MultiIndex names during creation. However, I don't want to assign levels yet, as that will be done later. This is the best code I have come up with so far:

from pandas import DataFrame, MultiIndex

def empty_multiindex(names):
    """
    Creates empty MultiIndex from a list of level names.
    """
    return MultiIndex.from_tuples(tuples=[(None,) * len(names)], names=names)

Which gives me

In [2]:

empty_multiindex(['one','two', 'three'])

Out[2]:

MultiIndex(levels=[[], [], []],
           labels=[[-1, -1, -1], [-1, -1, -1], [-1, -1, -1]],
           names=[u'one', u'two', u'three'])

and

In [3]:
DataFrame(index=empty_multiindex(['one','two', 'three']))

Out[3]:
one two three
NaN NaN NaN

Well, I have no use for these NaNs. I can easily drop them later, but this is obviously a hackish solution. Does anyone have a better one?

dmvianna
  • Why do you want to do this? – Andy Hayden Feb 03 '15 at 06:22
  • @AndyHayden I'm trying to write a general enough function to handle arbitrary numbers of names. My assignment is to create frequency tables with very arbitrary and whimsical totals and subtotals and subsubtotals that can be folded and unfolded in a dashboard. Creating dataframes before passing them to Django makes my life easier. – dmvianna Feb 03 '15 at 06:29
  • Why do this as a MultiIndex rather than as columns? Generally pandas is pretty bad at updating on a row-by-row basis (as it has to copy the entirety of the data each time). Could you make it a MultiIndex later (after construction)? – Andy Hayden Feb 03 '15 at 06:35
  • @AndyHayden it is more convenient and readable to create labels by assignment (`df2.loc[(name, key2, True), :] = df1.loc[(key1, key2), :].sum()`) than to torture a `Series` before assignment by appending to it. And maintaining parallel DataFrames for Indexes and data would be even worse. – dmvianna Feb 03 '15 at 23:02
  • I think I would argue that a DataFrame may not be the right data structure to use in this case. – Andy Hayden Feb 03 '15 at 23:13
  • @AndyHayden I'm listening to suggestions. – dmvianna Feb 03 '15 at 23:14
  • Well, without knowing the precise specs it's hard to give the best solution. Have you tried just using a dictionary? – Andy Hayden Feb 03 '15 at 23:39
  • @AndyHayden A dict won't give me pandas DataFrame indexing and methods such as sum() that I can combine with indexing. I agree that there could be a better solution (such as creating an object from scratch that does what I want). But at this point I'm optimising for developer time rather than processing time. – dmvianna Feb 05 '15 at 02:13
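
The suggestion in the comments to build the MultiIndex after construction can be sketched roughly as follows. This is only an illustration under assumptions: the `rows` list and its values are hypothetical stand-ins for whatever totals and subtotals are actually computed, and the column names are borrowed from the answers below.

```python
import pandas as pd

# Hypothetical rows: each dict holds the three index keys plus the data columns.
rows = [
    {"one": "apple", "two": "banana", "three": "cherry", "alpha": 0.1, "beta": 0.2},
    {"one": "apple", "two": "banana", "three": "date", "alpha": 0.3, "beta": 0.4},
]

# Build the frame in one go, then promote the key columns to a MultiIndex.
df = pd.DataFrame(rows).set_index(["one", "two", "three"])
```

Collecting rows in a plain list and converting once avoids growing the DataFrame row by row, at the cost of losing label-based assignment while the rows are being gathered.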

4 Answers


The solution is to leave the labels (called codes in current pandas) empty. This works fine for me:

>>> import pandas as pd
>>> my_index = pd.MultiIndex(levels=[[],[],[]],
...                          codes=[[],[],[]],
...                          names=[u'one', u'two', u'three'])
>>> my_index
MultiIndex([], names=['one', 'two', 'three'])
>>> my_columns = [u'alpha', u'beta']
>>> df = pd.DataFrame(index=my_index, columns=my_columns)
>>> df
Empty DataFrame
Columns: [alpha, beta]
Index: []
>>> df.loc[('apple','banana','cherry'),:] = [0.1, 0.2]
>>> df
                    alpha beta
one   two    three
apple banana cherry   0.1  0.2

For pandas versions before 0.24, use the keyword labels in place of codes (the argument was renamed in 0.24.0).
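
Since the question asks for a function that handles an arbitrary number of names, the same idea could be wrapped as below. This is a minimal sketch rather than a definitive implementation: it reuses the function name `empty_multiindex` from the question and assumes a pandas version that accepts the `codes` keyword.

```python
import pandas as pd

def empty_multiindex(names):
    """Return a zero-length MultiIndex with one level per name."""
    # One empty list per level, for both the levels and the codes.
    levels = [[] for _ in names]
    codes = [[] for _ in names]
    return pd.MultiIndex(levels=levels, codes=codes, names=names)

idx = empty_multiindex(["one", "two", "three"])
df = pd.DataFrame(index=idx, columns=["alpha", "beta"])
```

Rows can then be assigned with `df.loc[...]` exactly as in the session above.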

RoG

Another, perhaps simpler, solution is to use the set_index method:

>>> import pandas as pd
>>> df = pd.DataFrame(columns=['one', 'two', 'three', 'alpha', 'beta'])
>>> df = df.set_index(['one', 'two', 'three'])
>>> df
Empty DataFrame
Columns: [alpha, beta]
Index: []
>>> df.loc[('apple','banana','cherry'),:] = [0.1, 0.2]
>>> df
                    alpha beta
one   two    three            
apple banana cherry   0.1  0.2
Jean Paul

Using pd.MultiIndex.from_tuples may be more straightforward.

import pandas as pd
ind = pd.MultiIndex.from_tuples([], names=(u'one', u'two', u'three'))
df = pd.DataFrame(columns=['alpha', 'beta'], index=ind)
df.loc[('apple','banana','cherry'), :] = [4, 3]
df

                      alpha beta
one     two     three       
apple   banana  cherry    4    3
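
One detail worth spelling out: with an empty list of tuples, the `names` argument is what tells pandas how many levels to create. The sketch below illustrates this; the `inferred` flag is a hypothetical name used only for the demonstration.

```python
import pandas as pd

names = ["one", "two", "three"]

# With names given, the number of levels is taken from len(names).
ind = pd.MultiIndex.from_tuples([], names=names)

# Without names there is nothing to infer the level count from,
# so pandas raises a TypeError.
try:
    pd.MultiIndex.from_tuples([])
    inferred = True
except TypeError:
    inferred = False
```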
ronkov

Using pd.MultiIndex.from_arrays allows for a slightly more concise solution when defining the index explicitly:

import pandas as pd
ind = pd.MultiIndex.from_arrays([[]] * 3, names=(u'one', u'two', u'three'))
df = pd.DataFrame(columns=['alpha', 'beta'], index=ind)
df.loc[('apple','banana','cherry'), :] = [4, 3]
df

                     alpha  beta
one   two    three              
apple banana cherry      4     3
mcsoini