4

In Pandas, I've been using custom objects as column labels because they provide rich/flexible functionality for info/methods specific to the column. For example, you can set a custom fmt_fn to format each column (note this is just an example, my actual column label objects are more complex):

In [100]: class Col:
     ...:     def __init__(self, name, fmt_fn):
     ...:         self.name = name
     ...:         self.fmt_fn = fmt_fn
     ...:     def __str__(self):
     ...:         return self.name
     ...:     

In [101]: sec_col = Col('time', lambda val: str(timedelta(seconds=val)).split('.')[0])

In [102]: dollar_col = Col('money', lambda val: '${:.2f}'.format(val))

In [103]: foo = pd.DataFrame(np.random.random((3, 2)) * 1000, columns = [sec_col, dollar_col])

In [104]: print(foo)  # ugly
         time       money
0  773.181402  720.997051
1   33.779925  317.957813
2  590.750129  416.293245

In [105]: print(foo.to_string(formatters = [col.fmt_fn for col in foo.columns]))  # pretty
     time   money
0 0:12:53 $721.00
1 0:00:33 $317.96
2 0:09:50 $416.29

Okay, so I've been happily doing this for a while, but then I recently came across one part of Pandas that doesn't support this. Specifically, methods to_hdf/read_hdf will fail on DataFrames with custom column labels. This is not a dealbreaker for me. I can use pickle instead of HDF5 at the loss of some efficiency.

But the bigger question is, does Pandas in general support custom objects as column labels? In other words, should I continue to use Pandas this way, or will this break in other parts of Pandas (besides HDF5) in the future, causing me future pain?

PS. As a side note, I wouldn't mind if you also chime in on how you solve the problem of column-specific info such as the fmt_fn in the example above, if you're not currently using custom objects as column labels.

aiai
  • Interesting question, as I've never seen objects passed as columns in a DataFrame. I would recommend against this usage. If you need the flexibility, you can keep a dictionary of column names and underlying objects. – Alexander Sep 01 '15 at 21:18
  • It would be bad design (IMO) to maintain a separate data structure per DataFrame that's parallel to `foo.columns` rather than simply put the column-specific data into `foo.columns`. I would only do so if necessary, i.e. if Pandas really does not support custom objects as column labels. Hence I posted this question. – aiai Sep 01 '15 at 22:44
  • The columns of a dataframe are just an Index. It appears that the only requirement is that the objects are hashable. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html – Alexander Sep 01 '15 at 22:50
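The dictionary approach suggested in the comments above can be sketched as follows. This is only an illustration, assuming plain string column labels and a hypothetical `formatters` dict kept alongside the DataFrame; the sample values are made up for the example.

```python
from datetime import timedelta

import pandas as pd

# Plain string labels; per-column metadata lives in a separate dict
# keyed by column name (hypothetical layout, not from the question).
df = pd.DataFrame({"time": [773.18, 33.78], "money": [720.997, 317.958]})

formatters = {
    "time": lambda val: str(timedelta(seconds=val)).split(".")[0],
    "money": "${:.2f}".format,
}

# to_string accepts a dict mapping column names to one-argument functions.
text = df.to_string(formatters=formatters)
print(text)
```

The trade-off raised in the comments still applies: the dict has to be kept in sync with the DataFrame's columns by hand.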

2 Answers

2

Fine-grained control of formatting of a DataFrame isn't really possible right now. E.g., see here or here for some discussion of possibilities. I'm sure a well thought out API (and PR!) would be well received.

In terms of using custom objects as columns, the two biggest issues are probably serialization, and indexing semantics (e.g. can no longer do df['time']).
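The indexing point can be illustrated with a minimal sketch, assuming a label class like the `Col` from the question that defines neither `__eq__` nor `__hash__` and therefore compares by identity. The formatter arguments here are placeholders.

```python
import numpy as np
import pandas as pd


class Col:
    """Column label as in the question: printed by name, compared by identity."""

    def __init__(self, name, fmt_fn):
        self.name = name
        self.fmt_fn = fmt_fn

    def __str__(self):
        return self.name


sec_col = Col("time", str)
dollar_col = Col("money", "${:.2f}".format)
df = pd.DataFrame(np.ones((3, 2)), columns=[sec_col, dollar_col])

# Lookup by the label object itself works.
by_object = df[sec_col]

# Lookup by the printed name fails: the string 'time' is not equal to sec_col.
try:
    df["time"]
    found_by_name = True
except KeyError:
    found_by_name = False

print(found_by_name)
```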

One possible workaround would be to wrap your DataFrame in some kind of pretty-print structure, like this:

In [174]: class PrettyDF(object):
     ...:     def __init__(self, data, formatters):
     ...:         self.data = data
     ...:         self.formatters = formatters
     ...:     def __str__(self):
     ...:         return self.data.to_string(formatters=self.formatters)
     ...:     def __repr__(self):
     ...:         return self.__str__()


In [172]: foo = PrettyDF(df, 
                        formatters={'money': '${:.2f}'.format, 
                                    'time': lambda val: str(timedelta(seconds=val)).split('.')[0]})


In [178]: foo
Out[178]: 
     time   money
0 0:13:17 $399.29
1 0:08:48 $122.44
2 0:07:42 $491.72

In [180]: foo.data['time']
Out[180]: 
0    797.699511
1    528.155876
2    462.999224
Name: time, dtype: float64
chrisb
  • As I noted in my question post, the `fmt_fn` is just for example of column-specific data. My actual column label objects are much more complex, providing much richer functionality than output formatting. – aiai Sep 02 '15 at 03:31
  • As far as the "two biggest issues" you listed: (a) serialization is the one issue that I did run into, that prompted me to write this question -- hopefully I can handle it with pickle. (b) "can no longer do `df['time']`" would not be considered an issue in my book because `'time'` is not the column label object (merely the printed representation of it) -- the correct code is `df[sec_col]` and that works correctly as expected. Given your comments and Alexander's comment above, I think my current conclusion is that it's safe to continue using custom objects in column labels. Thanks! – aiai Sep 02 '15 at 03:40
  • 1
    I tried the same thing with `__str__`, but it does not work for MultiIndexed dataframes. Do you have solution for that as well? https://stackoverflow.com/questions/49563981/pandas-custom-class-as-column-header-with-multi-indexing – Nima Mousavi Mar 29 '18 at 20:04
0

It's been five years since this was posted, so I hope this is still helpful to someone. I've managed to build an object that holds metadata for a pandas DataFrame column while still being accessible as a regular column (or so it seems to me). The code below is just the part of the whole class that involves this.

`__repr__` presents the object's name, rather than the object itself, when the DataFrame is printed.

`__eq__` checks the requested name against the names of the available objects. `__hash__` is also used in this process: column names need to be hashable because lookup works similarly to a dictionary.

That's probably not the Pythonic way of describing it, but that seems to me to be how it works.

    class ColumnDescriptor:
        def __init__(self, name, **kwargs):
            self.name = name
            for n, v in kwargs.items():
                setattr(self, n, v)
    
        def __repr__(self): return self.name
        def __str__(self): return self.name
        def __eq__(self, other): return self.name == other
        def __hash__(self): return hash(self.name)
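
A short usage sketch of the class above, repeated here so the block is self-contained. The `unit` keyword and the sample values are hypothetical and only for illustration; the idea is that both the plain name and the descriptor object can be used for lookup, because they hash and compare equal.

```python
import pandas as pd


class ColumnDescriptor:
    def __init__(self, name, **kwargs):
        self.name = name
        for attr, value in kwargs.items():
            setattr(self, attr, value)

    def __repr__(self):
        return self.name

    def __str__(self):
        return self.name

    def __eq__(self, other):
        # Compares equal to a plain string with the same name.
        return self.name == other

    def __hash__(self):
        # Same hash as the name, so dict-like lookup by name can find it.
        return hash(self.name)


time_col = ColumnDescriptor("time", unit="s")      # 'unit' is illustrative metadata
money_col = ColumnDescriptor("money", unit="USD")
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=[time_col, money_col])

by_name = df["time"]        # lookup by plain string
by_object = df[time_col]    # lookup by the descriptor itself
unit = df.columns[0].unit   # metadata travels with the column label
```

One design consequence worth noting: two descriptors with the same name but different metadata are treated as the same column label.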