
I would like to use in Python something akin to -- or better than -- R arrays. R arrays are tensor-like objects with a dimnames attribute, which makes it straightforward to subset tensors based on names (strings). In numpy, recarrays allow for column names, and pandas allows flexible and efficient subsetting of 2-dimensional arrays. Is there something in Python that allows similar slicing and subsetting of ndarrays by name (or better, by objects that are hashable and immutable in Python)?

gappy
  • is this what you are looking for: [pandas Panel](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html) – elyase Apr 05 '14 at 17:03
  • No. pandas.panel simply returns a long representation of a wide matrix (or in reshape-like jargon, it melts it). I am looking for a genuine tensor object with named axis labels. – gappy Apr 05 '14 at 17:10

1 Answer

How about this quick and dirty mapping from lists of strings to indices? You could clean up the notation with callable classes.

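import numpy as np

def make_dimnames(names):
    # one {name: index} dictionary per axis
    return [{n:i for i,n in enumerate(name)} for name in names]
def foo(d, *args):
    # look up the integer index of each name
    return [d[x] for x in args]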

A = np.arange(9).reshape(3,3)
dimnames = [('x','y','z'),('a','b','c')]
Adims = make_dimnames(dimnames)
A[foo(Adims[0],'x','z'),foo(Adims[1],'b')]  # A[[0,2],[1]]
A[foo(Adims[0],'x','z'),slice(*foo(Adims[1],'b','c'))]  # A[[0,2],slice(1,2)]

Or does R do something more significant with the dimnames?

A class compresses the syntax a bit:

class bar(object):
    def __init__(self,dimnames):
        self.dd = {n:i for i,n in enumerate(dimnames)}
    def __call__(self,*args):
        return [self.dd[x] for x in args]
    def __getitem__(self,key):
        return self.dd[key]
d0, d1 = bar(['x','y','z']), bar(['a','b','c'])
A[d0('x','z'),slice(*d1('a','c'))]

http://docs.scipy.org/doc/numpy/user/basics.subclassing.html covers subclassing ndarray, with a simple example of adding an attribute (which could be dimnames). Presumably extending the indexing to use that attribute shouldn't be hard.
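
As a minimal sketch of that subclassing idea, following the `__new__` plus `__array_finalize__` pattern from the NumPy subclassing guide, the class below attaches a `dimnames` attribute to an array. The class name `NamedArray` is hypothetical, and this version only stores the names; it does not change indexing.

```python
import numpy as np

class NamedArray(np.ndarray):
    """Hypothetical ndarray subclass that carries a dimnames attribute."""

    def __new__(cls, input_array, dimnames=None):
        # View the input data as an instance of this subclass.
        obj = np.asarray(input_array).view(cls)
        obj.dimnames = dimnames
        return obj

    def __array_finalize__(self, obj):
        # Called for views and slices as well as explicit construction;
        # copy the attribute from the parent array if there is one.
        if obj is None:
            return
        self.dimnames = getattr(obj, 'dimnames', None)

A = NamedArray(np.arange(9).reshape(3, 3),
               dimnames=[('x', 'y', 'z'), ('a', 'b', 'c')])
print(A.dimnames[0])
```

Note that the names are copied unchanged to slices and views, so after subsetting they no longer match the data; a fuller version would have to update them.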

Inspired by the use of __getitem__ in numpy/index_tricks, I've generalized the indexing:

class DimNames(object):
    def __init__(self, dimnames):
        self.dd = [{n:i for i,n in enumerate(names)} for names in dimnames]
    def __getitem__(self,key):
        if isinstance(key, tuple):
            # one entry per axis; translate each with that axis's dictionary
            return tuple(self.parse_key(k, self.dd[i]) for i,k in enumerate(key))
        else:
            return self.parse_key(key, self.dd[0])
    def parse_key(self,key, dd):
        if key is None:
            return key
        if isinstance(key,int):
            return key
        if isinstance(key,str):
            return dd[key]
        if isinstance(key,tuple):
            return tuple([self.parse_key(k, dd) for k in key])
        if isinstance(key,list):
            return [self.parse_key(k, dd) for k in key]
        if isinstance(key,slice):
            return slice(self.parse_key(key.start, dd),
                         self.parse_key(key.stop, dd),
                         self.parse_key(key.step, dd))
        raise KeyError(key)

dd = DimNames([['x','y','z'], ['a','b','c']])

print(A[dd['x']])              # A[0]
print(A[dd['x','c']])          # A[0,2]
print(A[dd['x':'z':2]])        # A[0:2:2]
print(A[dd[['x','z'],:]])      # A[[0,2],:]
print(A[dd[['x','y'],'b':]])   # A[[0,1], 1:]
print(A[dd[:'z', :2]])         # A[:2,:2]

I suppose further steps would be to subclass ndarray, add dd as an attribute, and override its __getitem__, simplifying the notation to A[['x','z'],'b':].
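
A rough sketch of that last step, assuming a hypothetical subclass called `DimNamedArray` with a helper `_parse` (neither is part of numpy): it stores one name-to-index dictionary per axis and translates strings before handing the key to the normal ndarray indexing. It does not handle `Ellipsis` or `np.newaxis` correctly, since it assumes the i-th entry of the key refers to the i-th axis.

```python
import numpy as np

class DimNamedArray(np.ndarray):
    """Hypothetical ndarray subclass that accepts names inside A[...]."""

    def __new__(cls, input_array, dimnames):
        obj = np.asarray(input_array).view(cls)
        # One {name: position} dictionary per axis.
        obj.dd = [{n: i for i, n in enumerate(names)} for names in dimnames]
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.dd = getattr(obj, 'dd', None)

    def _parse(self, key, axis):
        # Translate names on one axis to integer positions.
        if isinstance(key, str):
            return self.dd[axis][key]
        if isinstance(key, list):
            return [self._parse(k, axis) for k in key]
        if isinstance(key, slice):
            return slice(self._parse(key.start, axis),
                         self._parse(key.stop, axis),
                         key.step)
        return key  # ints, None and arrays pass through unchanged

    def __getitem__(self, key):
        if isinstance(key, tuple):
            key = tuple(self._parse(k, i) for i, k in enumerate(key))
        else:
            key = self._parse(key, 0)
        # Return a plain ndarray: the stored names would not line up
        # with the subset, so they are dropped rather than kept stale.
        return np.asarray(self)[key]

A = DimNamedArray(np.arange(9).reshape(3, 3),
                  [['x', 'y', 'z'], ['a', 'b', 'c']])
print(A[['x', 'z'], 'b':])   # same as indexing with [[0, 2], 1:]
print(A['x', 'c'])           # same as indexing with [0, 2]
```

Returning a plain ndarray sidesteps the question of how the names should follow the data through subsetting and stacking, which is the part R handles automatically and this sketch does not.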

hpaulj
  • R doesn't do more than this when it comes to subsetting. In fact your second example is doing something R can't do. But for most examples R's syntax is cleaner (which I am not able to reproduce in Python -- my bad), and there is no need to carry an Adims object; i.e., A[['x','z'], 'b'], or A[['x','z'], ['b']]. My second concern is that when stacking tensors, R takes care of attributes, and this approach wouldn't. My third concern is that I am against reinventing the wheel, and would rather use something already well established. – gappy Apr 05 '14 at 21:50
  • I've added more power to `__getitem__`, so it handles slices, tuples, etc directly. It tries to duplicate the existing `numpy` getitem, with the addition of the dictionary lookup in case of strings. – hpaulj Apr 06 '14 at 22:52