How do I implement the syntax for filtering dataframes in Pandas? (df[df.column1 > someValue])

I am trying to write a class that supports the same syntax as Pandas for filtering dataframes.

Given a dataframe created with df = DataFrame(someData), how do I replicate this syntax:

df[df.column1 > someValue]

I implemented the methods __getattr__ and __getitem__ to support the following syntaxes:

df.column1 
df['column1']

But I don't know how to link the two together. I also could not find the corresponding function in the Pandas source code.

Either an implementation of this or a reference to the relevant function in Pandas would be of great help.

Edit (Solution):

Following the hints in the answers, I implemented the __getitem__ method as follows:

from itertools import compress

def __getitem__(self, name):
    """Get items with [ and ]."""
    # A string key selects a single column.
    if isinstance(name, str):
        return self.data[name]

    # A list of booleans is a row mask: return the filtered dataframe.
    elif isinstance(name, list):
        # Indices of the rows whose mask entry is True.
        ind = list(compress(range(len(name)), name))
        temp = DataFrame([[self.data[c].values[i]
                           for i in ind]
                          for c in self.columns],
                         columns=self.columns)
        return temp

Note that I also had to implement the comparison methods for my column class (Series). The full code can be seen here.
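Since the linked code is not reproduced here, the following is only a minimal sketch of what such comparison methods might look like. The Series class, its values attribute, and the _compare helper are assumptions made for illustration, not the asker's actual implementation. Each operator returns a plain list of booleans, which is the mask type the __getitem__ above checks for.

```python
import operator


class Series:
    """Hypothetical column class; the asker's real Series is not shown."""

    def __init__(self, values):
        self.values = list(values)

    def _compare(self, op, other):
        # Apply the comparison element-wise and return a list of booleans.
        return [op(v, other) for v in self.values]

    def __gt__(self, other):
        return self._compare(operator.gt, other)

    def __ge__(self, other):
        return self._compare(operator.ge, other)

    def __lt__(self, other):
        return self._compare(operator.lt, other)

    def __le__(self, other):
        return self._compare(operator.le, other)

    def __eq__(self, other):
        return self._compare(operator.eq, other)

    def __ne__(self, other):
        return self._compare(operator.ne, other)

    # Overriding __eq__ makes instances unhashable; state that explicitly.
    __hash__ = None


prices = Series([3, 8, 5])
print(prices > 4)   # [False, True, True]
print(prices == 5)  # [False, False, True]
```

Returning a list keeps the mask compatible with isinstance(name, list); a design that returned another Series of booleans would need a matching check in __getitem__.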

zeh

2 Answers

You need to implement __getitem__ so that it accepts a list of booleans and returns only the items whose corresponding entry is True. You will also need to implement the comparison operators (>, ==, etc.) so that they return that list of booleans, e.g. (proof of concept code):

class A(object):
    def __init__(self, data):
        self.data = data
    def __getitem__(self, key):
        return A([d for k, d in zip(key, self.data) if k])
    def __gt__(self, value):
        return [d > value for d in self.data]
    def __repr__(self):
        return str(self.__class__) + ' [' + ', '.join(str(d) for d in self.data) + ']'

>>> a = A(list(range(20)))
>>> a
<class '__main__.A'> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> a[a > 5]
<class '__main__.A'> [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
AChampion

I think you basically want something that just wraps a recarray or structured array.

import numpy as np

myarray = np.array([("Hello",2.5,3),
                        ("World",3.6,2),
                        ('Foobar',2,7)]).T

# np.rec is the public alias for the records module (np.core is deprecated in NumPy 2.x)
df = np.rec.fromarrays(myarray,
                       names='column1, column2, column3',
                       formats='S8, f8, i8')

print(df)
print(df[df.column3<=3])

While I don't use Pandas myself, a DataFrame seems very similar to a recarray. If you want to roll your own, be sure to read about subclassing ndarray. NumPy arrays can also be indexed with a boolean mask, such as:

myarray = np.array([(1,2.5,3.),
                        (2,3.6,2.),
                        (3,2,7.)])
print(myarray[myarray[:,2]<=3.])
JonB