0

How can I organize and filter a collection of objects by a field value in Python? I need to filter both by equality with an exact value and by being less than a value.

And how can I do it efficiently? If I store my objects in a list, I need to iterate over the whole list, which may hold hundreds of thousands of objects.

from dataclasses import dataclass


@dataclass
class Person:
    name: str
    salary: float
    is_boss: bool


# if the objects are stored in a list...
collection = [Person("Jack", 50000, False), ..., Person("Jane", 120000, True)]

# filtering in O(n), sloooooow
target = 100000
filtered_collection = [x for x in collection if x.salary < target]

PS: Actually, my use case is to group by a certain field, e.g. is_boss, and filter by another, e.g. salary. How can I do that? Should I use itertools.groupby over sorted lists and make my objects comparable?

Sengiley
  • If your `list` doesn't have any structure, then yeah, you'll have to check them all. You're going to need to be much more specific about your use case if you want improvements beyond `O(n)` filtering passes. `equals_target = [x for x in y if x.field == target]` is the straightforward approach, but that's trivial, and offers no savings (big-O-wise) over naive solutions involving regular `for` loops. – ShadowRanger Jun 21 '22 at 19:19
  • Can you provide an example/sample data and sample output? – Matt Andruff Jun 21 '22 at 19:19
  • yes, I have a collection of objects, and one option is to put them in a list – Sengiley Jun 21 '22 at 19:20
  • but I would filter only by a certain field; are there any options to organize my collection of objects for faster filtering? – Sengiley Jun 21 '22 at 19:21
  • @Sengiley: There's a billion options. You need to be more specific on what your real problem is to get useful suggestions, ideally in the form of a [MCVE] of what you're doing now that's inadequate. – ShadowRanger Jun 21 '22 at 19:22
  • as an example, think of a collection of Employees where I need to filter by salary being less than or equal to some threshold – Sengiley Jun 21 '22 at 19:22
  • @Sengiley: An actual [MCVE] please. Edit the code into the question. I suspect, if you need to repeatedly filter by different target salaries, the answer will be maintaining a sorted `list` of the employees and using the `bisect` module to quickly find the slice index in question, but guessing at your goals from incredibly limited details is not easy. – ShadowRanger Jun 21 '22 at 19:23
  • yeah, I've got your point, added a code example – Sengiley Jun 21 '22 at 19:29
  • as for `bisect`, can you also please provide an example? – Sengiley Jun 21 '22 at 19:31

2 Answers

1

If you maintain your list in sorted order (which ideally means few insertions or removals, because mid-list insertion/removal is itself O(n)), you can find the set of Persons below a given salary with the bisect module.

from bisect import bisect_left
from operator import attrgetter

# if the objects are stored in a list...
collection = [Person("Jack", 50000, False), ..., Person("Jane", 120000, True)]
collection.sort(key=attrgetter('salary'))  # O(n log n) initial sort

# filtering searches in O(log n); bisect_left matches the strict `salary < target`
# (use bisect_right, a.k.a. bisect, for `salary <= target`):
target = 100000
filtered_collection = collection[:bisect_left(collection, target, key=attrgetter('salary'))]

Note: The key argument to the various bisect module functions is only supported as of 3.10. In prior versions, you'd need to define the rich comparison operators for Person in terms of salary and search for a faked out Person object, or maintain ugly separate sorted lists, one of salary alone, and a parallel list of the Person objects.
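As a rough sketch of the "separate sorted lists" workaround for versions before 3.10 (the sample data and the helper name `below` are made up for illustration), you keep a list of bare salaries in step with the sorted list of Person objects and bisect on the salaries:

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    salary: float
    is_boss: bool


# Hypothetical sample data, for illustration only.
people = [
    Person("Jane", 120000, True),
    Person("Jack", 50000, False),
    Person("Ann", 80000, False),
]

# Sort once by salary, then build a parallel list holding only the keys.
people.sort(key=lambda p: p.salary)
salaries = [p.salary for p in people]


def below(target):
    """Return people with salary strictly below target.

    The search is O(log n); the slice copies the k matching items.
    """
    return people[:bisect_left(salaries, target)]


print([p.name for p in below(100000)])
```

The cost is that every insertion or removal has to be applied to both lists at the same index, which is why the approach is called ugly above.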

For adding individual elements to the collection, you could use bisect's insort function. Or you could just add a bunch of items to the end of the list in bulk and re-sort it on the same key as before (Python's sorting algorithm, Timsort, gets near O(n) performance when the collection is mostly in order already, so the cost is not as high as you might think).

I'll note that in practice, this sort of scenario (massive data that can be arbitrarily ordered by multiple fields) usually calls for a database. You might consider using sqlite3 (eventually switching to a more production-grade database like MySQL or PostgreSQL if needed), which, with appropriate indexes defined, would let you do O(log n) SELECTs on any indexed field; you could convert to Person objects on extraction for the data you actually need to work with. The B-trees that true DBMS solutions provide get you O(log n) effort for inserts, deletes and selects on the indexed fields, whereas Python's built-in collection types make you choose: only one of insertion/deletion or searching can be truly O(log n), while the other is O(n).
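A minimal sketch of the sqlite3 route, using an in-memory database. The table name, index name, and the helper `select_people` are assumptions made up for this example, as is the sample data:

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    salary: float
    is_boss: bool


# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, salary REAL, is_boss INTEGER)")
# Composite index: equality on is_boss first, then a range scan on salary.
conn.execute("CREATE INDEX idx_boss_salary ON person (is_boss, salary)")

# Hypothetical sample data, for illustration only.
rows = [("Jack", 50000, 0), ("Ann", 80000, 0), ("Jane", 120000, 1)]
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)


def select_people(is_boss, max_salary):
    """Fetch people in one is_boss group with salary strictly below max_salary."""
    cur = conn.execute(
        "SELECT name, salary, is_boss FROM person "
        "WHERE is_boss = ? AND salary < ? ORDER BY salary",
        (int(is_boss), max_salary),
    )
    # Convert rows back to Person objects only for the data actually needed.
    return [Person(name, salary, bool(boss)) for name, salary, boss in cur]


print([p.name for p in select_people(False, 100000)])
```

Additional indexes can be created for other fields if you later need to filter on them as well.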

ShadowRanger
  • thanks! so I can trade the time cost of creating and maintaining a list in sorted order for a speed-up in search? what if I additionally want to filter by another field along with salary? – Sengiley Jun 21 '22 at 19:42
  • @Sengiley: You can `sort` by multiple fields (the `attrgetter` happily lets you pass multiple names to it), so if you want a backup comparison by `name` or the like, that's possible (`key=attrgetter('salary', 'name')`). But if you need to sometimes filter by salary, sometimes by name, and sometimes by some other attribute, that's what true databases are for. No simple container type handles sorting the contents in many different ways simultaneously. – ShadowRanger Jun 21 '22 at 19:44
  • and my use case is not a database, just a huge in-memory collection of objects – Sengiley Jun 21 '22 at 19:45
  • @Sengiley: Your use case can be *made* into a database. `sqlite3` supports purely in-memory databases (you just open `":memory:"` instead of a real file name). And if you need to select from that huge collection using many different criteria, you *want* a database. – ShadowRanger Jun 21 '22 at 19:46
  • still, I was thinking of something like B-trees, but for Python's collections – Sengiley Jun 21 '22 at 19:47
  • and there is no way I will use a database; this is numerical code – Sengiley Jun 21 '22 at 19:48
  • your post is very insightful! can you please provide a code example of how to use `bisect` for maintaining and filtering objects by two fields? thanks! – Sengiley Jun 21 '22 at 19:52
0

Lists have a sort method. All you have to do is create a key function that returns the value each object should be sorted by. Let me show you:

class Foo:
    def __init__(self, bar):
        self.bar = bar

fooArray = [Foo(10), Foo(8), Foo(9)]

# key function: returns the value to sort each Foo by
def sortFoo(foo):
    return foo.bar

fooArray.sort(key=sortFoo)
  • Nitpicks: There's no need for `sortFoo`; either `fooArray.sort(key=lambda x: x.bar)` or (with `from operator import attrgetter` at the top of the file) `fooArray.sort(key=attrgetter('bar'))` would be the typical ways to do simple attribute retrieval for the key in this case. PEP8 naming never uses `mixedCase`, everything you named with `mixedCase` should be `lowercase` or `lowercase_with_underscores`. – ShadowRanger Jun 21 '22 at 19:27
  • thanks for your answer. I'm a numeric programmer mostly using numpy arrays, and I'm wondering how to efficiently maintain and sort a collection of objects. I think an ideal example would include some built-in object methods like `__eq__` or `__lt__` and some advanced data structures, like trees. I'm afraid that storing objects in a list is inefficient for my purpose – Sengiley Jun 21 '22 at 19:35
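The `__eq__`/`__lt__` route mentioned above could be sketched with a dataclass whose generated comparison methods look only at salary. This is an illustration under assumptions (invented sample data, a throwaway "probe" object), and it also works before Python 3.10 because no `key` argument is needed:

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass(order=True)
class Person:
    # compare=False excludes a field from the generated __eq__/__lt__/etc.,
    # so Person objects are ordered (and considered equal) by salary alone.
    name: str = field(compare=False)
    salary: float
    is_boss: bool = field(compare=False)


# Hypothetical sample data, for illustration only.
people = sorted([
    Person("Jane", 120000, True),
    Person("Jack", 50000, False),
    Person("Ann", 80000, False),
])

# Search with a dummy Person that carries only the target salary.
probe = Person("", 100000, False)
below = people[:bisect_left(people, probe)]

print([p.name for p in below])
```

One side effect to be aware of: two different people with the same salary now compare equal, which may be surprising elsewhere in the code.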