
I have a list of several thousand floats that I want to be able to slice by min and max values.

E.g., using:

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

(my actual list is 400,000 floats long, but the above is a working example)

I want something like

def listclamp(minn, maxn, nlist):

such that

print listclamp(3, 8, flist)

should give me

[3.3333, 5.4325, 7.6855]

I also need to do this 10,000 to 30,000 times, so speed does count.

(I have no sample code for what I've tried so far, because this is new python territory for me)

Martijn Pieters
Xodarap777
  • Filter the items and then sort them, or vice versa. Though filtering first is going to be better, as it will reduce the `N` in `N log N`. – Ashwini Chaudhary Nov 19 '14 at 22:04
  • Do you have any code to show what you've tried? – skrrgwasme Nov 19 '14 at 22:05
  • The first way I tried to do this takes under 1 microsecond, so doing this many thousands of times will still take a tiny fraction of a second. So does speed _really_ count? – abarnert Nov 19 '14 at 22:05
  • Did your list contain "several thousand floats", or just the example he gave? Also, did you slice it "many thousands of times"? – ventsyv Nov 19 '14 at 22:21
  • @ventsyv: I sliced it 100000 times, and the time per slice was 0.98us. But I only had 6 floats. Let me repeat the test with several thousand. – abarnert Nov 19 '14 at 22:23
  • I'm curious to find out. I've been doing Python for less than a year and I wonder how it compares to C++. – ventsyv Nov 19 '14 at 22:25
  • Apologies. My actual list contains about 400,000 floats. I need to filter it about 10,000 - 30,000 times. I didn't show code as I have it so far because I have no concept of this area of python. – Xodarap777 Nov 19 '14 at 22:25
  • What's the range of your data? Are your values pretty evenly spread out through the range, or is the data in clusters? Finally, how many significant digits do you have? – ventsyv Nov 19 '14 at 22:28
  • That last comment was for Xodarap777. – ventsyv Nov 19 '14 at 22:34
  • If you _really_ need speed, you probably should be using NumPy. Let me do a quick test that way and show how it compares. – abarnert Nov 19 '14 at 22:38
  • @ventsyv: The clustering and significant digits are an interesting point; if you're going to have 200K filtered values but only 5K unique-within-significant-digits values, selecting into bins will be a lot faster than any general-purpose sort, and it'll filter for free… – abarnert Nov 19 '14 at 22:50
  • The spread is uniform and the sigdigs are 6. – Xodarap777 Nov 19 '14 at 22:52
  • @Xodarap777: OK, but how narrow are the slices? In your example, 3 of the 6 values are selected by the clamp. In your real code, does that mean it'll usually be about 200K out of 400K selected? – abarnert Nov 19 '14 at 22:56
  • @abarnert: Good point. What I had in mind was that you might have clusters of useful data within a large amount of noise. In that case, filtering the noise out (through data normalization) and only taking the outliers is probably the better approach. That's what I meant, but you are making a great point about the duplicates. – ventsyv Nov 19 '14 at 23:02
  • @Xodarap777: how fast does it need to be? "I don't want to fall asleep at the keyboard before this thing finishes" fast or stock market rapid trading fast? Can you run abarnert's examples with your actual data and let us know what the numbers are? – ventsyv Nov 19 '14 at 23:05

3 Answers


The obvious thing to do is either sort then filter, or filter then sort.

If you have the same list every time, sorting first is obviously a win, because then you only need to sort once instead of every time. It also means you can use a binary search for the filtering instead of a linear walk (as explained in ventsyv's answer), although that probably won't pay off unless your lists are much longer than this one.

If you have different lists every time, filtering first is probably a win, because the sort is probably the slow part, and you're sorting a smaller list that way.
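
In function form, the two strategies look roughly like this (a minimal sketch, not tested against your data; the names `clamp_filter_sort` and `clamp_sort_bisect` are mine, and I treat both bounds as inclusive to match your example):

import bisect

def clamp_filter_sort(minn, maxn, nlist):
    # filter first, then sort the (hopefully much smaller) result
    return sorted(x for x in nlist if minn <= x <= maxn)

def clamp_sort_bisect(minn, maxn, sorted_nlist):
    # sorted_nlist must already be sorted; binary-search both bounds, then slice
    lo = bisect.bisect_left(sorted_nlist, minn)
    hi = bisect.bisect_right(sorted_nlist, maxn)
    return sorted_nlist[lo:hi]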

But let's stop speculating and start testing.

Using a list of several thousand floats, about half of which are in range:

In [1591]: flist = [random.random()*10 for _ in range(5000)]
In [1592]: %timeit sorted(x for x in flist if 3 <= x < 8)
100 loops, best of 3: 3.12 ms per loop
In [1593]: %timeit [x for x in sorted(flist) if 3 <= x < 8]
100 loops, best of 3: 4 ms per loop
In [1594]: %timeit l=sorted(flist); l[bisect.bisect_left(l, 3):bisect.bisect_right(l, 8)]
100 loops, best of 3: 3.36 ms per loop

So, filtering then sorting wins; ventsyv's algorithm does make up for part of the difference, but not all of it. But of course, if we only have a single list to sort, sorting it once instead of thousands of times is an obvious win:

In [1596]: l = sorted(flist)
In [1597]: %timeit l[bisect.bisect_left(l, 3):bisect.bisect_right(l, 8)]
10000 loops, best of 3: 29.2 µs per loop

So, if you have the same list over and over, obviously sort it once.

Otherwise, you could test on your real data… but we're talking about shaving up to 22% off of something that takes milliseconds. Even if you do it many thousands of times, that's saving you under a second. Just the cost of typing the different implementations—much less understanding them, generalizing them, debugging them, and performance testing them—is more than that.


But really, if you're doing millions of operations spread over hundreds of thousands of values, and speed is important, you shouldn't be using a list in the first place, you should be using a NumPy array. NumPy can store just the raw float values, without boxing them up as Python objects. Besides saving memory (and improving cache locality), this means that the inner loop in, say, np.sort is faster than the inner loop in sorted, because it doesn't have to make a Python function call that ultimately involves unboxing two numbers, it just has to do a comparison directly.

Assuming you're storing your values in an array in the first place, how does it stack up?

In [1607]: flist = np.random.random(5000) * 10
In [1608]: %timeit a = np.sort(flist); a = a[3 <= a]; a = a[a < 8]
1000 loops, best of 3: 742 µs per loop
In [1610]: b = np.sort(flist)  # the "same array over and over" case: sort once up front
In [1611]: %timeit c = b[3 <= b]; d = c[c < 8]
10000 loops, best of 3: 29.8 µs per loop

So, it's about 4x faster than filter-and-sort for the "different lists" case, even using a clunky algorithm (I was looking for something I could cram onto one %timeit line, rather than the fastest or most readable…). And for the "same list over and over" case, it's almost as fast as the bisect solution even without bisecting (but of course you can bisect with NumPy, too).
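
For completeness, here's a rough NumPy sketch of both cases (the function names are mine; for the reused-array case, `np.searchsorted` plays the role of `bisect`):

import numpy as np

def clamp_np(minn, maxn, arr):
    # different array every time: mask first, then sort the smaller result
    return np.sort(arr[(arr >= minn) & (arr <= maxn)])

def clamp_np_sorted(minn, maxn, sorted_arr):
    # same array over and over: pre-sort once (sorted_arr = np.sort(arr)),
    # then binary-search the bounds and slice
    lo = np.searchsorted(sorted_arr, minn, side='left')
    hi = np.searchsorted(sorted_arr, maxn, side='right')
    return sorted_arr[lo:hi]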

abarnert

Sort the list (if you use the same list over and over, sort it just once), then use binary search to find the positions of the lower and upper bounds. Come to think of it, there is a standard module that does just that: `bisect`.
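
A minimal sketch of that approach, assuming the same list is reused (the name `listclamp_sorted` is mine):

import bisect

sorted_flist = sorted(flist)   # sort once, up front

def listclamp_sorted(minn, maxn, sorted_nlist):
    # binary-search the insertion points of both bounds, then slice
    lo = bisect.bisect_left(sorted_nlist, minn)
    hi = bisect.bisect_right(sorted_nlist, maxn)
    return sorted_nlist[lo:hi]

print listclamp_sorted(3, 8, sorted_flist)   # [3.3333, 5.4325, 7.6855]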

ventsyv
  • Binary search replaces a linear filter with a logarithmic search, which is nice… but it also means you have to sort the whole list instead of the smaller filtered list, which could easily cost you more than you gained. – abarnert Nov 19 '14 at 22:08
  • I doubt that. He said he'll have to slice the list thousands of times; it will probably end up being faster to sort the list once. – ventsyv Nov 19 '14 at 22:10
  • I was assuming he needs to do this to thousands of *different* lists. You're right, if it's always the same list, he should just sort it once. – abarnert Nov 19 '14 at 22:11
  • Also, you have to sort the resulting slice; if the slices are big, it might not make much difference whether you are sorting the whole array or just the slice. But in the general case you are right: sorting will be too expensive if you don't slice the same array enough times. – ventsyv Nov 19 '14 at 22:18
  • As it turns out (at least from a quick test), even using different lists each time, you have to get to pretty narrow slices before sort-then-binsearch is much slower than filter-then-sort. See my (updated) answer for the numbers on my best guess at his real data. (Still, it's definitely _simpler_ to filter and sort than to sort and binary-filter, and since we're talking about saving under a second I'd go with simpler…) – abarnert Nov 19 '14 at 22:36

This will return the sorted list you want:

flist = [1.9842, 9.8713, 5.4325, 7.6855, 2.3493, 3.3333]

def listclamp(minn, maxn, nlist): 
    return sorted(filter(lambda x: minn <= x and x <= maxn, nlist))

print listclamp(3, 8, flist) 

A faster approach, using list comprehensions:

def listclamp2(minn, maxn, nlist): 
    return sorted([x for x in nlist if minn <= x and x <= maxn])

print listclamp2(3, 8, flist)

Note that depending on your data it may be better to filter the list first and then sort it (as I did in the code above).

For more information on performance, refer to this link.

syntagma
  • Just use `minn <= x <= maxn`. The conversion to float is unnecessary (and wasteful if you're going to do it over and over), and using two separate comparisons instead of a chained comparison is less readable and slower. (Also, I think a comprehension is going to be both more readable and faster than a `filter` call if you have to build a `lambda` just to use `filter`, but that's a judgment call and something to test with real data, respectively.) – abarnert Nov 19 '14 at 22:10
  • agree list comprehension is virtually always better than filter + lambda – wim Nov 19 '14 at 22:12
  • You've taken out the `float` calls, but you're still using separate comparisons instead of a chained comparison. (And a list instead of a generator expression in the comprehension version, but I don't think that'll make much difference here.) – abarnert Nov 19 '14 at 22:32