5

I have a list of integers and I want to write a function that returns the subset of numbers that fall within a range, something like a NumbersWithinRange(list, interval) function...

For example:

list = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
interval = [4,20]
results = NumbersWithinRange(list, interval)  # [4,4,6,8,7,8]

Maybe I forgot to write one more number in results, but that's the idea...

The list can be 10–20 million elements long, and the range typically spans a few hundred values.

Any suggestions on how to do this efficiently in Python? I was thinking of using bisect.

Thanks.

Dnaiel
  • You should not use `list` as a variable name. Python lets you (silently) reassign the built-in list constructor if you do... – the wolf Aug 24 '12 at 14:38
  • Correct, it was just for the example; I wouldn't use that name in real code. Thanks for the correction. – Dnaiel Aug 24 '12 at 15:28

8 Answers

7

I would use numpy for that, especially if the list is that long. For example:

In [100]: import numpy as np
In [101]: list = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
In [102]: list
Out[102]: 
array([  4,   2,   1,   7,   9,   4,   3,   6,   8,  97,   7,  65,   3,
         2,   2,  78,  23,   1,   3,   4,   5,  67,   8, 100])
In [103]: good = np.where((list > 4) & (list < 20)) 
In [104]: list[good]
Out[104]: array([7, 9, 6, 8, 7, 5, 8])

# %timeit says that numpy is MUCH faster than any list comprehension: 
# create an array 10**6 random ints b/w 0 and 100
In [129]: arr = np.random.randint(0,100,1000000)
In [130]: interval = xrange(4,21)
In [126]: %timeit r = [x for x in arr if x in interval]
1 loops, best of 3: 14.2 s per loop

In [136]: %timeit good = np.where((list > 4) & (list < 20)) ; new_list = list[good]
100 loops, best of 3: 10.8 ms per loop

In [134]: %timeit r = [x for x in arr if 4 < x < 20]
1 loops, best of 3: 2.22 s per loop 

In [142]: %timeit filtered = [i for i in ifilter(lambda x: 4 < x < 20, arr)]
1 loops, best of 3: 2.56 s per loop
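For reference, here is a minimal, self-contained sketch of the same boolean-mask approach in Python 3 syntax (indexing with the mask directly; `np.where` is not strictly needed), using a variable name that doesn't shadow the built-in:

```python
import numpy as np

nums = np.array([4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2,
                 78, 23, 1, 3, 4, 5, 67, 8, 100])

# Boolean mask selecting values strictly between 4 and 20,
# matching the bounds used in the timings above
mask = (nums > 4) & (nums < 20)
within = nums[mask]
print(within.tolist())  # [7, 9, 6, 8, 7, 5, 8]
```

The mask is computed in one vectorized pass over the array, which is where the speed-up over a Python-level loop comes from.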
Andy Hayden
reptilicus
  • How does the list comprehension do when you use `4 <= x <= 20` instead of `x in interval`? Checking if the value is in an `xrange` iterator is going to slow things down. – chepner Aug 24 '12 at 14:27
  • yeah, its faster by a factor of 6, but still about 4 orders of magnitude slower than numpy – reptilicus Aug 24 '12 at 14:34
  • Ah, yes, I see now. I wasn't paying close attention to the units (s vs ms). numpy is indeed very much faster. – chepner Aug 24 '12 at 14:50
  • your timeit doesn't include `arr[good]`. Enumerating items one by one in numpy array is slow, Python list should be faster. Also OP says that the resulting array should have ~ 100 items; use larger limits for randint. – jfs Aug 24 '12 at 14:52
  • No, this is always going to be faster in numpy than with a Python list due to strong typing in a numpy array. Added arr[good], no performance hit – reptilicus Aug 24 '12 at 14:56
  • @chepner, yup, its 10 millisec with numpy vs 2 **seconds** for pure python – reptilicus Aug 24 '12 at 14:58
  • Nice discussion here, I'll look into both for my code. In my case I'll be calling this function millions of times, so I really need it to be fast. I'll try them out. Thanks a lot! – Dnaiel Aug 24 '12 at 15:31
  • @user1443118: I meant that `[x for x in np_array if f(x)]` might be slower than `[x for x in py_list if f(x)]`. Naturally vectorized operations with numpy arrays that exclude pure Python loop should be faster. – jfs Aug 26 '12 at 01:04
6

The pure-Python sortedcontainers module has a SortedList type that can help you. It automatically maintains the list in sorted order and has been tested with tens of millions of elements. SortedList provides bisect methods you can use.

from sortedcontainers import SortedList
data = SortedList(...)

def NumbersWithinRange(items, lower, upper):
    start = items.bisect_left(lower)   # first index with value >= lower
    end = items.bisect_right(upper)    # first index with value > upper
    return items[start:end]

subset = NumbersWithinRange(data, 4, 20)

Bisecting and indexing will be much faster this way than scanning the entire list. The sortedcontainers module is very fast and has a performance comparison page with benchmarks against alternative implementations.
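If pulling in a third-party dependency isn't an option, the same two-bisect idea can be sketched with the standard-library bisect module on a plain list, assuming the data is sorted once up front:

```python
import bisect

def numbers_within_range(sorted_items, lower, upper):
    # First index whose value is >= lower
    start = bisect.bisect_left(sorted_items, lower)
    # First index whose value is > upper
    end = bisect.bisect_right(sorted_items, upper)
    return sorted_items[start:end]

data = sorted([4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2,
               78, 23, 1, 3, 4, 5, 67, 8, 100])
print(numbers_within_range(data, 4, 20))  # [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]
```

The difference from SortedList is that a plain list doesn't stay sorted under inserts, so this only pays off if you sort once and query many times.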

GrantJ
4

If the list isn't sorted, you need to scan the entire list:

lst = [ 4,2,1,...]
interval=[4,20]
results = [ x for x in lst if interval[0] <= x <= interval[1] ]

If the list is sorted, you can use bisect to find the left and right indices that bound your range:

import bisect

left = bisect.bisect_left(lst, interval[0])
right = bisect.bisect_right(lst, interval[1])

results = lst[left:right]

(bisect_left returns the first index whose value is >= interval[0], so the slice starts at left, not left+1.)

Since scanning the list is O(n) and sorting is O(n lg n), it probably is not worth sorting the list just to use bisect unless you plan on doing lots of range extractions.

chepner
  • Yes, I am planning on doing 100 million or more extractions, does that sound realistic or would it take forever? – Dnaiel Aug 24 '12 at 15:32
  • Hard to say, but use the numpy answer, as it is far faster than a pure-python solution. – chepner Aug 24 '12 at 15:34
2

I think this should be sufficiently efficient:

>>> nums = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> r = [x for x in nums if 4 <= x < 21]
>>> r
[4, 7, 9, 4, 6, 8, 7, 4, 5, 8]

Edit:

After J.F. Sebastian's excellent observation, I modified the code.

Burhan Khalid
  • That's going to work, but it will be slow if the list really is of order 10^6 – reptilicus Aug 24 '12 at 14:12
  • It's not stated whether the list is sorted; if it is, then bisect should be faster. – Burhan Khalid Aug 24 '12 at 14:25
  • `i in xrange` is not optimized (unlike `i in range` on Python 3). It is the same as `i in iterable`, i.e., it enumerates the values one by one. So `4 <= x < 21` should be used instead – jfs Aug 24 '12 at 14:28
1

Using iterators:

>>> from itertools import ifilter
>>> A = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> list(ifilter(lambda x: 4 < x < 20, A))
[7, 9, 6, 8, 7, 5, 8]
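On Python 3 the same idea works without itertools, since `itertools.ifilter` was removed and the built-in `filter` now returns a lazy iterator:

```python
# Python 3: the built-in filter returns a lazy iterator,
# so list() is needed to materialize the results
A = [4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2,
     78, 23, 1, 3, 4, 5, 67, 8, 100]
print(list(filter(lambda x: 4 < x < 20, A)))  # [7, 9, 6, 8, 7, 5, 8]
```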
Fredrik Pihl
1

Let's create a list similar to what you described:

import random
from itertools import ifilter
import numpy as np
import cmpthese  # third-party benchmarking module used below

l = [random.randint(-100000,100000) for i in xrange(1000000)]

Now test some possible solutions:

interval=range(400,800)

def v2():
    """ return a list """
    return [i for i in l if i in interval]

def v3():
    """ build a list from a generator expression """
    return list((i for i in l if i in interval))

def v4():
    def te(x):
        return x in interval

    return filter(te,l)

def v5():
    return [i for i in ifilter(lambda x: x in interval, l)]    


print len(v2()),len(v3()), len(v4()), len(v5())
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)

Prints this:

   rate/sec   usec/pass   v5    v4    v2    v3
v5        0 6929225.922   -- -0.4% -1.0% -1.6%
v4        0 6903028.488 0.4%    -- -0.6% -1.2%
v2        0 6861472.487 1.0%  0.6%    -- -0.6%
v3        0 6817855.477 1.6%  1.2%  0.6%    --

HOWEVER, watch what happens if interval is a set instead of a list:

interval=set(range(400,800))
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)

  rate/sec  usec/pass     v5     v4     v3     v2
v5        5 201332.569     -- -20.6% -62.9% -64.6%
v4        6 159871.578  25.9%     -- -53.2% -55.4%
v3       13  74769.974 169.3% 113.8%     --  -4.7%
v2       14  71270.943 182.5% 124.3%   4.9%     --

Now comparing with numpy:

na=np.array(l)

def v7():
    """ assume you have to convert from list => numpy array and return a list """
    arr=np.array(l)
    tgt = np.where((arr >= 400) & (arr < 800)) 
    return [arr[x] for x in tgt][0].tolist()


def v8():
    """ start with a numpy list but return a python list """
    tgt = np.where((na >= 400) & (na < 800)) 
    return na[tgt].tolist()


def v9():
    """ numpy all the way through """
    tgt = np.where((na >= 400) & (na < 800)) 
    return [na[x] for x in tgt][0]  
    # or return na[tgt] if you prefer that syntax...    

cmpthese.cmpthese([v2,v3,v4,v5, v7, v8,v9],micro=True, c=2)  

   rate/sec  usec/pass      v5      v4      v7     v3     v2     v8     v9
v5        5 185431.957      --  -17.4%  -24.7% -63.3% -63.4% -93.6% -93.6%
v4        7 153095.007   21.1%      --   -8.8% -55.6% -55.7% -92.3% -92.3%
v7        7 139570.475   32.9%    9.7%      -- -51.3% -51.4% -91.5% -91.5%
v3       15  67983.985  172.8%  125.2%  105.3%     --  -0.2% -82.6% -82.6%
v2       15  67861.438  173.3%  125.6%  105.7%   0.2%     -- -82.5% -82.5%
v8       84  11850.476 1464.8% 1191.9% 1077.8% 473.7% 472.6%     --  -0.0%
v9       84  11847.973 1465.1% 1192.2% 1078.0% 473.8% 472.8%   0.0%     --   

Clearly numpy is faster than pure Python as long as you can stay in numpy all the way through. Otherwise, making the interval a set instead of a list gives a substantial speed-up on its own (roughly two orders of magnitude in the timings above).
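The list-vs-set gap above comes down to membership cost: `x in interval` scans a list element by element, but hashes into a set. A small self-contained sketch (Python 3; smaller sizes than the benchmark above, for brevity):

```python
import random
import timeit

random.seed(0)  # reproducible data
values = [random.randint(-100000, 100000) for _ in range(100000)]

interval_list = list(range(400, 800))
interval_set = set(interval_list)

# Same results either way...
assert ([v for v in values if v in interval_list]
        == [v for v in values if v in interval_set])

# ...but list membership is O(len(interval)) per test,
# while set membership is O(1) on average
t_list = timeit.timeit(lambda: [v for v in values if v in interval_list], number=1)
t_set = timeit.timeit(lambda: [v for v in values if v in interval_set], number=1)
print("list: %.3fs  set: %.3fs" % (t_list, t_set))
```

For the OP's case (a contiguous range), `lo <= x <= hi` comparisons are cheaper still than either membership test.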

the wolf
  • That's not a fair comparison. `v3` doesn't actually do any of the work. You would need to compare `v2` with a version of `v3` which actually builds the list. – DSM Aug 24 '12 at 14:38
  • @DSM: fixed v3 to be a fair comparison – the wolf Aug 24 '12 at 16:18
0

I think you are looking for something like this:

a = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
b = [i for i in a if 4 <= i < 90]
print sorted(set(b))
[4, 5, 6, 7, 8, 9, 23, 65, 67, 78]

Note that set() drops duplicates here, which may or may not be what you want.
Baby Groot
user2290820
0

If your data set isn't too sparse, you could use "bins" to store and retrieve the data. For example:

a = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]

# Initialize a list of zeros: [0, 0, ...]
# This assumes the minimum possible value is 0
bins = [0 for _ in range(max(a) + 1)]

# Update the bins with the frequency of each number
for i in a:
    bins[i] += 1


def NumbersWithinRange(data, interval):
    result = []
    for i in range(interval[0], interval[1] + 1):
        freq = data[i]
        if freq > 0:
            result += [i] * freq
    return result

This works for this test case:

print(NumbersWithinRange(bins, [4, 20]))
# [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]

For simplicity, I omitted some bounds checking in the function.

To reiterate, this may work better in terms of space and time usage, but it depends heavily on your particular data set. The less sparse the data set, the better it will do.
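The omitted bounds checking mentioned above could look something like the following sketch (the clamping helper name is hypothetical, not from the original answer), shown in Python 3 syntax:

```python
def numbers_within_range_safe(bins, interval):
    # Hypothetical bounds-checked variant: clamp the requested
    # interval to the valid bin indices before iterating
    lo = max(interval[0], 0)
    hi = min(interval[1], len(bins) - 1)
    result = []
    for i in range(lo, hi + 1):
        result += [i] * bins[i]  # emit each value as often as it occurred
    return result

a = [4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2,
     78, 23, 1, 3, 4, 5, 67, 8, 100]
bins = [0] * (max(a) + 1)
for i in a:
    bins[i] += 1

print(numbers_within_range_safe(bins, [4, 20]))    # [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]
print(numbers_within_range_safe(bins, [-5, 1000])) # whole list, sorted
```

Note the output comes back sorted, since the bins are walked in index order; that differs from the answers that preserve the original order.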

dannyhan12