
I have a list object, and I want to know how many numbers fall into each particular interval. The code is as follows:

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
a = list(map(lambda x:int(x/interval),a))
for i in range(min(a),max(a)+1):
    print(i*interval,(i+1)*interval,':',a.count(i))

Output

0 3 : 2
3 6 : 4
6 9 : 5
9 12 : 1
12 15 : 1
15 18 : 1
18 21 : 0
21 24 : 0
24 27 : 0
27 30 : 1

Is there a simple way to get this information? The simpler the better.
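
For reference, the same integer-division binning used above can be written a bit more compactly with collections.Counter; this is only a sketch of the idea already in the question, not one of the answers below:

from collections import Counter

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3

# Map each value to its bin index, then count occurrences per bin.
counts = Counter(x // interval for x in a)
for i in range(min(counts), max(counts) + 1):
    print(i * interval, (i + 1) * interval, ':', counts.get(i, 0))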


3 Answers


Now that we're talking about performance, I'd like to offer my numpy solution using bincount:

import numpy as np

interval = 3
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
l = max(a) // interval + 1  # number of interval-wide bins
b = np.bincount(a, minlength=l*interval).reshape((l, interval)).sum(axis=1)  # counts per bin

(minlength is necessary just to be able to reshape: without it the bincount array has max(a) + 1 entries, which in general isn't a multiple of interval)
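
A quick shape check (just a sketch, using the same a and interval as above) shows why the padding matters: without minlength the bincount has 29 entries, which cannot be reshaped to (10, 3):

import numpy as np

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
l = max(a) // interval + 1                           # 10 bins

print(np.bincount(a).shape)                          # (29,) - not reshapeable to (10, 3)
print(np.bincount(a, minlength=l * interval).shape)  # (30,) - reshapes cleanly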

With the labels taken from Erfan's answer we get:

rnge = range(0, max(a) + interval + 1, interval)
labels = [f'[{i}-{j})' for i, j in zip(rnge[:-1], rnge[1:])]

for label, count in zip(labels, b):
    print(label, count)

[0-3) 2
[3-6) 4
[6-9) 5
[9-12) 1
[12-15) 1
[15-18) 1
[18-21) 0
[21-24) 0
[24-27) 0
[27-30) 1

This is much faster than the pandas solution.

Performance and scaling comparison

In order to assess the scaling behaviour, I replaced a with [1, ..., 28] * n and timed the execution (without imports and printing) for n = 1, 10, 100, 1K, 10K and 100K:

[log-log plot of execution time versus n for the three solutions]

(python 3.7.3 on win32 / pandas 0.24.2 / numpy 1.16.2)
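
The exact benchmark script isn't reproduced here, but a minimal harness along these lines (using timeit; the function names are placeholders, not taken from the answers) produces comparable relative numbers:

import timeit
import numpy as np
import pandas as pd

def bincount_counts(a, interval=3):
    # numpy solution from this answer
    l = max(a) // interval + 1
    return np.bincount(a, minlength=l * interval).reshape((l, interval)).sum(axis=1)

def pandas_counts(a, interval=3):
    # pandas solution from Erfan's answer
    s = pd.Series(a)
    bins = pd.cut(s, range(0, s.max() + interval, interval), right=False)
    return s.groupby(bins).count()

base = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
for n in (1, 10, 100, 1000):
    a = base * n
    t_np = timeit.timeit(lambda: bincount_counts(a), number=20)
    t_pd = timeit.timeit(lambda: pandas_counts(a), number=20)
    print(f'n={n}: numpy {t_np:.4f}s, pandas {t_pd:.4f}s (20 runs)')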

  • @jezrael: I guess this wouldn't fundamentally change the picture, as the trend is already clearly visible. But I can try and add some more tests. – Stef Jul 09 '19 at 15:03
  • ok, let me know after the test; also [perfplot](https://github.com/nschloe/perfplot) should be used here – jezrael Jul 09 '19 at 15:08
  • @jezrael: I've updated the tests for list lengths up to 1.5 million elements. Starting from n = 1K (15000 elements) there's no more qualitative change between the three algorithms, so I expect the lines to keep running parallel in the log-log plot even for still bigger values of n. – Stef Jul 09 '19 at 15:45

Pandas solution with pd.cut and groupby

import pandas as pd

s = pd.Series(a)
bins = pd.cut(s, range(0, s.max() + interval, interval), right=False)
s.groupby(bins).count()

[0, 3)      2
[3, 6)      4
[6, 9)      5
[9, 12)     1
[12, 15)    1
[15, 18)    1
[18, 21)    0
[21, 24)    0
[24, 27)    0
[27, 30)    1
dtype: int64

To get cleaner bin labels in the result, we can use the method from the linked answer:

s = pd.Series(a)
rnge = range(0, s.max() + interval, interval)
labels = [f'{i}-{j}' for i, j in zip(rnge[:-1], rnge[1:])]
bins = pd.cut(s, rnge, right=False, labels=labels)
s.groupby(bins).count()

0-3      2
3-6      4
6-9      5
9-12     1
12-15    1
15-18    1
18-21    0
21-24    0
24-27    0
27-30    1
dtype: int64
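
If a plain dictionary is preferred over a Series, the labelled result converts directly; this is a small usage note, not part of the original answer:

# e.g. {'0-3': 2, '3-6': 4, ..., '27-30': 1}
s.groupby(bins).count().to_dict()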

You can do it in one line using a dictionary comprehension:

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]

# max(a) + 1 ensures the largest value (28) still gets a bin
{"[{};{}[".format(x, x+3): len([y for y in a if x <= y < x+3])
 for x in range(min(a), max(a) + 1, 3)}

Output:

{'[1;4[': 3,
 '[4;7[': 4,
 '[7;10[': 5,
 '[10;13[': 1,
 '[13;16[': 0,
 '[16;19[': 1,
 '[19;22[': 0,
 '[22;25[': 0,
 '[25;28[': 0,
 '[28;31[': 1}

Performance comparison:

Pandas solution with pd.cut and groupby : 8.51 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dictionary comprehension : 19.7 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using np.bincount : 22.4 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  • Your solution might be good for small data since it skips all the pandas overhead. But it is slow for larger data due to the lack of vectorization (try `1e4` data points and you will see the difference). – Quang Hoang Jul 09 '19 at 12:39
  • You're right, list and dictionary comprehensions don't scale as well as vectorized functions. But as a consequence they are often overlooked when they should be preferred for small data; as you can see, it's 1000 times faster here. – vlemaistre Jul 09 '19 at 12:44
  • @vlemaistre could you include [my solution](https://stackoverflow.com/a/56953697/3944322) in your performance comparison? At least on my computer it's way faster than the pandas solution. Can you verify this? – Stef Jul 09 '19 at 13:35
  • @Stef I included it. It is indeed 2.5 times faster than using pandas cut and groupby. – vlemaistre Jul 09 '19 at 13:42
  • @vlemaistre Thank you. On Windows 32bit python 3.7 / pandas 0.24.2 / numpy 1.16.2 I get a performance difference of about 100 times. Do you have any idea why there is such a big difference between installations? – Stef Jul 09 '19 at 13:53
  • We should have similar results, it's weird. I executed everything from the creation of the array a to the end result for all of the methods. How did you proceed? – vlemaistre Jul 09 '19 at 13:55
  • I did the same, except for the printing of the results. See [here](https://pastebin.com/7FMeygn6) for the exact code and results. – Stef Jul 09 '19 at 14:13
  • You're right, I get the same thing when I remove the print! My bad for including the print, I should've removed it. I'll update your result. – vlemaistre Jul 09 '19 at 14:15