
I have a list object, and I want to know how many numbers fall into each particular interval. The code is as follows:

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
a = list(map(lambda x:int(x/interval),a))
for i in range(min(a),max(a)+1):
    print(i*interval,(i+1)*interval,':',a.count(i))

Output

0 3 : 2
3 6 : 4
6 9 : 5
9 12 : 1
12 15 : 1
15 18 : 1
18 21 : 0
21 24 : 0
24 27 : 0
27 30 : 1

Is there a simple way to get this information? The simpler the better.
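
For reference, the same integer-division binning used above can be written a bit more compactly with collections.Counter; this is only a sketch of the idea already in the question, not one of the answers below:

from collections import Counter

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3

# Map each value to its bin index, then count occurrences per bin.
counts = Counter(x // interval for x in a)
for i in range(min(counts), max(counts) + 1):
    print(i * interval, (i + 1) * interval, ':', counts.get(i, 0))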


3 Answers


Now that we're talking about performance, I'd like to offer my numpy solution using bincount:

import numpy as np

interval = 3
a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
l = max(a) // interval + 1  # number of interval-wide bins
b = np.bincount(a, minlength=l*interval).reshape((l, interval)).sum(axis=1)  # counts per bin

(minlength is necessary just to be able to reshape: without it the bincount array has max(a) + 1 entries, which in general isn't a multiple of interval)
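
A quick shape check (just a sketch, using the same a and interval as above) shows why the padding matters: without minlength the bincount has 29 entries, which cannot be reshaped to (10, 3):

import numpy as np

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
interval = 3
l = max(a) // interval + 1                           # 10 bins

print(np.bincount(a).shape)                          # (29,) - not reshapeable to (10, 3)
print(np.bincount(a, minlength=l * interval).shape)  # (30,) - reshapes cleanly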

With the labels taken from Erfan's answer we get:

rnge = range(0, max(a) + interval + 1, interval)
labels = [f'[{i}-{j})' for i, j in zip(rnge[:-1], rnge[1:])]

for label, count in zip(labels, b):
    print(label, count)

[0-3) 2
[3-6) 4
[6-9) 5
[9-12) 1
[12-15) 1
[15-18) 1
[18-21) 0
[21-24) 0
[24-27) 0
[27-30) 1

This is much faster than the pandas solution.

Performance and scaling comparison

In order to assess the scaling behaviour, I replaced a with [1, ..., 28] * n and timed the execution (without imports and printing) for n = 1, 10, 100, 1K, 10K and 100K:

[log-log plot of execution time versus n for the three solutions]

(python 3.7.3 on win32 / pandas 0.24.2 / numpy 1.16.2)
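
The exact benchmark script isn't reproduced here, but a minimal harness along these lines (using timeit; the function names are placeholders, not taken from the answers) produces comparable relative numbers:

import timeit
import numpy as np
import pandas as pd

def bincount_counts(a, interval=3):
    # numpy solution from this answer
    l = max(a) // interval + 1
    return np.bincount(a, minlength=l * interval).reshape((l, interval)).sum(axis=1)

def pandas_counts(a, interval=3):
    # pandas solution from Erfan's answer
    s = pd.Series(a)
    bins = pd.cut(s, range(0, s.max() + interval, interval), right=False)
    return s.groupby(bins).count()

base = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]
for n in (1, 10, 100, 1000):
    a = base * n
    t_np = timeit.timeit(lambda: bincount_counts(a), number=20)
    t_pd = timeit.timeit(lambda: pandas_counts(a), number=20)
    print(f'n={n}: numpy {t_np:.4f}s, pandas {t_pd:.4f}s (20 runs)')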

  • @jezrael: I guess this wouldn't fundamentally change the picture, as the trend is already clearly visible. But I can try and add some more tests. – Stef Jul 09 '19 at 15:03
  • ok, let me know after the test; also [perfplot](https://github.com/nschloe/perfplot) should be used here – jezrael Jul 09 '19 at 15:08
  • @jezrael: I've updated the tests for list lengths up to 1.5 million elements. Starting from n = 1K (15000 elements) there's no more qualitative change between the three algorithms, so I expect the lines to keep running parallel in the log-log plot even for still bigger values of n. – Stef Jul 09 '19 at 15:45

Pandas solution with pd.cut and groupby

import pandas as pd

s = pd.Series(a)
bins = pd.cut(s, range(0, s.max() + interval, interval), right=False)
s.groupby(bins).count()

[0, 3)      2
[3, 6)      4
[6, 9)      5
[9, 12)     1
[12, 15)    1
[15, 18)    1
[18, 21)    0
[21, 24)    0
[24, 27)    0
[27, 30)    1
dtype: int64

To get cleaner bin labels in the result, we can use the method from the linked answer:

s = pd.Series(a)
rnge = range(0, s.max() + interval, interval)
labels = [f'{i}-{j}' for i, j in zip(rnge[:-1], rnge[1:])]
bins = pd.cut(s, rnge, right=False, labels=labels)
s.groupby(bins).count()

0-3      2
3-6      4
6-9      5
9-12     1
12-15    1
15-18    1
18-21    0
21-24    0
24-27    0
27-30    1
dtype: int64
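
If a plain dictionary is preferred over a Series, the labelled result converts directly; this is a small usage note, not part of the original answer:

# e.g. {'0-3': 2, '3-6': 4, ..., '27-30': 1}
s.groupby(bins).count().to_dict()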

You can do it in one line using a dictionary comprehension:

a = [1, 7, 4, 7, 4, 8, 5, 2, 17, 8, 3, 12, 9, 6, 28]

# max(a) + 1 ensures the largest value (28) still gets a bin
{"[{};{}[".format(x, x+3): len([y for y in a if x <= y < x+3])
 for x in range(min(a), max(a) + 1, 3)}

Output:

{'[1;4[': 3,
 '[4;7[': 4,
 '[7;10[': 5,
 '[10;13[': 1,
 '[13;16[': 0,
 '[16;19[': 1,
 '[19;22[': 0,
 '[22;25[': 0,
 '[25;28[': 0,
 '[28;31[': 1}

Performance comparison:

Pandas solution with pd.cut and groupby : 8.51 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dictionary comprehension : 19.7 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using np.bincount : 22.4 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  • Your solution might be good for small data since it skips all the pandas overhead. But it is slow for larger data due to the lack of vectorization (try `1e4` data points and you will see the difference). – Quang Hoang Jul 09 '19 at 12:39
  • You're right, list and dictionary comprehensions don't scale as well as vectorized functions. But as a consequence they are often overlooked when they should be preferred for small data; as you can see, it's 1000 times faster here. – vlemaistre Jul 09 '19 at 12:44
  • @vlemaistre could you include [my solution](https://stackoverflow.com/a/56953697/3944322) in your performance comparison? At least on my computer it's way faster than the pandas solution. Can you verify this? – Stef Jul 09 '19 at 13:35
  • @Stef I included it. It is indeed 2.5 times faster than using pandas cut and groupby. – vlemaistre Jul 09 '19 at 13:42
  • @vlemaistre Thank you. On Windows 32bit python 3.7 / pandas 0.24.2 / numpy 1.16.2 I get a performance difference of about 100 times. Do you have any idea why there is such a big difference between installations? – Stef Jul 09 '19 at 13:53
  • We should have similar results, it's weird. I executed everything from the creation of the array a to the end result for all of the methods. How did you proceed? – vlemaistre Jul 09 '19 at 13:55
  • I did the same, except for the printing of the results. See [here](https://pastebin.com/7FMeygn6) for the exact code and results. – Stef Jul 09 '19 at 14:13
  • You're right, I get the same thing when I remove the print! My bad for including the print, I should've removed it. I'll update your result. – vlemaistre Jul 09 '19 at 14:15