
I have a data frame df with some basic web stats ranked by Page Views (PVs):

URL  PVs
1    1500
2    1200
3    900
4    700
:
100  25

I am trying to filter and count the number of URLs that contribute to different percentiles of page views (PVs). Say, I want to know how many URLs, and which ones, brought 90% of PVs (or 10%).

I calculated percentiles:

df.quantile(np.linspace(.1, 1, 9, endpoint=False))

And I know I can iterate through rows like this (so I can sum them up):

for index, row in df.iterrows():
    print(row['PVs'])

But I cannot figure out how to stop when a certain threshold is reached. Will appreciate your help!
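(For reference, the kind of loop being asked about might be sketched like this; the sample frame is hypothetical and assumes, as in the question, that it is already sorted by PVs descending:)

```python
import pandas as pd

# Hypothetical frame shaped like the one in the question, sorted by PVs descending.
df = pd.DataFrame({'URL': [1, 2, 3, 4, 100],
                   'PVs': [1500, 1200, 900, 700, 25]})

threshold = 0.9 * df['PVs'].sum()  # stop once 90% of all PVs is covered
total = 0
urls = []
for index, row in df.iterrows():
    total += row['PVs']
    urls.append(int(row['URL']))
    if total >= threshold:
        break

print(urls)  # URLs that together cover >= 90% of PVs
```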

aviss

2 Answers


I think you need to count the sum of `True` values produced by the conditions:

a = (df['PVs'] > df['PVs'].quantile(0.9)).sum()
print(a)
1
df1 = df[df['PVs'] > df['PVs'].quantile(0.9)]
print(df1)
   URL   PVs
0    1  1500

a = (df['PVs'] < df['PVs'].quantile(0.1)).sum()
print(a)
1
df1 = df[df['PVs'] < df['PVs'].quantile(0.1)]
print(df1)
   URL  PVs
4  100   25

And if you need counts for all quantiles:

df1 = df.groupby(pd.qcut(df['PVs'], 10)).size()
print(df1)
PVs
(24.999, 295.0]     1
(295.0, 565.0]      0
(565.0, 740.0]      1
(740.0, 820.0]      0
(820.0, 900.0]      1
(900.0, 1020.0]     0
(1020.0, 1140.0]    0
(1140.0, 1260.0]    1
(1260.0, 1380.0]    0
(1380.0, 1500.0]    1
dtype: int64
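(Not part of the original answer: since the question asks which URLs bring 90% of total PVs, a vectorized sketch using a cumulative share works too. It assumes the frame is sorted by PVs descending; the sample data is hypothetical.)

```python
import pandas as pd

# Hypothetical frame shaped like the one in the question, sorted by PVs descending.
df = pd.DataFrame({'URL': [1, 2, 3, 4, 100],
                   'PVs': [1500, 1200, 900, 700, 25]})

share = df['PVs'].cumsum() / df['PVs'].sum()  # cumulative share of total PVs
n = (share < 0.9).sum() + 1                   # rows needed to first reach 90%
print(df['URL'].head(n).tolist())             # -> [1, 2, 3, 4]
```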
jezrael
    I probably will use both answers for slightly different tasks but this answer was exactly what I needed. Thanks as usual! – aviss Jul 11 '17 at 19:05

Consider the series of urls

s = pd.Series(np.random.randint(100, size=10000), name='URL')

Get the counts using `pd.Series.value_counts` with the `normalize=True` option. Also, make sure to sort ascending with `ascending=True`.

vc = s.value_counts(normalize=True, ascending=True)

vc is now a series with URLs in the index and normalized counts as the values. Because it is sorted ascending, we can perform a cumulative sum and pluck out the positions of the items at the breakpoints you are looking for.

a = vc.cumsum().searchsorted(np.linspace(.1, 1, 9, endpoint=False))

vc.index[a]

Int64Index([64, 40, 20, 18, 9, 45, 67, 30, 77], dtype='int64')

We can inspect the results:

a = vc.cumsum().searchsorted(np.linspace(.1, 1, 9, endpoint=False))
pd.concat([vc.cumsum().iloc[a], vc.iloc[a]], axis=1, keys=['Cumsum', 'Normalized'])

    Cumsum  Normalized
64  0.1075      0.0089
40  0.2083      0.0094
20  0.3036      0.0096
18  0.4010      0.0099
9   0.5010      0.0101
45  0.6032      0.0103
67  0.7084      0.0106
30  0.8049      0.0108
77  0.9053      0.0114
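(A self-contained version of the pipeline above, with a seed added for reproducibility; the seed is not in the original answer:)

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed added so the run is reproducible
s = pd.Series(np.random.randint(100, size=10000), name='URL')

# Normalized counts sorted ascending, so the cumulative sum is monotone.
vc = s.value_counts(normalize=True, ascending=True)

# Positions of the items sitting at the 10%, 20%, ..., 90% cumulative marks.
a = vc.cumsum().searchsorted(np.linspace(.1, 1, 9, endpoint=False))
print(vc.index[a])
```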
piRSquared
    If `ascending=True` slows it down, we can skip it, because the later `.cumsum()` guarantees the sorted nature. – Divakar Jul 11 '17 at 17:45
  • @Divakar absolutely correct. In my test example, I create `vc` as a proxy for OP's data. The `cumsum` portion is the same either way. – piRSquared Jul 11 '17 at 17:50