I'm using qcut from Pandas to prepare my data for a machine learning algorithm. I have products with prices, and I discretized the prices into equal-sized buckets with this code:

df['PriceBucket'] = pd.qcut(df['sell_prix'].sort_values(), 10, labels=False)

And this code to see the price interval behind each label:

df['PriceBucketTitle'] = pd.qcut(df['sell_prix'].sort_values(), 10)

As seen below, I get PriceBucket and PriceBucketTitle, and that's perfect! Now I want the number of products that fall into each bucket. This code returns NaN values (as seen below):

df['products_by_number'] = pd.qcut(df['sell_prix'], 10, labels=False).value_counts()

I know it might be feasible with a groupby on PriceBucket, but I want to keep my data format (one row per product). This is the result:

      sell_prix PriceBucket PriceBucketTitle    products_by_number
4668    8.0          2         (6.5, 8.5]            NaN
4669    8.0          2         (6.5, 8.5]            NaN
4670    8.0          2         (6.5, 8.5]            NaN
4671    8.0          2         (6.5, 8.5]            NaN
4672    8.0          2         (6.5, 8.5]            NaN
4673    8.0          2         (6.5, 8.5]            NaN
4674    8.0          2         (6.5, 8.5]            NaN
4675    8.0          2         (6.5, 8.5]            NaN
4676    8.0          2         (6.5, 8.5]            NaN
4677    8.0          2         (6.5, 8.5]            NaN
11902   15.0         5         (12.9, 15]            NaN
11903   15.0         5         (12.9, 15]            NaN
11904   15.0         5         (12.9, 15]            NaN
11905   15.0         5         (12.9, 15]            NaN
11906   15.0         5         (12.9, 15]            NaN
11907   15.0         5         (12.9, 15]            NaN
11908   15.0         5         (12.9, 15]            NaN
11909   15.0         5         (12.9, 15]            NaN
11910   15.0         5         (12.9, 15]            NaN
11911   15.0         5         (12.9, 15]            NaN
12065   11.0         4         (10, 12.9]            NaN
12066   11.0         4         (10, 12.9]            NaN
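
One likely reason for the NaN values above is index alignment: `value_counts()` returns a Series indexed by bucket label, while the column assignment aligns on the DataFrame's row labels. A minimal sketch of that effect, using hypothetical toy prices and 3 buckets instead of 10 (the real data is not shown in full):

```python
import pandas as pd

# Hypothetical toy data; the row labels mimic the ones in the output above.
df = pd.DataFrame(
    {"sell_prix": [8.0, 8.0, 15.0, 15.0, 11.0, 11.0]},
    index=[4668, 4669, 11902, 11903, 12065, 12066],
)

buckets = pd.qcut(df["sell_prix"], 3, labels=False)
counts = buckets.value_counts()

# counts is indexed by bucket label (0, 1, 2), not by the row labels of df.
print(sorted(counts.index.tolist()))

# Column assignment aligns on the index. No row label of df matches a
# bucket label here, so every row ends up as NaN.
df["products_by_number"] = counts
print(df["products_by_number"].isna().all())
```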

For example, this is what I want:

      sell_prix PriceBucket PriceBucketTitle    products_by_number
4668    8.0          2         (6.5, 8.5]            984546.0
4669    8.0          2         (6.5, 8.5]            984546.0
4670    8.0          2         (6.5, 8.5]            984546.0
4671    8.0          2         (6.5, 8.5]            984546.0
4672    8.0          2         (6.5, 8.5]            984546.0
4673    8.0          2         (6.5, 8.5]            984546.0
4674    8.0          2         (6.5, 8.5]            984546.0
4675    8.0          2         (6.5, 8.5]            984546.0
4676    8.0          2         (6.5, 8.5]            984546.0
4677    8.0          2         (6.5, 8.5]            984546.0
11902   15.0         5         (12.9, 15]            1028141.0
11903   15.0         5         (12.9, 15]            1028141.0
11904   15.0         5         (12.9, 15]            1028141.0
11905   15.0         5         (12.9, 15]            1028141.0
11906   15.0         5         (12.9, 15]            1028141.0
11907   15.0         5         (12.9, 15]            1028141.0
11908   15.0         5         (12.9, 15]            1028141.0
11909   15.0         5         (12.9, 15]            1028141.0
11910   15.0         5         (12.9, 15]            1028141.0
11911   15.0         5         (12.9, 15]            1028141.0
12065   11.0         4         (10, 12.9]            48998.0
12066   11.0         4         (10, 12.9]            48998.0
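
A possible way to get this shape, sketched on hypothetical toy prices with 3 buckets instead of 10, is to broadcast each bucket's size back to its rows with `groupby(...).transform("count")`, or to map the bucket column through its own `value_counts()`. Both keep one row per product. This is a sketch under those assumptions, not tested on the real data:

```python
import pandas as pd

# Hypothetical toy data chosen so that the bin edges are unique.
df = pd.DataFrame({"sell_prix": [8.0, 7.0, 8.0, 15.0, 14.0, 11.0]})
df["PriceBucket"] = pd.qcut(df["sell_prix"], 3, labels=False)

# Option 1: count the rows in each bucket and broadcast the count
# back to every row of that bucket.
df["products_by_number"] = (
    df.groupby("PriceBucket")["sell_prix"].transform("count")
)

# Option 2: look up each row's bucket label in the value_counts Series.
alt = df["PriceBucket"].map(df["PriceBucket"].value_counts())

print(df)
```

Note that `transform("count")` ignores missing prices within a bucket, so the two options can differ if `sell_prix` has NaN values that still carry a bucket label.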

Any help? Thanks!

Arij SEDIRI
  • For me `df['PriceBucket'] = pd.qcut(df['sell_prix'].sort_values(), 10, labels=False)` doesn't work, because it raises `ValueError: Bin edges must be unique:` – jezrael Jul 21 '16 at 12:34
  • @jezrael: That's expected, you don't have all the data! – Arij SEDIRI Jul 21 '16 at 12:44
  • No problem, but I cannot find a solution because of these duplicates. Can you add a sample that works with only 5 - 8 rows? – jezrael Jul 21 '16 at 12:45