11

I want to calculate conditional probabilites of ratings('A','B','C') in ratings column.

    company     model    rating   type
0   ford       mustang     A      coupe
1   chevy      camaro      B      coupe
2   ford       fiesta      C      sedan
3   ford       focus       A      sedan
4   ford       taurus      B      sedan
5   toyota     camry       B      sedan

Output:

Prob(rating=A) = 0.333333 
Prob(rating=B) = 0.500000 
Prob(rating=C) = 0.166667 

Prob(type=coupe|rating=A) = 0.500000 
Prob(type=sedan|rating=A) = 0.500000 
Prob(type=coupe|rating=B) = 0.333333 
Prob(type=sedan|rating=B) = 0.666667 
Prob(type=coupe|rating=C) = 0.000000 
Prob(type=sedan|rating=C) = 1.000000 

Any help, Thanks..!!

Chandu
  • 2,053
  • 3
  • 25
  • 39

5 Answers5

20

You can use .groupby() and the built-in .div():

rating_probs = df.groupby('rating').size().div(len(df))

rating
A    0.333333
B    0.500000
C    0.166667

and the conditional probs:

df.groupby(['type', 'rating']).size().div(len(df)).div(rating_probs, axis=0, level='rating')

coupe  A         0.500000
       B         0.333333
sedan  A         0.500000
       B         0.666667
       C         1.000000
Stefan
  • 41,759
  • 13
  • 76
  • 81
  • I believe the `level` parameter should be 0, and not 1. – Patthebug Feb 09 '17 at 23:30
  • Have you tried? I get missing values when I change the working version above to `level=0`. The `rating_probs` would have to align with the `rating` level, which is level 1. You can also change `level=1` to `level='rating`` to better see why this works. Just edited accordingly. – Stefan Feb 10 '17 at 17:58
  • Interesting. I was getting results with `level=1`, but those results were incorrect. Changing it to `level=0` gave me the correct conditional probability values. – Patthebug Feb 10 '17 at 18:16
  • Just to clarify - are you saying the above results are wrong, or that you are getting different results using the above code? – Stefan Feb 10 '17 at 18:21
  • The above results look correct. I was getting incorrect results in my dataset, which I corrected using `level=0`. – Patthebug Feb 10 '17 at 18:23
  • I put 90 rows in - get 89 back. – Ramin Melikov Sep 30 '20 at 16:30
  • The example has 6 rows, the result returns 5 rows because not all pairs exist. Perhaps your case is similar @RaminMelikov? – Stefan Sep 30 '20 at 20:42
4

You can use groupby:

In [2]: df = pd.DataFrame({'company': ['ford', 'chevy', 'ford', 'ford', 'ford', 'toyota'],
                     'model': ['mustang', 'camaro', 'fiesta', 'focus', 'taurus', 'camry'],
                     'rating': ['A', 'B', 'C', 'A', 'B', 'B'],
                     'type': ['coupe', 'coupe', 'sedan', 'sedan', 'sedan', 'sedan']})

In [3]: df.groupby('rating').count()['model'] / len(df)
Out[3]:
rating
A    0.333333
B    0.500000
C    0.166667
Name: model, dtype: float64

In [4]: (df.groupby(['rating', 'type']).count() / df.groupby('rating').count())['model']
Out[4]:
rating  type
A       coupe    0.500000
        sedan    0.500000
B       coupe    0.333333
        sedan    0.666667
C       sedan    1.000000
Name: model, dtype: float64
Gustavo Bezerra
  • 9,984
  • 4
  • 40
  • 48
3

You need add reindex for add 0 values for missing pairs:

mux = pd.MultiIndex.from_product([df['rating'].unique(), df['type'].unique()])
s = (df.groupby(['rating', 'type']).count() / df.groupby('rating').count())['model']
s = s.reindex(mux, fill_value=0)
print (s)
A  coupe    0.500000
   sedan    0.500000
B  coupe    0.333333
   sedan    0.666667
C  coupe    0.000000
   sedan    1.000000
Name: model, dtype: float64

And another solution, thanks Zero:

s.unstack(fill_value=0).stack()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

pd.crosstab(df.type, df.rating, margins=True, normalize="index")

   rating     A       B       C
   type                           
   coupe   0.500000  0.5  0.000000
   sedan   0.250000  0.5  0.250000
   All     0.333333  0.5  0.166667

Here the All row gives you probabilities for A, B, and C, now for conditional probabilities.

pd.crosstab(df.type, df.rating, margins=True, normalize="columns")

 rating   A      B       C     All
 type                                
 coupe   0.5  0.333333  0.0  0.333333
 sedan   0.5  0.666667  1.0  0.666667

Here your conditional probabilities are in the table for example conditional probability for a given type is a coupe and it has an A rating is 0.5 in row coupe and column A. Prob(type=coupe|rating=A) = 0.5

  • This is a very clean and concise way to do it! Just requires one to learn how the crosstab function works. Very useful. – DataBach Dec 03 '21 at 12:08
-1

first, convert into a pandas dataframe. by doing so, you can take advantage of pandas' groupby methods.

collection = {"company": ["ford", "chevy", "ford", "ford", "ford", "toyota"],
              "model": ["mustang", "camaro", "fiesta", "focus", "taurus", "camry"],
              "rating": ["A", "B", "C", "A", "B", "B"],
              "type": ["coupe", "coupe", "sedan", "sedan", "sedan", "sedan"]}

df = pd.DataFrame(collection)

then, groupby based on events (ie rating).

df_s = df.groupby('rating')['type'].value_counts() / df.groupby('rating')['type'].count()
df_f = df_s.reset_index(name='cpt')
df_f.head()  # your conditional probability table
mynameisvinn
  • 341
  • 4
  • 10