
I just started learning Python. Here is a DataFrame:

import pandas as pd

a = pd.DataFrame({'A1': [0, 1, 2, 3, 2, 1, 6, 0, 1, 1, 7, 10]})

Now I think this data follows a multinomial distribution, so the 12 numbers are the frequencies of 12 categories (category 0, 1, 2, ...). For example, the occurrence of category 0 is 0. I hope to find all the parameters of the multinomial distribution given this data, i.e. the best probability for every category. For example,

category:    0,      1,     2,     3,      4...
weights:    0.001,  0.1,   0.2,   0.12,   0.2...

So I do not need test data to predict; this is not a classification problem. As a newcomer, I am not even sure whether I should use scipy.stats.multinomial, sklearn models, or some other technique. Could anyone give me some help?


1 Answer


Maximum Likelihood Estimation (MLE) is one of the most important procedures for obtaining point estimates of a distribution's parameters. This is what you need to start with.

Analytical Solution:

The multinomial distribution is an extension of the binomial distribution for which the MLE can be obtained analytically. Refer to this Math Stack Exchange post (MLE for Multinomial Distribution) for the full analytical solution. The procedure starts by defining a likelihood function L(p) conditioned on the observed data, where p and x are the probabilities and observed occurrences for the k classes/categories, i = 1, 2, ..., k. It is a measure of the likelihood of observing a set of observations x given the parameter set p:

L(p) = \frac{n!}{x_1! \, x_2! \cdots x_k!} \; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}

The main idea is to maximize the likelihood function over the range of parameters p. Given the total number of observations n (i.e. the sum of occurrences over all categories), the point estimates equal:

p_i = \frac{x_i}{n}

a.values/a.values.sum()                        # point estimates for p = x/n

# array([[0.        ], [0.02941176], [0.05882353], [0.08823529], 
#        [0.05882353], [0.02941176], [0.17647059], [0.        ], 
#        [0.02941176], [0.02941176], [0.20588235], [0.29411765]])
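
Since the question mentions scipy.stats.multinomial: it does not fit parameters for you, but once you have the point estimates you can plug them in to evaluate how likely the observed counts are under the fitted model. A minimal, self-contained sketch (the variable names here are mine):

import numpy as np
import pandas as pd
from scipy.stats import multinomial

a = pd.DataFrame({'A1': [0, 1, 2, 3, 2, 1, 6, 0, 1, 1, 7, 10]})
x = a['A1'].values
p_hat = x / x.sum()                  # analytical point estimates

# probability of observing exactly these counts under the fitted parameters
print(multinomial.pmf(x, n=x.sum(), p=p_hat))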

Numerical Solution:

The above results can also be obtained numerically using scipy.optimize.minimize. Notice that L(p) is a product of a factorial term and exponential terms. The factorial term is a constant that does not depend on the parameter values p, so it is dropped from the optimization. For the exponential terms, it is better to apply a log transformation, a common practice for MLE since log is a monotone increasing function; this simplifies the objective to \sum_i x_i \log p_i. Also, since scipy.optimize.minimize performs minimization, we use the negative of the log-transformed likelihood, -\sum_i x_i \log p_i, as the objective: maximizing a function is equivalent to minimizing its negative.

import pandas as pd
import numpy as np
import scipy.optimize as sciopt

# Bounds keep each parameter in (0, 1); an exact lower bound of 0
# is avoided because log(0) is -infinity
bnds = [(1e-15, 1) for _ in range(12)]

# Initial parameter values for the optimizer
init_parameters = np.asarray([0.1] * 12)

# Negative log-likelihood: -sum_i x_i * log(p_i)
neg_log_lik = lambda p: -np.sum(a['A1'].values * np.log(p))

# Constraint: sum(p) = 1
cons = {'type': 'eq', 'fun': lambda p: np.sum(p) - 1}

# Minimizing neg_log_lik subject to the bounds and constraint
results = sciopt.minimize(neg_log_lik, x0=init_parameters,
                          method='SLSQP', bounds=bnds, constraints=cons)

results.x                                    # point estimates for p

#   array([1.00000000e-15, 2.94179308e-02, 5.88243586e-02, 8.82394605e-02,
#          5.88243586e-02, 2.94059735e-02, 1.76454713e-01, 1.00000000e-15,
#          2.94134577e-02, 2.94135714e-02, 2.05849197e-01, 2.94156978e-01])
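
As a quick sanity check (continuing from the snippet above; the tolerance here is arbitrary), the numerical estimates agree with the analytical x/n solution up to optimizer precision:

# compare the optimizer's estimates with the analytical x/n solution
analytical = (a.values / a.values.sum()).ravel()
print(np.allclose(results.x, analytical, atol=1e-3))      # True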

Refer to the scipy.optimize.minimize documentation for details on the above implementation.
