0

I am trying to build a transitions matrix from Panel data observations in order to obtain the ML estimators of a weighted transitions matrix. A key step is obtaining the individual likelihood function for individuals. Say you have the following data frame:

ID          Feature1  Feature2  Transition
120421006   10000        1         ab
120421006   12000        0         ba
120421006   10000        1         ab
123884392    3000        1         ab
123884392    2000        0         ba
908747738    1000        1         ab

The idea is to return, for each agent, the log-likelihood of his path. For agent 120421006, for instance, this boils down to (ignoring the initial term)

LL = log(exp(Yab)/1 + exp(Yab)) + log(exp(Yba) /(1 + exp(Yba))) + log(exp(Yab)/1 + exp(Yab))

i.e,

log(exp(Y_transition)/(1 + exp(Y_transition)))

where Y_transition = xFeature1 + yFeature2 for that transition, and x and y are unknowns.

For example, for individual 120421006, this would boil down to an expression with three elements, since he transitions thrice, and the function would return

LL = log(exp(10000x + 1y)/ 1 + exp(10000x + 1y)) +

log(exp(12000x + 0y)/ 1 + exp(12000x + 0y)) +

log(exp(10000x + 1y)/ 1 + exp(10000x + 1y))

And here's the catch: I need x and y to return as unknowns, since the objective is to obtain a sum over the likelihoods of all individuals in order to pass it to an ML estimator. How would you automate a function that returns this output for all IDs?

Many thanks in advance

Arrebimbomalho
  • 176
  • 1
  • 12

2 Answers2

1

Create the function:

fun=function(x){
a=paste0("exp(",x[1],"*x","+",x[2],"*y)")
parse(text=paste("sum(",paste0("log(",a,"/(1+",a,"))"),")"))
}

by(test[2:3],test[,1],fun)

sum(log(exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)/(1 + 
    exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y))))
-------------------------------------------------------------------- 
sum(log(exp(c(3000, 2000) * x + c(1, 0) * y)/(1 + exp(c(3000, 
    2000) * x + c(1, 0) * y))))
-------------------------------------------------------------------- 
sum(log(exp(1000 * x + 1 * y)/(1 + exp(1000 * x + 1 * y))))

taking an example of x=0 and y=3 we can solve this:

x=0
y=3
sapply(by(test[2:3],test[,1],fun),eval)
[1] -0.79032188 -0.74173453 -0.04858735

in your example above:

x=0
y=3
 log(exp(10000*x + 1*y)/ (1 + exp(10000*x + 1*y))) +#There should be paranthesis
  log(exp(12000*x + 0*y)/ (1 + exp(12000*x + 0*y))) + 
  log(exp(10000*x + 1*y)/( 1 + exp(10000*x + 1*y)))
[1] -0.7903219

To get what you need within the comments:

fun1=function(x){
    a=paste0("exp(",x[1],"*x","+",x[2],"*y)")
    paste("sum(",paste0("log(",a,"/(1+",a,"))"),")")
    }

paste(by(test[2:3],test[,1],fun1),collapse = "+")
1] "sum( log(exp(c(10000, 12000, 10000)*x+c(1, 0, 1)*y)/(1+exp(c(10000, 12000, 10000)*x+c(1, 0, 1)*y))) )+sum( log(exp(c(3000, 2000)*x+c(1, 0)*y)/(1+exp(c(3000, 2000)*x+c(1, 0)*y))) )+sum( log(exp(1000*x+1*y)/(1+exp(1000*x+1*y))) )"

But this doesnt make sense why you would group them and then sum all of them. That is same as just summing them without grouping them using the ID which would be simpler and faster

Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • Gr8! I will probably export the function consisting on the sums of the likelihoods to either Java or C++, since the real dataset chokes R, although trying the optim.R function that has some iterative capabilities and algorithms is probably a good idea...BTW, do you know how to obtain a single function consisting on the sum of likelihoods with constants from your great code – Arrebimbomalho Feb 09 '18 at 19:57
  • Am sorry I don't understand what you mean – Onyambu Feb 09 '18 at 20:03
  • Well, your code in the first rectangle returns the expression I was looking for (the likelihood of the agents with x and y left as variables) and I was not wondering whether there is a way of getting it all into a big expression, like................................................................sum(log(exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)/(1 + exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)))) sum(log(exp(c(3000, 2000) * x + c(1, 0) * y)/(1 + exp(c(3000, 2000) * x + c(1, 0) * y)))) , i.e, the summed likelihoods of the agents but still with the variables x and y free. – Arrebimbomalho Feb 09 '18 at 20:22
  • so you want to sum all the likelihoods? – Onyambu Feb 09 '18 at 20:28
  • Kind of. I want to obtain an expression consisting of the sum of the individual likelihoods, but with and y still free. So if there is some way ofobtaining automatically the expression sum(log(exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)/(1 + exp(c(10000, 12000, 10000) * x + c(1, 0, 1) * y)))) + sum(log(exp(c(3000, 2000) * x + c(1, 0) * y)/(1 + exp(c(3000, 2000) * x + c(1, 0) * y)))) that would be great(this is really concatenating the output of your function above) – Arrebimbomalho Feb 09 '18 at 20:35
  • The idea is to then pass the function to a proper solver, using, say BFGS or conjugate Gradient in a language that can handle big data and big inputs... – Arrebimbomalho Feb 09 '18 at 20:36
  • I can do that! Just do not parse the equations at the end: – Onyambu Feb 09 '18 at 20:42
  • in the function remove the `parse(text=` part and return only the paste. Then on doing the by just collapse them with a `+` sign – Onyambu Feb 09 '18 at 20:49
1

First you have to decide how flexible your function has to be. I am leaving it fairly rigid, but you can alter it at your flavor.

First, you have to input the initial guess parameters, which you will supply in the optimizer. Then, declare your data and variables to be used in your estimation.

Assuming you will always have only 2 variables (you can change it later)

y <- function(initial_param, data, features){

  x = initial_param[1]
  y = initial_param[2]

  F1 = data[, features[1]]
  F2 = data[, features[2]]

  LL = log(exp(F1 * x + F2 * y) / (1 + exp(F1 * x + F2 * y)))

  return(-sum(LL))
}

This function returns the sum of minus the log likelihood, given that most optimizers try to find the parameters at which your function reaches a minimum, by default.

To find your parameters just supply the below function with your likelihood function y, the initial parameters, data set and a vector with the names of your variables:

nlm(f = y,  initial_param = your_starting_guess, data = your_data,
                  features = c("name_of_first_feature", "name_of_second_feature"), iterlim=1000, hessian=F)
Felipe Alvarenga
  • 2,572
  • 1
  • 17
  • 36
  • Thanks, Felipe! I just need the log likelihood and not to actually solve the thing, since the real dataset has a huge number of variables, individuals and transitions, and it might be better to transfer the thing to C and use a proper optimizer...if I ask for return(LL) in the last part of your code, it will return a list of loglikelihoods for the agents, right? (I cannot test the code right right now :) ) Then if I could get the sum over the elements of the list I would be set... – Arrebimbomalho Feb 09 '18 at 19:57
  • `sum(LL)` will return a list of likelihoods for each observation. You should then collapse that result into the agent level if you want to have the sum of the likelihood for each agent. – Felipe Alvarenga Feb 09 '18 at 20:14
  • Ok. But I mean if there is some way of getting just the expression of the likelihood for each agent without actually solving...the output should return LL = log(exp(F1 * x + F2 * y) / (1 + exp(F1 * x + F2 * y))) with the values F1 and F2 for eac agent, but x and y as unknowns...the solution step is for later once this step is scalable :) – Arrebimbomalho Feb 09 '18 at 20:18