I see two main aspects of your problem to comment on.
1) Convergence of the two methods.
Your first method is convergent, but not to as tight a tolerance as you have set. Using the sup-norm of the iteration step, np.max(np.abs(x_new1 - x_new)), to compare the previous iterate to the current one shows that your first update mechanism converges at a tolerance of a little over 1e-14 (you can calculate the corresponding convergence level of phi). Your second method might not be convergent at all. The routine is being cut off by your maximum iteration bound, but even with 1000 iterations, the sup-norm under the second method never falls below 0.009, with iteration steps in general much smaller than under the first routine. Further, increasing the iteration bound decreases its performance. I would therefore attribute the accuracy of the second method more to your initial guess than to algorithm performance.
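To make the comparison concrete, here is a minimal sketch of the kind of loop I used to diagnose this, with `update` standing in for either of your two update mechanisms (the function name and the toy update at the end are placeholders, not part of your code):

```python
import numpy as np

def iterate(update, x0, tol=1e-12, max_iter=1000):
    """Generic fixed-point loop with a sup-norm stopping rule."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, max_iter + 1):
        x_new = update(x)                    # one step of your update mechanism
        step = np.max(np.abs(x_new - x))     # sup-norm of the iteration step
        x = x_new
        if step < tol:                       # converged to the requested tolerance
            break
    return x, k, step

# Toy contraction just to exercise the loop; substitute your own update rule.
x_star, n_iter, last_step = iterate(lambda x: 0.5 * x + 1.0, np.zeros(4))
```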
2) How to increase performance
There are two suggestions I would give right off the bat. First, think about how you are generating your initial guess. Newton-type optimization methods are notoriously sensitive to initial conditions. I'm not sure if your current guess has any theoretical underpinnings (right now, it looks like you just picked a value and repeated it for all parameters), but if not, think about some way to inform it better (perhaps a simplified solution method; OLS is sometimes used for initial values, but you would probably have to think of something else, given that your system is underdetermined). Second, your method of dealing with negative parameter values (x_new1[x_new1<0] = 0.001) is very ad hoc and does not help your algorithm. If the parameters should be non-negative, you are actually solving a constrained optimization problem, and that should be taken into account in your algorithm, especially in the gradient; a sketch of what that can look like follows below. For theoretical details on both initial-guess sensitivity of Newton-type algorithms and constrained optimization algorithms, any good textbook on nonlinear optimization/programming should help. A good choice would be Boyd and Vandenberghe's Convex Optimization, which Boyd provides for free from his website here. For implementation/code, it's hard to beat Numerical Recipes. Depending on which version you look at, it will give you code in C++, C, or Fortran, but you should be able to convert the algorithms to Python/NumPy fairly painlessly.
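As a rough sketch of treating non-negativity as a real constraint, here is what it might look like with SciPy's bounded L-BFGS-B solver (a different method from your Newton iteration); the quadratic objective `phi` below is only a runnable stand-in for whatever your code actually minimizes:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the objective: phi(x) = ||A @ x - b||^2, just to make the
# sketch runnable; replace phi/grad_phi with what your code actually computes.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))             # underdetermined toy system
b = rng.standard_normal(5)

def phi(x):
    r = A @ x - b
    return r @ r

def grad_phi(x):
    return 2.0 * A.T @ (A @ x - b)

x0 = np.full(8, 0.1)                        # initial guess (ideally informed, not arbitrary)
bounds = [(0.0, None)] * 8                  # non-negativity handled as a real constraint
res = minimize(phi, x0, jac=grad_phi, method='L-BFGS-B', bounds=bounds)
print(res.x)                                # every entry respects the bound by construction
```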
Concerning your last question about np.linalg.norm, I'm not sure which part of your code you're referring to.
EDIT
That being said, I'm not sure why you are doing any of this at all. The paper you are trying to use includes code that can be almost exactly replicated in Python. In the paper, the author is using the biconjugate gradient method, which he applies in (4.1.3) and (4.1.4) with the bicg function in Matlab. Rather than trying to write your own optimization routine, use the corresponding bicg method included in scipy; it can be imported with from scipy.sparse.linalg import bicg.
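A minimal usage sketch, with a small toy system standing in for the sparse matrix and right-hand side you would assemble from (4.1.3)/(4.1.4):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import bicg

# Toy sparse, square system standing in for the one built from the paper's
# equations; replace A and b with your actual operators.
A = csr_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

x, info = bicg(A, b)          # info == 0 signals successful convergence
print(x, info)
```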
As a general note on optimization routines: if you don't have to write them yourself, don't. For the most part, someone has already spent most of their life developing an efficient implementation of the method you need. Often, especially with Python, these are made freely available (with the exception of proprietary solvers). So, unless numerical methods are your thing, your time would likely be better spent figuring out the most efficient way to use those pre-written routines. In this example, that would include understanding the sparsity structure the author of your paper defined in equation (4.1.2). Without that, the paper just becomes a matter of calculating the solution with equation (3.2.3), which you could likely do (albeit inefficiently, given that you're not exploiting the sparsity structure) with np.linalg.solve. Or, since you have an underdetermined system, you could use scipy.optimize.nnls, which also has the benefit of keeping your values non-negative. There's a nice example of its usage here.
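For reference, nnls is about as simple to call as it gets; the small matrix below is only a placeholder for the system from equation (3.2.3):

```python
import numpy as np
from scipy.optimize import nnls

# Placeholder underdetermined system; substitute the matrix and right-hand
# side from the paper.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
b = np.array([2.0, 1.0])

x, rnorm = nnls(A, b)          # x >= 0 componentwise by construction
print(x, rnorm)
```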