NOTE : Below post is valid for Package version 1.4.0 and Before.
Estimating the shape parameters of Beta-Binomial Distribution
Introduction
I wrote a blog post earlier this month to understand the optimization functions in R and compare them. Here I am taking my time to go through one function at a time, only when there are more than one analytical method to use.
Today I will scrutinize the optim function which has six analytical methods. Setting the stage, we have the Beta-Binomial distribution and Binomial Outcome data, and we need to estimate proper shape parameters which would minimize the Negative Log Likelihood value or Maximize the Log Likelihood value of the Beta-Binomial distribution for the above Binomial Outcome data. In this case Alcohol Consumption data from the fitODBOD package will be used.
In this blog post we focus on the process time to optimization, estimated shape parameters, minimized Negative Log Likelihood value, expected frequencies, p-value and Over-dispersion in tables.
Below are the six analytical methods in concern
Nelder Mead
BFGS
CG
L-BFGS-B
SANN
Brent
Alcohol Consumption data has two sets of frequency values but only values from week 1 will be used. Below is the the Alcohol Consumption data, where number of observations is 399 and the Binomial Random variable is a vector of values from zero to seven.
Reading through the optim function brief from the previous post it will help the reader regarding operational questions of the function.
So for the initial parameters of a=0.1 and b=0.2 we will be finding estimated parameters from different analytical methods which would minimize the Negative Log Likelihood value of the Beta-Binomial distribution.
First we are transforming the given EstMLEBetaBin function to satisfy the optim function conditions.
# new function to facilitate optim criteria# only one input but has two elementsforoptim<-function(a){EstMLEBetaBin(x=Alcohol_data$Days, freq=Alcohol_data$week1,a=a[1],b=a[2])}
So the foroptim function can be used as above and parameters are estimated for \(\alpha\) and \(\beta\) (or a, b) for the Alcohol Consumption data week 1. Further the optim function can be scrutinized as below.
package : stats
No of Inputs: 7
Minimum required Inputs : 2
Class of output : list
No of outputs: 5
No of Analytical Methods : 6
Default Method : Nelder-Mead
Nelder and Mead method
Default analytical method is Nelder and Mead method. According to the documentation it uses only function values and is robust but relatively slow. It will work reasonably well for non-differential functions.
Reference : Nelder, J.A. and Mead, R., 1965. A simplex method for function minimization. The computer journal, 7(4), pp.308-313.
Below is the code of using optim function with Nelder and Mead analytical method.
# optimizing values for a,b using default analytical method or Nelder and MeadNM_answer<-optim(par=c(0.1,0.2),fn=foroptim)# the outputsNM_answer$par# estimated values for a, bNM_answer$value# minimized function value NM_answer$counts# see the documentation to understandNM_answer$convergence# indicates successful completionNM_answer$message# additional information# fitting the Beta-Binomial distribution with estimated shape parameter valuesfitBetaBin(Alcohol_data$Days,Alcohol_data$week1,NM_answer$par[1],NM_answer$par[2])
BFGS method
The documentation indicates that BFGS is a Quasi-Newton method (also known as a variable metric algorithm), specifically that published simultaneously in 1970 by Broyden, Fletcher, Goldfarb and Shanno. This uses function values and gradients to build up a picture of the surface to be optimized.
Reference : Broyden, C.G., 1967. Quasi-Newton methods and their application to function minimization. Mathematics of Computation, 21(99), pp.368-381.
Below is the code for using optim function with BFGS analytical method
# optimizing values for a,b using BFGS inputsBFGS_answer<-optim(par=c(0.1,0.2),fn=foroptim,method ="BFGS")# the outputsBFGS_answer$par# estimated values for a, bBFGS_answer$value# minimized function value BFGS_answer$counts# see the documentation to understandBFGS_answer$convergence# indicates successful completionBFGS_answer$message# additional information# fitting the Beta-Binomial distribution with estimated shape parameter valuesfitBetaBin(Alcohol_data$Days,Alcohol_data$week1,BFGS_answer$par[1],BFGS_answer$par[2])
CG method
The documentation indicates the Method CG is a conjugate gradients method based on that by Fletcher and Reeves (1964) (but with the option of Polak–Ribiere or Beale–Sorenson updates). Conjugate gradient methods will generally be more fragile than the BFGS method, but as they do not store a matrix they may be successful in much larger optimization problems.
Reference : Fletcher, R. and Reeves, C.M., 1964. Function minimization by conjugate gradients. The computer journal, 7(2), pp.149-154.
Using CG method with optim function is explained below
# optimizing values for a,b using CG inputsCG_answer<-optim(par=c(0.1,0.2),fn=foroptim,method ="CG")# the outputsCG_answer$par# estimated values for a, bCG_answer$value# minimized function value CG_answer$counts# see the documentation to understandCG_answer$convergence# indicates successful completionCG_answer$message# additional information# fitting the Beta-Binomial distribution with estimated shape parameter valuesfitBetaBin(Alcohol_data$Days,Alcohol_data$week1,CG_answer$par[1],CG_answer$par[2])
L-BFGS-B method
Method L-BFGS-B is that of Byrd et. al. (1995) which allows box constraints, that is each variable can be given a lower and/or upper bound. The initial value must satisfy the constraints. This uses a limited-memory modification of the BFGS quasi-Newton method. If non-trivial bounds are supplied, this method will be selected, with a warning.
Reference : Byrd, R.H., Lu, P., Nocedal, J. and Zhu, C., 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), pp.1190-1208.
Refer the below code chunk to under the L-BFGS-B method from optim function
# optimizing values for a,b using L-BFGS-B inputsL_BFGS_B_answer<-optim(par=c(0.1,0.2),fn=foroptim,method ="L-BFGS-B")# the outputsL_BFGS_B_answer$par# estimated values for a, bL_BFGS_B_answer$value# minimized function value L_BFGS_B_answer$counts# see the documentation to understandL_BFGS_B_answer$convergence# indicates successful completionL_BFGS_B_answer$message# additional information# fitting the Beta-Binomial distribution with estimated shape parameter valuesfitBetaBin(Alcohol_data$Days,Alcohol_data$week1,L_BFGS_B_answer$par[1],L_BFGS_B_answer$par[2])
SANN method
Method SANN is by default a variant of simulated annealing given in Belisle (1992). Simulated-annealing belongs to the class of stochastic global optimization methods. It uses only function values but is relatively slow. It will also work for non-differential functions. This implementation uses the Metropolis function for the acceptance probability.
By default the next candidate point is generated from a Gaussian Markov kernel with scale proportional to the actual temperature. If a function to generate a new candidate point is given, method SANN can also be used to solve combinatorial optimization problems. Temperatures are decreased according to the logarithmic cooling schedule as given in Belisle (1992, p.890); specifically, the temperature is set to \(temp / log(((t-1) %/% tmax)*tmax + exp(1))\), where \(t\) is the current iteration step and temp and tmax are specifiable via control.
Note that the SANN method depends critically on the settings of the control parameters. It is not a general-purpose method but can be very useful in getting to a good value on a very rough surface.
Reference : Belisle, C.J., 1992. Convergence theorems for a class of simulated annealing algorithms on R d. Journal of Applied Probability, 29(4), pp.885-895.
Below mentioned code chunk is simply using SANN method for optim function
# optimizing values for a,b using default inputsSANN_answer<-optim(par=c(0.1,0.2),fn=foroptim,method ="SANN")# the outputsSANN_answer$par# estimated values for a, bSANN_answer$value# minimized function value SANN_answer$counts# see the documentation to understandSANN_answer$convergence# indicates successful completionSANN_answer$message# additional information# fitting the Beta-Binomial distribution with estimated shape parameter valuesfitBetaBin(Alcohol_data$Days,Alcohol_data$week1,SANN_answer$par[1],SANN_answer$par[2])
Brent method
Brent Method is for one-dimensional problems only, using optimize(). It can be useful in cases where optim() is used inside other functions where only method can be specified, such as in mle from package stats4. Brent method does not work for our situation.
Reference : Brent, R.P., 2013. Algorithms for minimization without derivatives. Courier Corporation.
Summary of Time evaluation for different Analytical methods of optim function
Below considered table will compare the system process time for different analytical methods. In order to do this time comparison it is necessary to use the benchmark function of rbenchmark package. Below written code chunk provides the output in a table form which includes the analytical methods and their respective time values. The estimation process of the parameters where each method has been replicated 1000 times to receive a more accurate table for time values.
The table is in accordance to the elapsed time value column in the ascending order. According to this we can see that least time takes to the Nelder and Mead method and most time is taken to the SANN method. These times completely depends on the Negative Log Likelihood function you need to minimize, the data you provided, the number of estimators that needs to be estimated, the complexity of the function and finally computer’s processing power.
Summary of results after using the optim function for different analytical methods
After using the methods Nelder Mead, BFGS, CG, L-BFGS-B and SANN to estimate the shape parameters a, b we can use the estimated parameters in the function fitBetaBin. Using this function we can find expected frequencies for each of these analytical methods and compare p-values and over-dispersion. Further, understand if using different analytical methods had any effect on them.
According to the below table there is no significant changes between the expected frequencies, except while using SANN method. All five methods generate different Over-dispersion values after the first three decimal places. Negative Log Likelihood values and p values are same for all 5 methods until first three decimal places. This is a clear indication of it does not matter what analytical method we use the estimation will occur effectively but only efficiency will be affected.
BinomialRandomVariable
Frequency
NelderMead
BFGS
CG
LBFGSB
SANN
0
47
54.61
54.62
54.62
54.62
54.75
1
54
42
42
42
42
42.02
2
43
38.91
38.9
38.9
38.9
38.89
3
40
38.54
38.54
38.54
38.54
38.52
4
40
40.07
40.07
40.07
40.07
40.04
5
41
44
43.99
44
44
43.96
6
39
53.09
53.09
53.09
53.09
53.05
7
95
87.77
87.8
87.78
87.78
87.78
Total No of Observations
399
398.99
399.01
399
399
399.01
p-value
0.0902
0.0903
0.0901
0.0901
0.0901
Estimated a and b
a=0.7230707 b=0.5809894
a=0.7228930 b=0.5807279
a=0.7229414 b=0.5808477
a=0.7229432 b=0.5808496
a=0.7215669 b=0.5802982
Negative Log Likelihood
813.4571
813.4571
813.4571
813.4571
813.4573
Over Dispersion
0.4340165
0.4340992
0.4340675
0.4340668
0.4344303
Final Conclusion
We had 6 methods to compare but choosing one over the other is completely harmless to the final result of estimation as seen by our tables. And our situation forces us to not use the Brent method. The only issue is time, therefore I would recommend choose the best analytical method from the optim function based on your needs of output and research objective.