<>=
library(emdbook)
library(plotrix)
library(bbmle)
library(MASS)
library(MCMCpack)
library(coda)
library(R2WinBUGS)
library(Hmisc)
source("chapskel.R")
@
\section*{Summary}
This chapter explores the technical methods required to find the quantities discussed in the previous chapter (maximum likelihood estimates, posterior means, and profile confidence limits). The first section covers methods of numerical optimization for finding MLEs and Bayesian posterior modes, the second section introduces Markov chain Monte Carlo, a general algorithm for finding posterior means and credible intervals, and the third section discusses methods for finding confidence intervals for quantities that are not parameters of a given model.
\section{Introduction}
Now we can think about the nitty-gritty details of fitting models to data. Remember that we're trying to find the parameters that give the maximum likelihood for the comparison between the fitted model(s) and the data. (From now on I will discuss the problem in terms of finding the minimum negative log-likelihood, although all the methods apply to finding maxima as well.) The first section focuses on methods for finding minima of curves and surfaces. These methods apply whether we are looking for maximum likelihood estimates, profile confidence limits, or Bayesian posterior modes (which are an important starting point in Bayesian analyses \citep{Gelman+1996}). Although there are many numerical minimization algorithms, I will only discuss the basic properties of a few common ones (most of which are built into \R), and their strengths and weaknesses. Many of these methods are discussed in more detail by \cite{Press+1994}.

The second section introduces \emph{Markov chain Monte Carlo} methods, which are the foundation of modern Bayesian analysis. MCMC methods feel a little bit like magic, but they follow simple rules that are not too hard to understand. The last section tackles a more specific but very common problem, that of finding confidence limits on a quantity that is not a parameter of the model being fitted. There are many ways to tackle this problem, varying in accuracy and difficulty. It's useful to have several in your toolbox, and learning about them also helps you gain a deeper understanding of the shapes of likelihood and posterior probability surfaces.
\section{Fitting methods}
\subsection{Brute force/direct search}
\newcommand{\pest}{p_{\mbox{est}}}
The simplest way to find a maximum (minimum) is to evaluate the function for a wide range of parameter values and see which one gives the best answer. In \R, you would make up a vector of parameter values to try (perhaps a vector for each of several parameters); use \code{sapply} (for a single parameter) or \code{for} loops to calculate and save the negative log-likelihood (or posterior [log-]likelihood) for each value; then use \code{which(x==min(x))} (or \code{which.min(x)}) to see which value of the parameters gave the minimum. (You may be able to use \code{outer} to evaluate a matrix of all combinations of two parameters, but you have to be careful to use a vectorized likelihood function.)
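For a single parameter the whole procedure takes only a few lines; a minimal sketch, assuming a one-parameter negative log-likelihood function \code{nllfun} (a hypothetical stand-in for the likelihood functions defined later in the chapter) and a plausible range:
<<eval=FALSE>>=
pvec = seq(0,10,by=0.1)       ## range and resolution of the search grid
nllvec = sapply(pvec,nllfun)  ## negative log-likelihood at each grid value
pvec[which.min(nllvec)]       ## parameter value giving the minimum
@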
\begin{figure}
<>=
x=c(1,1.5,2,2.5,3,4,5,6,10,15,20)
y=c(6,3.2,2,-3,3,2,0,-1,2,4,5)
y2 = spline(x,y,n=200)
plot(y2$x,y2$y,type="l",xlab="",ylab="",ylim=c(-3,15),
     axes=FALSE,xlim=c(0,20))
box()
axis(side=3,line=-11,at=seq(0,5,by=0.5),labels=rep("",11))
axis(side=3,line=-8,at=seq(0,20,by=5),labels=rep("",5))
axis(side=3,line=-5,at=seq(5,20,by=0.5),labels=rep("",31))
axis(side=3,line=-2,at=seq(0,20,by=1),labels=rep("",21))
mtext(side=3,line=-10,at=0,adj=0,"sampling grid 4")
mtext(side=3,line=-7,at=0,adj=0,"sampling grid 3")
mtext(side=3,line=-4,at=0,adj=0,"sampling grid 2")
mtext(side=3,line=-1,at=0,adj=0,"sampling grid 1")
mtext(side=3,line=-12,at=0,adj=0,expression(p[lower]))
mtext(side=3,line=-12,at=20,adj=1,expression(p[upper]))
mtext(side=3,line=-9,at=12.5,adj=0.5,expression(Delta*p))
yht <- 8.5
arrows(c(11.5,13.5),c(yht,yht),c(10,15),c(yht,yht),length=0.1,angle=20)
points(y2$x[which.min(y2$y)],min(y2$y),pch=1)
restr <- y2$x>5
xrestr <- y2$x[restr]
yrestr <- y2$y[restr]
points(xrestr[which.min(yrestr)],min(yrestr),pch=16)
#points(x,y)
@
\caption{Direct search grids for a hypothetical negative log-likelihood function. Grids~1 and~4 will eventually find the correct minimum (open point). Grids~2 and 3 will miss it, finding the false minimum (closed point) instead. Grid 2 misses because its range is too small; grid 3 misses because its resolution is too coarse.}
\label{fig:direct}
\end{figure}
The big problem with direct search is speed, or lack of it: the precision of your answer is limited by the resolution (grid size) and range of your search, and the number of function evaluations (and hence the time required) is the range divided by the grid size. Suppose you try all values between $\plower$ and $\pupper$ with a resolution $\Delta p$ (e.g. from 0 to 10 by steps of 0.1). Figure~\ref{fig:direct} shows a made-up example---%
somewhat pathological, but not much worse than some real likelihood surfaces I've tried to fit. Obviously, the point you're looking for must fall in the range you're sampling: sampling grid \#2 in the figure misses the real minimum by looking at too small a range. You can also miss a sharp, narrow minimum, even if you sample the right range, by using too large a $\Delta p$ --- sampling grid \#3 in Figure~\ref{fig:direct}. There are no simple rules for determining the range and $\Delta p$ to use. You must know the ecological meaning of your parameters well enough that you can guess at an appropriate order of magnitude to start with. For small numbers of parameters you can draw curves or contours of your results to double-check that nothing looks funny, but for larger models it's difficult to draw the appropriate surfaces. Furthermore, even if you use an appropriate sampling grid, you will only know the answer to within $\Delta p$. If you halve $\Delta p$, you double the number of values you have to evaluate. A good general strategy for direct search is to start with a fairly coarse grid (although not as coarse as sampling grid \#3 in Figure~\ref{fig:direct}), find the sub-region that contains the minimum, and then ``zoom in'' on that region by making both the range and $\Delta p$ smaller, as in sampling grid \#4. You can often achieve fairly good results this way, but almost always less efficiently than with one of the more sophisticated approaches covered in the rest of the chapter.
The advantages of direct search are (1) it's simple and (2) it's so dumb that it's hard to fool: provided you use a reasonable range and $\Delta p$, it won't be led astray by features like multiple minima or discontinuities that will confuse other, more sophisticated approaches. The real problem with direct search is that it's slow because it takes no advantage of the geometry of the surface. If it takes more than a few seconds to evaluate the likelihood for a particular set of parameters, or if you have many parameters (which leads to many \emph{many} combinations of parameters to evaluate), direct search won't be feasible. For example, to do direct search on the parameters of the Gamma-distributed myxomatosis data (ignoring the temporal variation), we would set the range and grid size for shape and scale: <<>>= data(MyxoTiter_sum) myxdat = subset(MyxoTiter_sum,grade==1) @ <<>>= gm = mean(myxdat$titer) cv = var(myxdat$titer)/mean(myxdat$titer) shape0=gm/cv scale0=cv @ In Chapter~\ref{chap:lik}, we used the method of moments to determine starting values of shape (\Sexpr{round(shape0,1)}) and scale (\Sexpr{round(scale0,2)}). We'll try shape parameters from 10 to 100 with $\Delta$ shape=1, and scale parameters from 0.01 to 0.3 with $\Delta$ scale=0.01. <<>>= shapevec = 10:100 scalevec = seq(0.01,0.3,by=0.01) @ Using the \code{gammaNLL1} negative log-likelihood function from p.~\pageref{gammalik}: <>= gammaNLL1 = function(shape,scale) { -sum(dgamma(myxdat$titer,shape=shape,scale=scale,log=TRUE)) } @ <<>>= surf = matrix(nrow=length(shapevec),ncol=length(scalevec)) for (i in 1:length(shapevec)) { for (j in 1:length(scalevec)) { surf[i,j] = gammaNLL1(shapevec[i],scalevec[j]) } } @ Draw the contour plot: <<>>= contour(shapevec,scalevec,log10(surf)) @ Or you can do this more automatically with the \code{curve3d} function from the \code{emdbook} package: <>= curve3d(log10(gammaNLL1(x,y)),from=c(10,0.01),to=c(100,0.3),n=c(91,30),sys3d="image") @ The \code{gridsearch2d} function (also in \code{emdbook}) will let you zoom in on a negative log-likelihood surface: <>= gridsearch2d(gammaNLL1,v1min=10,v2min=0.01,v1max=100,v2max=0.3,logz=TRUE) @ \subsection{Derivative-based methods} The opposite extreme from direct search is to make strong assumptions about the geometry of the likelihood surface: typically, that it is smooth (continuous with continuous first and second derivatives) and has only one minimum. Then at the minimum point the derivative is zero: the \emph{gradient}, the vector of the derivatives of the surface with respect to all the parameters, is a vector of all zeros. Most numerical optimization methods other than direct search use some variant of the criterion that the derivative must be close to zero at the minimum in order to decide when to stop. So-called \emph{derivative-based} methods also use information about the first and second derivatives to move quickly to the minimum. <>= ## run Newton's method on likelihood for ## binomial data data(ReedfrogPred) x = subset(ReedfrogPred,pred=="pred" & density==10 & size=="small") k = x$surv N = 10 binomlik = function(p,x=k) { -sum(dbinom(k,prob=p,size=N,log=TRUE)) } ## D(expression(k*log(p)+(N-k)*log(1-p)),"p") binomlik.deriv = function(p,x=k) { -sum(k/p-(N-k)/(1-p)) } ## D(D(expression(k*log(p)+(N-k)*log(1-p)),"p"),"p") binomlik.deriv2 = function(p,x=k) { -sum(-k/p^2-(N-k)/(1-p)^2) } newton = function(start=0.6,graphics=TRUE, tol=1e-4,maxit=20, dfun=binomlik.deriv, dfun2=binomlik.deriv2, ...) 
{
  resmat = matrix(nrow=maxit,ncol=5)
  i = 1
  ## initial values
  x.guess = start
  x.dval = dfun(x.guess,...)
  x.dval2 = dfun2(x.guess,...)
  while (abs(x.dval)>tol && i<maxit) {
    ## record iteration, guess, first and second derivatives, and NLL
    resmat[i,] = c(i,x.guess,x.dval,x.dval2,binomlik(x.guess))
    ## Newton step: extrapolate linearly to the root of the derivative
    x.guess = x.guess - x.dval/x.dval2
    x.dval = dfun(x.guess,...)
    x.dval2 = dfun2(x.guess,...)
    i = i+1
  }
  resmat[i,] = c(i,x.guess,x.dval,x.dval2,binomlik(x.guess))
  resmat = as.data.frame(resmat[1:i,,drop=FALSE])
  names(resmat) = c("iteration","guess","dval","dval2","NLL")
  resmat
}
## plot Newton iterates in (guess, derivative) space
plot.newton = function(n,n.cex=2.5,pt.cex=1,g.lty=3,...) {
  plot(n$guess,n$dval,type="n",...)
  abline(h=0)
  ## dotted lines join each guess to the derivative at that guess
  segments(n$guess,0,n$guess,n$dval,lty=g.lty,col="gray")
  ## solid lines "shoot" from the derivative back to the axis (next guess)
  nr = nrow(n)
  segments(n$guess[-nr],n$dval[-nr],n$guess[-1],rep(0,nr-1))
  points(n$guess,n$dval,pch=21,cex=n.cex,bg="white")
  points(n$guess,n$dval,pch=as.character(1:nr),cex=pt.cex)
}
@
\begin{figure}
\begin{center}
<>=
x0 = c(0,0.2,0.45,0.6,0.65,0.75,0.9,1)
y0 =c(-1,-0.5,0.039,0.29,0.41,0.53,0.6,0.7)
z = splinefun(x=x0,y=y0)
x1 = 0.7
y1 =z(x1)
y1d = z(x1,1)
zz = x1-y1/y1d
yint= y1-x1*y1d
dx = 0.1
z2 = y1-y1d*dx ## z(x=x1-dx)
z3 = y1+y1d*dx ## z(x=x1+dx)
@
<>=
op=par(lwd=2,bty="l",las=1,cex=1.5,mgp=c(2.5,1,0))
plot(0:1,0:1,type="n",ann=FALSE,axes=FALSE,
     xlim=c(0.2,0.9),ylim=c(0,0.8))
abline(h=0)
curve(z(x),add=TRUE)
abline(v=x1)
abline(a=yint,b=y1d)
points(x1,y1,pch=16)
segments(x1-0.2,y1,x1+0.1,y1,lty=3)
text(x1-0.21,y1,adj=1,expression(f(x[0])))
segments(c(x1-dx,x1+dx),c(z2,z2),c(x1+dx,x1+dx),c(z2,z3),col="gray")
text(x1+dx+0.01,z2+0.04,expression(f*minute(x[0])),adj=0)
mtext(side=1,at=x1,expression(x[0]),cex=1.5)
mtext(side=1,at=zz,expression(x[1]),cex=1.5)
points(zz,0)
par(xpd=NA)
arrows(zz,-0.02,x1,-0.02,code=3,length=0.1)
par(xpd=FALSE)
mtext(side=1,at=(zz+x1)/2,expression(f(x[0])/f*minute(x[0])),cex=1.5)
par(op)
@
\end{center}
\caption{Newton's method: schematic}
\label{fig:newton1}
\end{figure}
\begin{figure}
<>=
op=par(mfrow=c(2,1),lwd=2,bty="l",las=1,cex=1.5,mgp=c(2.5,1,0))
n0 = newton(start=0.6)
n1 = n0[1:3,]
par(mar=c(0,4,2,2)+0.1)
plot.newton(n1,xlim=c(0.58,0.8),ylim=c(-28,10),
            xlab="",axes=FALSE,n.cex=2.5,pt.cex=1,
            g.lty=3,
            ylab="Derivative of -L")
axis(side=2)
axis(side=1,at=seq(0.6,0.8,by=0.05),labels=rep("",5))
text(0.6,0.5,"start",pos=3)
box()
par(mar=c(4,4,0,2)+0.1)
pvec=seq(0.6,0.8,length=100)
par(xpd=NA)
plot(pvec,sapply(pvec,binomlik),xlab="p (probability of success per trial)",
     ylab="Negative log-likelihood",type="l",lwd=3,ylim=c(7,10),
     axes=FALSE,xlim=c(0.58,0.8))
par(xpd=FALSE)
axis(side=1)
axis(side=2,at=7:10)
box()
points(n1$guess,sapply(n1$guess,binomlik),pch=21,cex=2.5,bg="white")
points(n1$guess,sapply(n1$guess,binomlik),pch=as.character(1:nrow(n1)))
par(op)
@
\caption{Newton's method: Top: numbered circles represent sequential guesses for the parameter $p$ (starting from guess \#1 at 0.6); a dotted gray line joins the current guess with the value of the derivative for that value of the parameter; and solid lines ``shoot'' over to the horizontal axis to find the next guess for $p$. Bottom: Likelihood curve.}
\label{fig:newton2}
\end{figure}
The simplest derivative-based method is \emph{Newton's method}, also called the \emph{Newton-Raphson} method. Newton's method is a general algorithm for discovering the places where a function crosses zero, called its \emph{roots}. In general, if we have a function $f(x)$ and a starting guess $x_0$, we calculate the value $f(x_0)$ and the value of the derivative at $x_0$, $f'(x_0)$. Then we extrapolate linearly to try to find the root: $x_1=x_0-f(x_0)/f'(x_0)$ (Figure~\ref{fig:newton1}). We iterate this process until we reach a point where the absolute value of the function is ``small enough'' --- typically $10^{-6}$ or smaller.
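Written out by hand for the binomial example, the iteration looks like this (a minimal sketch; here the function whose root we want is the \emph{derivative} of the negative log-likelihood, so \code{binomlik.deriv} plays the role of $f$ and \code{binomlik.deriv2} the role of $f'$):
<<eval=FALSE>>=
p = 0.6                                         ## starting guess
for (i in 1:4) {
  p = p - binomlik.deriv(p)/binomlik.deriv2(p)  ## x1 = x0 - f(x0)/f'(x0)
  cat(i,p,binomlik.deriv(p),"\n")               ## new guess and its derivative
}
@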
While calculating the derivatives of the objective function analytically is the most efficient procedure, it is often convenient and sometimes necessary to approximate the derivatives numerically using finite differences: \begin{equation} \frac{d f(x)}{d x} = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x} \approx \frac{f(x+\Delta x)-f(x)}{\Delta x}, \qquad \mbox{for small } \Delta x \end{equation} \R's \code{optim} function uses finite differences by default, but it sometimes runs into trouble with both speed (calculating finite differences for an $n$-parameter model requires an additional $n$ function evaluations for each step) and stability. Calculating finite differences requires you to pick a $\Delta x$: \code{optim} uses $\Delta x=0.001$ by default, but you can change this with \code{control=list(ndeps=c(...))} within an \code{optim} or \code{mle2} call, where the dots stand for a vector of $\Delta x$ values, one for each parameter. You can also change the effective value of $\Delta x$ by changing the parameter scale, \code{control=list(parscale=c(...))}; $\Delta x_i$ is defined relative to the parameter scale, as \code{parscale[i]*ndeps[i]}. If $\Delta x$ is too large, the finite difference approximation will be poor; if it is too small, you may run into trouble with round-off error. In minimization problems, we actually want to find the root of the \emph{derivative} of the (negative) log-likelihood function, which means we need to find the second derivative of the objective function. That is, instead of taking $f(x)$ and calculating $f'(x)$ by differentiation or finite differencing to figure out the slope and project our next guess, Newton's method for minima takes $f'(x)$ and calculates $f''(x)$ (the curvature) to approximate where $f'(x)=0$. Using the binomial seed predation data from the last chapter and starting with a guess of $p=0.6$, Figure~\ref{fig:newton2} and the following table show how Newton's method converges quickly to $p=0.75$ (for clarity, the figure shows only the first three steps of the process): <>= n2 <- n0[,2:4] n2[,1] <- round(n2[,1],6) n2[,2:3] <- round(n2[,2:3],3) colnames(n2) <- c("Guess ($x$)","$f'(x)$","$f''(x)$") @ <>= latex(n2,file="",table.env=FALSE,title="") @ Newton's method is simple and converges quickly. The precision of the answer rapidly increases with additional iterations. It also generalizes easily to multiple parameters: just calculate the first and second partial derivatives with respect to all the parameters and use linear extrapolation to look for the root. However, if the initial guess is poor or if the likelihood surface has a funny shape, Newton's method can misbehave --- overshooting the right answer or oscillating around it. Various modifications of Newton's method mitigate some of these problems \citep{Press+1994}, and other methods called ``quasi-Newton'' methods use the general idea of calculating derivatives to iteratively approximate the root of the derivatives. The \emph{Broyden-Fletcher-Goldfarb-Shanno} (BFGS) algorithm built into \R's \code{optim} code is probably the most widespread quasi-Newton method. Use BFGS whenever you have a relatively well-behaved (i.e., smooth) likelihood surface, reasonable starting conditions, and efficiency is important. 
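To see how good (or bad) the finite-difference approximation discussed above is for a particular problem, you can compare it directly with the analytic derivative. A quick sketch for the binomial example, using \code{optim}'s default $\Delta x$ and the \code{binomlik} and \code{binomlik.deriv} functions defined above:
<<eval=FALSE>>=
dx = 0.001                            ## optim's default ndeps value
(binomlik(0.6+dx)-binomlik(0.6))/dx   ## forward finite difference at p=0.6
binomlik.deriv(0.6)                   ## analytic derivative, for comparison
@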
If you can calculate an analytical formula for the derivatives, write an \R\ function to compute it for a particular parameter vector, and supply it to \code{optim} via the \code{gr} argument (see the examples in \code{?optim}), you will avoid the finite difference calculations and get a faster and more stable solution.

As with all optimization methods, you \emph{must} be able to estimate reasonable starting parameter values. Sometimes a likelihood surface will become flat for really bad fits --- once the parameters are sufficiently far off the correct answer, changing them may make little difference in the goodness of fit. Since the log-likelihood will be nearly constant, its derivative will be nearly zero. Derivative-based methods that start from implausible values (or any optimization procedure that uses a ``flatness'' criterion to decide when to stop, including most of those built into \code{optim}) may find this worst-case scenario instead of the minimum you sought. More often, specifying ridiculous starting values will give infinite or \code{NA} values, which \R's optimization routines will choke on. Although most of the optimization routines can handle occasional \code{NA}s, the negative log-likelihood must be finite for the starting values. You should always test your negative log-likelihood functions at the proposed starting conditions to make sure they give finite answers; also try tweaking the parameters in the direction you think should improve the fit, and see if the negative log-likelihood decreases. If you get non-finite values (\code{Inf}, \code{NA}, or \code{NaN}), check that your parameters are really sensible. If you think they should be OK, check for \code{NA}s in your data, or see if you have made any numerical mistakes like dividing by zero, taking logarithms of zero or negative numbers, or exponentiating large numbers (\R\ thinks \code{exp(x)} is infinite for any $\text{\code{x}}>710$). Exponentiating negative numbers of large magnitude is not necessarily a problem, but if they ``underflow'' and become zero (\R\ thinks \code{exp(x)} is 0 for any $\text{\code{x}}<-746$) you may get errors if you divide by them or calculate a likelihood of a data value that has zero probability. Some log-likelihood functions contain terms like $x \log(x)$, which we can recognize should be zero but \R\ treats as \code{NaN}. You can use \code{if} or \code{ifelse} in your likelihood functions to work around special cases, for example \code{ifelse(x==0,0,x*log(x))}. If you have to, break down the \code{sum} in your negative log-likelihood function and see which particular data points are causing the problem (e.g. if \code{L} is a vector of negative log-likelihoods, try \code{which(!is.finite(L))}).

If your surface is \emph{not} smooth --- if it has discontinuities or if round-off error or noise makes it ``bumpy'' --- then derivative-based methods will work badly, particularly with finite differencing. When derivative-based methods hit a bump in the likelihood surface, they often project the next guess to be very far away, sometimes so far away that the negative log-likelihood calculation makes no sense (e.g. negative parameter values). In this case, you will need to try an optimization method that avoids derivatives.
\subsection{Derivative-free methods}
In between the brute force of direct search and the sometimes delicate derivative-based methods are \emph{derivative-free} methods, which use some information about the surface but do not rely on smoothness.
\subsubsection{One-dimensional algorithms}
One-dimensional minimization is easy because once you have bracketed a minimum (i.e., you can find two parameter values, one of which is above and one of which is below the parameter value that gives the minimum negative log-likelihood) you can always find the minimum by interpolation. \R's \code{optimize} function is a one-dimensional search algorithm that uses \emph{Brent's method}, which is a combination of \emph{golden-section search} and \emph{parabolic interpolation} \citep{Press+1994}. Golden-section search attempts to ``sandwich'' the minimum, based on the heights (negative log-likelihoods) of a few points; parabolic interpolation fits a quadratic function (a parabola) to three points at a time and extrapolates to the minimum of the parabola. If you have a one-dimensional problem (i.e. a one-parameter model), \code{optimize} can usually solve it quickly and precisely. The only potential drawback is that \code{optimize}, like \code{optim}, can't easily calculate confidence intervals. If you need confidence intervals, use \code{mle2} instead\footnote{\code{mle} and \code{mle2} use \code{method="BFGS"} by default. Nelder-Mead optimization (see below) is unreliable in one dimension and \R\ will warn you if you try to use it to optimize a single parameter.}.
\subsubsection{Nelder-Mead simplex}
\label{sec:NM}
The simplest and probably most widely used derivative-free minimization algorithm that works in multiple dimensions (it's \code{optim}'s default) is the \emph{Nelder-Mead simplex}, devised by Nelder and Mead in 1965
% and called the ``amoeba'' method
% by \cite{Press+1994} because it works by
% implementing rules that allow a cloud of points in
% parameter space to ``crawl'' toward the lowest point in
% a vaguely amoeboid fashion
\footnote{The Nelder-Mead simplex is completely unrelated to the simplex method in linear programming, which is a method for solving high-dimensional \emph{linear} optimization problems with constraints.}. Rather than starting with a single parameter combination (which you can think of as a point in $n$-dimensional parameter space), Nelder-Mead picks $n+1$ parameter combinations that form the vertices of an initial \emph{simplex}---the simplest shape possible in $n$ dimensions%
\footnote{However, you only need to specify a single starting point; \R\ automatically creates a simplex around your starting value.}. In two dimensions, a simplex is three points (each of which represents a pair of parameter values) forming a triangle; in three dimensions, a simplex is 4 points (each of which is a triplet of parameter values) forming a pyramid or tetrahedron; in higher dimensions, it's $n+1$ points, which we just call an $n$-dimensional simplex.
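In \R\ you never implement any of this machinery yourself; a minimal sketch of invoking the built-in routines, using the \code{binomlik} and \code{gammaNLL1} negative log-likelihood functions defined earlier (\code{optim} expects a function of a single parameter vector, hence the wrapper):
<<eval=FALSE>>=
optimize(binomlik,interval=c(0.01,0.99))  ## 1-D: Brent's method
gammaNLL2 = function(p) {                 ## vector-argument wrapper
  gammaNLL1(p[1],p[2])
}
optim(fn=gammaNLL2,par=c(shape=20,scale=0.05)) ## n-D: Nelder-Mead by default
@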
The Nelder-Mead algorithm then evaluates the likelihood at each vertex, which is the ``height'' of the surface at that point, and moves the worst point in the simplex according to a simple set of rules:
\begin{itemize}
\item{start by going in what seems to be the best direction by reflecting the high (worst) point in the simplex through the face opposite it;}
\item{if the goodness-of-fit at the new point is better than the best (lowest) other point in the simplex, double the length of the jump in that direction;}
\item{if this jump was bad---the height at the new point is worse than the second-worst point in the simplex---then try a point that's only half as far out as the initial try;}
\item{if this second try, closer to the original, is also bad, then contract the simplex around the current best (lowest) point [not shown in Figure~\ref{fig:nm}].}
\end{itemize}
The Nelder-Mead algorithm works well in a wide variety of situations, although it's not foolproof (nothing is) and it's not particularly efficient.
\begin{figure}
\begin{center}
\includegraphics{N-M}
\end{center}
\caption{Graphical illustration (after \cite{Press+1994}) of the Nelder-Mead simplex rules applied to a tetrahedron (a 3-dimensional simplex, used for a 3-parameter model).}
\label{fig:nm}
\end{figure}
<>=
data(MyxoTiter_sum)
myxdat = subset(MyxoTiter_sum,grade==1)
gammaNLL1 = function(shape,scale) {
  -sum(dgamma(myxdat$titer,shape=shape,scale=scale,log=TRUE))
}
gammaNLL2 = function(p) {
  -sum(dgamma(myxdat$titer,shape=p[1],scale=p[2],log=TRUE))
}
myxsurf <- curve3d(gammaNLL2(c(x,y)),from=c(15,0.02),to=c(60,0.3),
                   n=c(61,61),sys3d="none")
source("amoeba.R")
plotsimplex.2d <- function(x,cols=1:5,lty,...) {
  v <- matrix(x[1:9],byrow=TRUE,ncol=3)[,-3]
  code <- x[10]
  lines(rbind(v,v[1,]),col=cols[code+1],lty=ltyvec[code+1],...)
}
## 0: starting value
## 1: reflection
## 2: reflection + expansion
## 3: one-dimensional contraction
## 4: contract around lowest point
@
\begin{figure}
<>=
op=par(lwd=2,bty="l",las=1,cex=1.5)
amres <- amoeba(gammaNLL2,c(20,0.05),info=TRUE)
ainfo = amres$info
##ainfo[,c(1,4,7)] = log10(ainfo[,c(1,4,7)])
contour(myxsurf$x,myxsurf$y,myxsurf$z,drawlabels=FALSE,col="gray",
        levels=seq(0,7000,by=50),xlab="",ylab="scale")
mtext(side=1,at=40,"shape",line=2,cex=1.5)
colvec <- rep(1,4) ## gray(c(0,0,0.5,0.7,0.9))
ltyvec <- c(1,1,2,3,4)
invisible(apply(ainfo,1,plotsimplex.2d,cols=colvec,lwd=2))
points(20,0.05,pch=21,bg="white")
text(20,0.05,pos=1,"start")
points(amres$estimate[1],amres$estimate[2],pch=21,bg="white")
text(amres$estimate[1],amres$estimate[2],pos=1,"end")
legend("bottomright",
       legend=c("reflection","reflect+expand","contract"),
       lty=ltyvec[2:4],col=colvec[2:4],
       bty="n",lwd=2,cex=0.7) ## bg="white",
par(op)
@
\caption{Track of Nelder-Mead simplex for the Gamma model of the myxomatosis titer data. Triangles indicating some moves are obscured by subsequent moves.}
\label{fig:nm2}
\end{figure}
We give the Nelder-Mead algorithm a set of starting parameter values
%(shape=20,scale=0.05)
%as starting coordinates
and it displaces these coordinates slightly to get its starting simplex. Thereafter, it takes steps alternating between simple reflection and expanded reflection, moving rapidly downhill across the contour lines and increasing both shape and scale parameters. Eventually it finds that it has gone too far, alternating reflections and contractions to ``turn the corner''.
Once it has turned, it proceeds very rapidly down the contour line, alternating reflections again; after a total of \Sexpr{amres$iterations} cycles the surface is flat enough for the algorithm to conclude that it has reached a minimum. <>= contour(myxsurf$x,myxsurf$y,myxsurf$z,drawlabels=FALSE,col="gray", levels=seq(0,2200,by=50),xlim=c(46,51),ylim=c(0.135,0.155)) invisible(apply(ainfo,1,plotsimplex.2d,cols=colvec,lwd=2)) @ Nelder-Mead can be considerably slower than derivative-based methods, but it is less sensitive to discontinuities or noise in the likelihood surface, since it doesn't try to use fine-scale derivative information to navigate across the likelihood surface. % \subsubsection{Other options} % There are other derivative-free functions: for example, \emph{Powell's method} % works by minimizing along a single line ``transect'' at a time, using the % information gained from previous line searches to pick the next transect % to try \citep{Press+1994}; however, Powell's method isn't built into \R. \subsection{Stochastic global optimization: simulated annealing} \label{sec:simann} Stochastic global optimizers are a final class of optimization techniques, even more robust than the Nelder-Mead simplex and even slower. They are \emph{global} because unlike most other optimization methods they may be able to find the right answer even when the likelihood surface has more than one local minimum (Figure~\ref{fig:direct}). They are stochastic because they rely on adding random noise to the surface as a way of avoiding being trapped at one particular minimum. The classic stochastic optimization algorithm is the \emph{Metropolis algorithm}, or \emph{simulated annealing} \citep{Kirkpatrick+1983,Press+1994}. The physical analogy behind simulated annealing is that gradually cooling a molten metal or glass allows it to form a solid with few defects. Starting with a crude (``hot'') solution to a problem and gradually refining the solution allows us to find the global minimum of a surface even when it has multiple minima. The rules of simulated annealing are: \begin{itemize} \item{pick a starting point (set of parameters) and calculate the negative log-likelihood for those parameters;} \item{until your answer is good enough or you run out of time: \begin{itemize} \item{\textbf{A}. pick a new point (set of parameters) at random (somewhere near your old point);} \item{calculate the value of the negative log-likelihood there} \item{if the new value is better than the old negative log-likelihood, accept it and return to \textbf{A}} \item{if it's worse than the old value, calculate the difference in negative log-likelihood $\Delta (-L)=-L_{\mbox{new}}-(-L_{\mbox{old}})$. Pick a random number between 0 and 1 and accept the new value if the random number is less than $e^{-\Delta (-L)/k}$, where $k$ is a constant called the \emph{temperature}. Otherwise, keep the previous value. The higher the temperature and the smaller $\Delta (-L)$ (i.e., the less bad the new fit), the more likely you are to accept the new value. In mathematical terms, the acceptance rule is \begin{equation} \mbox{Prob}(\mbox{accept}) = \begin{cases} e^{-\frac{\Delta (-L)}{k}} & \mbox{if } \Delta (-L) > 0 \\ 1 & \mbox{if } \Delta (-L) < 0. 
\end{cases}
\label{eq:simann}
\end{equation}
}
\item{Return to \textbf{A} and repeat.}
\end{itemize}
}
\item{Periodically (e.g., every 100 steps) lower the value of $k$ to make it harder and harder to accept bad moves.}
\end{itemize}
One variant of simulated annealing is available in \R\ as the \code{SANN} method for \code{optim} or \code{mle2}.
<>=
set.seed(1002)
metres <- metropSB(gammaNLL2,c(20,0.05),retvals=TRUE,retfreq=1,
                   nmax=2500)
@
\begin{figure}
<>=
op=par(bty="l",las=1,cex=1.5,mgp=c(2.5,1,0))
myxsurf2 <- curve3d(gammaNLL2(c(x,y)),from=c(20,0.05),to=c(75,0.3),
                    n=c(61,61),sys3d="contour",levels=seq(0,2200,by=50),
                    drawlabels=FALSE,col="gray",
                    xlab="shape",ylab="")
mtext(side=2,at=0.17,"scale",line=3,cex=1.5,las=0)
points(metres$retvals[,"p1"],metres$retvals[,"p2"],
       col=c("gray","black")[metres$retvals[,"accept"]+1],
       cex=0.3)
tvec <- c(1,100,150,300,1000,1500,2000,2500)
n <- length(tvec)
x <- metres$retvals[tvec,"p1"]
y <- metres$retvals[tvec,"p2"]
symbols(x,y,rectangles=cbind(rep(2,n),rep(1,n)),
        inches=0.3,add=TRUE,
        bg="white")
points(metres$estimate[1],metres$estimate[2],pch=21,
       bg="white")
text(x,y,tvec,cex=0.5)
@
\caption{Track of Metropolis-Szymura-Barton evaluations. The MSB algorithm starts at (20,0.05) (step 1), and moves quickly up to the central valley, but then wanders aimlessly back and forth along the valley.}
\label{fig:msb1}
\end{figure}
<>=
## looking at contours of metrop output
k1 = kde2d(log10(metres$retvals[,"p1"]),metres$retvals[,"p2"],n=50)
contour(k1$x,k1$y,k1$z)
@
<>=
firstmin = which.min(metres$retvals[,"minval"])
@
\label{msb}
Another variant of the Metropolis algorithm \citep[Metropolis-Szymura-Barton, MSB, \code{metropSB} in \code{emdbook}:][]%
{SzymuraBarton1986} varies the size of the change in parameters (the scale of the \emph{candidate distribution} or \emph{jump size}) rather than the temperature, and changes the jump size adaptively rather than according to a fixed schedule. Every successful jump increases the jump size, while every unsuccessful jump decreases the jump size. This makes the algorithm good at exploring lots of local minima (every time it gets into a valley, it starts trying to get out) but really bad at refining estimates (it has a hard time actually getting all the way to the bottom of a valley). To run MSB on the myxomatosis data:
<>=
MSBfit = metropSB(fn=gammaNLL2,start=c(20,0.05),nmax=2500)
@
Figure~\ref{fig:msb1} shows a snapshot of where the MSB algorithm goes on our now-familiar likelihood surface for the myxomatosis data, with unsuccessful jumps marked in gray and successful jumps marked in black. The MSB algorithm quickly moves ``downhill'' from its starting point to the central valley, but then drifts aimlessly back and forth along the central valley. It does find a point close to the minimum. After \Sexpr{firstmin} steps, it finds a minimum of \Sexpr{round(metres$minimum,5)}, equal for all practical purposes to the Nelder-Mead simplex value of \Sexpr{round(amres$minimum,5)} --- %
but Nelder-Mead took only \Sexpr{amres$function.calls} function evaluations to get there. Since MSB increases its jump scale when it is successful, and since it is willing to take small uphill steps, it doesn't stay near the minimum. While it always remembers the best point it has found so far, it will wander indefinitely looking for a better solution. In this case it didn't find anything better by the time I stopped it at 2,500 iterations.
\begin{figure}
<>=
ivec = seq(1,nrow(metres$retval),by=25)
M <- metres$retval[ivec,]
op=par(mfrow=c(2,2),cex=1.5,mgp=c(2.5,1,0),las=1)
par(mar=c(0,4.2,2.2,0))
matplot(ivec,M[,c(1,3)],type="l",col=1,lty=c(1,3),
        xlab="",ylab="",lwd=c(1,2),axes=FALSE)
box()
axis(side=2)
text(corner.loc(1,1),"shape",adj=1)
par(mar=c(0,0,2.2,4.2))
matplot(ivec,M[,c(2,4)],type="l",col=1,lty=c(1,3),ylab="",
        xlab="",lwd=c(1,2),axes=FALSE)
box()
axis(side=4)
text(corner.loc(-1,1),"scale",adj=0)
#par(mar=c(3.5,3.2,1,2)+0.1)
par(mar=c(5,4.2,0,0))
plot(ivec,M[,5]/mean(M[,5]),type="l",ylab="",
     xlab="Iterations",ylim=c(0,1.7),axes=FALSE)
axis(side=1,at=c(0,1000,2000))
axis(side=2,at=c(0,0.5,1.0))
box()
frac.accept = filter(metres$retval[,9], rep(1, 40), sides = 1)/40
frac.accept = frac.accept[ivec]
lines(ivec,frac.accept,lty=2)
text(corner.loc(1,1,yoff=0.12),"relative\njump size",adj=1)
text(corner.loc(1,-1,yoff=0.12),"fraction\naccepted",adj=1)
par(mar=c(5,0,0,4.2))
matplot(ivec,M[,c(7,8)],type="l",col=1,lty=c(1,3),ylab="",
        xlab="Iterations",lwd=c(1,2),axes=FALSE,ylim=c(36,50))
axis(side=1,at=c(0,1000,2000))
axis(side=4,at=seq(0,1500,by=500))
text(corner.loc(1,1,yoff=0.14),"negative\nlog likelihood",adj=1)
box()
par(op)
@
\caption{History of MSB evaluations: parameters (shape and scale), relative jump size and fraction of jumps accepted, and current and minimum negative log-likelihood. The minimum negative log-likelihood is achieved after \Sexpr{firstmin} steps; thereafter the algorithm remembers its best previous achievement (horizontal dotted line), but fails to improve on it.}
\label{fig:msb2}
\end{figure}
Figure~\ref{fig:msb2} shows some statistics on MSB algorithm performance as the number of iterations increases. The top two panels show the values of the two parameters (shape and scale), and the best-fit parameters so far. Both of the parameters adjust quickly in the first 500 iterations, but from there they wander around without improving the fit. The third panel shows a scaled version of the jump-width parameter, which increases initially and then varies around 1.0, and the running average of the fraction of jumps accepted, which rapidly converges to a value around 0.5. The fourth and final panel shows the achieved value of the negative log-likelihood: almost all of the gains occur early. The MSB algorithm is inefficient for this problem, but it can be a lifesaver when your likelihood surface is complex and you have the patience to use brute force.

There are many other stochastic global optimization algorithms. For example, \cite{Press+1994} suggest a hybrid of simulated annealing and the Nelder-Mead simplex where the vertices of the simplex are perturbed randomly but with decreasing amplitudes of noise over time. Other researchers have suggested using a stochastic algorithm to find the right peak and finishing with a local algorithm (Nelder-Mead or derivative-based) to get a more precise answer. Various adaptive stochastic algorithms \cite[e.g.][]{Ingber1996} attempt to ``tune'' either the temperature or the jump size and distribution for better results. Methods like genetic algorithms or differential evolution use many points moving around the likelihood surface in parallel, rather than a single point as in simulated annealing. If you need stochastic global optimization, you will probably need a lot of computer time (many function evaluations are required) and you will almost certainly need to tune the parameters of whatever algorithm you choose rather than using the default values.
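If you want to experiment without installing anything extra, the \code{SANN} method mentioned above is built into \code{optim}. A minimal sketch (\code{SANN} ignores the usual convergence tolerances and simply runs for \code{maxit} steps, so the run length and the initial temperature \code{temp} are tuning choices, not defaults to trust):
<<eval=FALSE>>=
optim(fn=gammaNLL2,par=c(20,0.05),method="SANN",
      control=list(maxit=10000,temp=10)) ## run length and starting temperature
@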
\section{Markov chain Monte Carlo}
Bayesians are normally interested in finding the means of the posterior distribution rather than the maximum likelihood estimate (or the analogous Bayesian quantity, the mode of the posterior distribution). Previous chapters suggested that you can use WinBUGS to compute posterior distributions, but gave few details. \emph{Markov chain Monte Carlo} (MCMC) is an extremely clever, general approach that uses stochastic jumps in parameter space to find the distribution. MCMC is similar to simulated annealing in the way it picks new parameter values sequentially but randomly. The main difference is that MCMC's goal is not to find the best parameter combination (mode/MLE) but to sample from the posterior distribution. Like simulated annealing, MCMC has many variants that use different rules for picking new parameter values (i.e., different kinds of candidate distributions) and for deciding whether or not to accept the new choice. However, all variants of MCMC must satisfy the fundamental rule that the ratio of the successful jump probabilities ($P_{\mbox{jump}} \times P_{\mbox{accept}}$) between any two points equals the ratio of their posterior probabilities:
\begin{equation}
\frac{\mbox{Post}(A)}{\mbox{Post}(B)} = %
\frac{P(\mbox{jump } B \to A) P(\mbox{accept }A|B)}%
{P(\mbox{jump } A \to B) P(\mbox{accept }B|A)}
\label{eq:mcmccrit}
\end{equation}
If we follow this rule (and if several other technical criteria are satisfied\footnote{The chain must be \emph{irreducible} (it must be possible eventually to move from any point in parameter space to any other) and \emph{aperiodic} (it should be impossible for it to get stuck in a loop).}), in the long run the chain will spend a lot of time occupying areas with high probability and will visit (but not spend much time in) areas with low probability, so that the long-term distribution of the sampled points will match the posterior probability distribution.
\subsection{Metropolis-Hastings}
The \emph{Metropolis-Hastings} MCMC updating rule is very similar to the simulated annealing rules discussed above, except that the temperature does not decrease over time to make the algorithm increasingly picky about accepting uphill moves. The Metropolis updating rule defined above for simulated annealing (p.~\pageref{sec:simann}) can use any \emph{symmetric} candidate distribution ($P(\mbox{jump }B\to A) = P(\mbox{jump }A\to B)$). For example, the MSB algorithm (p.~\pageref{msb}) picks values in a uniform distribution around the current set of parameters. The critical part of the Metropolis algorithm is the acceptance rule, which is the simulated annealing rule (eq.~\ref{eq:simann}) with the temperature parameter $k$ set to 1 and the posterior probability substituted for the likelihood\footnote{In the simulated annealing rule we exponentiated $-1/k$ times the difference in negative log-likelihoods, which gave us the likelihood ratio raised to the power $1/k$; if we set $k=1$ then we have $\lik_{\mbox{new}}/\lik_{\mbox{old}}$, which corresponds to $\mbox{Post}(B)/\mbox{Post}(A)$ when the prior is flat.}.
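To see how little machinery the Metropolis rule needs, here is a bare-bones sampler for the myxomatosis model. This is a minimal sketch: the flat prior (so the log-posterior is just the positive log-likelihood), the Gaussian candidate distribution, and its scales \code{sd} are all arbitrary choices, and \code{metrop} is a hypothetical helper, not a library function.
<<eval=FALSE>>=
metrop = function(logpost,start,nsteps,sd) {
  chain = matrix(NA,nrow=nsteps,ncol=length(start))
  cur = start
  curlp = logpost(cur)
  for (i in 1:nsteps) {
    prop = rnorm(length(cur),mean=cur,sd=sd)  ## symmetric Gaussian candidate
    proplp = logpost(prop)
    ## Metropolis rule: accept with probability min(1,Post(new)/Post(old));
    ## the is.na() check rejects proposals with undefined likelihood
    ## (e.g. negative shape parameters)
    if (!is.na(proplp) && log(runif(1)) < proplp-curlp) {
      cur = prop
      curlp = proplp
    }
    chain[i,] = cur
  }
  chain
}
chain = metrop(function(p) -gammaNLL2(p),start=c(20,0.05),
               nsteps=5000,sd=c(1,0.01))
@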
The Metropolis-Hastings rule generalizes the Metropolis rule by multiplying the acceptance probability by the ratio of the jump probabilities in each direction, $P(\mbox{jump }B\to A)/P(\mbox{jump }A\to B)$:
\begin{equation}
P(\mbox{accept }B|A)  =
\min \left(1,\frac{\mbox{\small Post}(B)}{\mbox{\small Post}(A)} \cdot
\frac{P(\mbox{\small jump }B\to A)}{P(\mbox{\small jump }A\to B)} \right)
\label{eq:metrophast}
\end{equation}
This equation reduces to the Metropolis rule for symmetric distributions but allows for asymmetric candidate distributions, which is particularly useful if you need to adjust candidate distributions so that a parameter does not become negative. As in simulated annealing, if a new set of parameters has a higher posterior probability than the previous parameters (weighted by the asymmetry in the probability of moving between the parameter sets), then the ratio in (\ref{eq:metrophast}) is greater than 1 and we accept the new parameters with probability 1. If the new set has a lower posterior probability (weighted by jump probabilities), we accept them with a probability equal to the weighted ratio. If you work this out for $P(\mbox{accept }A|B)$ in a similar way, you'll see that the rule fits the basic MCMC criterion (\ref{eq:mcmccrit}). In fact, in the MSB example above the acceptance probability was set equal to the ratio of the likelihoods of the new and old parameter values (the \code{scale} parameter in \code{metropSB} was left at its default value of 1), so that analysis also satisfied the Metropolis-Hastings rule (\ref{eq:metrophast}). Since it used negative log-likelihoods rather than incorporating an explicit prior to compute posterior probabilities, it assumed a completely flat prior (which can be dangerous, leading to unstable estimates or slow convergence, but seems to have been OK in this case).

The \code{MCMCpack} package provides another way to run a Metropolis-Hastings chain in \R. Given a function that computes the log posterior density (if the prior is completely flat, this is just the (\emph{positive}) log-likelihood function), the \code{MCMCmetrop1R} function first uses \code{optim} to find the posterior mode, then uses the approximate variance-covariance matrix at the mode to scale a multivariate normal candidate distribution, then runs a Metropolis-Hastings chain based on this candidate distribution. For example:
<>=
gammaNLL2B = function(p) {
  sum(dgamma(myxdat$titer,shape=p[1],scale=p[2],log=TRUE))
}
m3 <- MCMCmetrop1R(gammaNLL2B,theta.init=c(shape=20,scale=0.05),
                   thin=30,mcmc=30000,
                   optim.lower=rep(0.004,2),
                   optim.method="L-BFGS-B",tune=3)
colnames(m3) = c("shape","scale")
@
<>=
curve3d(gammaNLL2B(c(x,y)),from=c(20,0.05),to=c(70,0.3),
        sys3d="contour",levels=seq(-200,0,by=10))
## since shape*scale = x*y approx constant, try with
## mean = x' = x*y , var = y'=x*y^2
## so x = x'^2/y', y = y'/x'
## shape (20,0.05), scale (70,0.3)
## means mean (1,21), var (0.05,6.3)
curve3d(sum(dgamma(myxdat$titer,shape=x^2/y,scale=y/x,log=TRUE)),
        ## from=c(4,0.1),to=c(10,21),
        from=c(1,0.05),to=c(21,6.3),
        sys3d="contour",levels=seq(-200,0,by=10))
@
When I initially ran this analysis with the default value of \code{tune=1} and used \code{plot(m3)} to view the results, I saw that the chain took long excursions to extreme values. Inspecting the contour plot of the surface and likelihood slices (\code{?calcslice} from the \code{emdbook} package) didn't suggest that there was another minimum that the chain was visiting during these excursions.
The authors of the package suggested that \code{MCMCmetrop1R} was having trouble because of the banana-shape of the posterior density (Figure~\ref{fig:msb1}), and that increasing the \code{tune} parameter, which increases the scale of the candidate distribution, would help\footnote{They specifically suggested: \begin{enumerate} \item set the tuning parameter much larger than normal so that the acceptance rate is actually below the usual 20-25\% rule of thumb. This will fatten and lengthen the proposal distribution so that one can jump from one tail to the other. \item forego the proposal distribution based on the large sample var-cov matrix. Set the \code{V} parameter in \code{MCMCmetrop1R} to something that will work reasonably well over the entire parameter space. \item use an MCMC algorithm other than the random walk metropolis algorithm. You'll need to use something other than \code{MCMCmetrop1R} to do this but this option will be the most computationally efficient. \end{enumerate} }. Setting \code{tune=3} seems to be enough to make the chains behave better. (Increasing \code{tune} still more would make the Metropolis sampling less efficient.) Another option, which might take more thinking, would be to transform the parameters to make the likelihood surface closer to quadratic, which would make a multivariate Normal candidate distribution a better fit. Since the likelihood contours approximately follow lines of constant mean ($\mbox{shape} \cdot \mbox{scale}$: Figure~\ref{fig:nm2}), changing the parameterization from $\{\mbox{shape},\mbox{scale}\}$ to $\{\mbox{mean},\mbox{variance}\}$ makes the surface approximately quadratic and should make \code{MCMCmetrop1R} behave better. The \code{colnames} command sets the parameter names, which are helpful when looking at \code{summary(m3)} or \code{plot(m3)} since \code{MCMCmetrop1R} doesn't set the names itself. \subsection{Burn-in and convergence} Metropolis-Hastings updating, and any other MCMC rule that satisfies (\ref{eq:mcmccrit}), is guaranteed to reach the posterior distribution eventually, but usually we have to discard the iterations from a \emph{burn-in} period before the distribution converges to the posterior distribution. For example, during the first 300 steps in the MSB optimization above (Figures~\ref{fig:msb1} and \ref{fig:msb2}) the algorithm approaches the minimum from its starting points, and bounces around the minimum thereafter. Treating this analysis as an MCMC, we would drop the first 300 steps (or 500 to be safe) and focus on the rest of the data set. Assessing convergence is simple for such a simple model but can be difficult in general. Bayesian analysts have developed many \emph{convergence diagnostics}, but you probably only need to know about a few. The \emph{Raftery-Lewis} (RL) diagnostic \cite[\code{raftery.diag} in the \code{coda} package]{RafteryLewis1996} takes a pilot run of an MCMC and estimates, based on the variability in the parameters, how long the burn-in period should be and how many samples you need to estimate the parameters to within a certain accuracy. The parameters for the Raftery-Lewis diagnostic are the quantile that you want to estimate (2.5\% by default, i.e. the standard two-sided tails of the posterior distribution), the accuracy with which you want to estimate the quantile ($\pm 0.005$ by default), and the desired probability that the quantile is in the desired range (default 0.95). 
For the MSB/myxomatosis example above, the Raftery-Lewis diagnostic with the default accuracy of $r=0.005$ indicated that the pilot run of 2500 samples was not even long enough to estimate how long the chain should be, so I relaxed the accuracy to $r=0.01$:
<>=
m1 <- mcmc(metres$retvals[,1:2])
raftery.diag(m1,r=0.01)
@
The first column gives the estimated burn-in time for each parameter --- take the maximum of these values as your burn-in time. The next two columns give the required total sample size and the sample size that would be required if the chain were uncorrelated; the final column gives the dependence factor, which essentially says how many steps the chain takes until it has ``forgotten'' about its previous value. In this case, RL says that we would need to run the chain for about 30,000 samples to get a sufficiently good estimate of the quantiles for the scale parameter, but that (because the dependence factor is close to 30) we could take every 30th step in the chain and not lose any important information.
%\footnote{for some reason,
%raising the quantile (e.g. seeing how many
%steps it would take to estimate the
%median, $q=0.5$) \emph{increases} the estimated
%number of steps required \todo{why???}}.
\label{multchain}
Another way of assessing convergence is to run multiple chains that start from widely separated (\emph{overdispersed}) points and see whether they have run long enough to overlap (which is a good indication that they have converged). The starting points should be far enough apart to give a good sample of the surface, but should be sufficiently reasonable to give finite posterior probabilities. The \emph{Gelman-Rubin} \citep[G-R, \code{gelman.diag} in the \code{coda} package:][]{Gelman+1996} diagnostic takes this approach. G-R provides a \emph{potential scale reduction factor} (PSRF), estimating how much the between-chain variance could be reduced if the chains were run for longer. The closer to 1 the PSRFs are, the better. The rule of thumb is that they should be less than 1.2. Running a second chain (\code{m2}) for the myxomatosis data starting from (shape=70, scale=0.2) instead of (shape=20, scale=0.05) and running G-R diagnostics on both chains gives:
<>=
set.seed(1002)
metres2 <- metropSB(gammaNLL2,c(70,0.2),retvals=TRUE,retfreq=1,
                    nmax=2500)
m2 <- mcmc(metres2$retvals[,1:2])
@
<<>>=
gelman.diag(mcmc.list(m1,m2))
@
The upper confidence limit for the PSRF for parameter~1 (shape), and the estimated value for parameter~2 (scale), are both greater than 1.2. Apparently we need to run the chains longer.
\subsection{Gibbs sampling}
The major alternative to Metropolis-Hastings sampling is \emph{Gibbs sampling} (or the \emph{Gibbs sampler}), which works for models where we can figure out the posterior probability distribution (and pick a random sample from it), \emph{conditional} on the values of all the other parameters in the model. For example, to estimate the mean and variance of normally distributed data we can cycle back and forth between picking a random value from the posterior distribution for the mean, assuming a particular value of the variance, and picking a random value from the posterior distribution for the variance, assuming a particular value of the mean. The Gibbs sampler obeys the MCMC criterion (\ref{eq:mcmccrit}) because the candidate distribution \emph{is} the posterior distribution, so the jump probability ($P(\mbox{jump } B \to A)$) is equal to the posterior probability of $A$.
Therefore, the Gibbs sampler can always accept the jump ($P_{\mbox{accept}}=1$) and still satisfy
\begin{equation}
\frac{\mbox{Post}(A)}{\mbox{Post}(B)} =
\frac{P(\mbox{jump } B \to A) }%
{P(\mbox{jump } A \to B)}.
\end{equation}
Gibbs sampling works particularly well for hierarchical models (Chapter~\ref{chap:var}). Whether we can do Gibbs sampling or not, we can do \emph{block sampling} by breaking the posterior probability up into a series of conditional probabilities. A complicated joint posterior distribution $\mbox{Post}(p_1,p_2,\ldots,p_n|y) \propto L(y|p_1,p_2,\ldots,p_n) \mbox{Prior}(p_1,p_2,\ldots,p_n)$, which is hard to compute in general, can be broken down in terms of the conditional posterior distribution of a single parameter ($p_1$ in this case), treating all the other parameters as known:
\begin{equation}
\mbox{Post}(p_1|y,p_2,\ldots,p_n) \propto
L(y|p_1,p_2,\ldots,p_n) \cdot \mbox{Prior}(p_1|p_2,\ldots,p_n)
\end{equation}
This decomposition allows us to sample parameters one at a time, either by Gibbs sampling or by Metropolis-Hastings. The advantage is that the posterior distribution of a single parameter, conditional on the rest, may be simple enough so that we can sample directly from the posterior.

BUGS (\textbf{B}ayesian inference \textbf{U}sing \textbf{G}ibbs \textbf{S}ampling) is an amazing piece of software that takes a description of a statistical model and automatically generates a Gibbs sampling algorithm\footnote{I will focus on a text file description, and on the \R\ interface to WinBUGS implemented in the \code{R2WinBUGS} package, but many different variants of automatic Gibbs samplers are springing up. These vary in interface, degree of polish, and supported platforms. (1) WinBUGS runs on Windows, under WINE on Linux, and maybe soon on Intel Macs; models can be defined either graphically or as text files; R2WinBUGS is the R interface. (2) OpenBUGS (\url{http://mathstat.helsinki.fi/openbugs/}) is a new, open version of WinBUGS that runs on Windows and Linux (LinBUGS). OpenBUGS has an R interface, BRugs, but so far it only runs on Windows. (3) JAGS is an alternative version that runs on Linux and MacOS (but may be challenging to set up) and has an R interface.}. WinBUGS is the Windows version, and R2WinBUGS is the R interface for WinBUGS. Some BUGS models have already appeared in Chapter~6. BUGS's syntax closely resembles \R's, with the following important differences:
\begin{itemize}
\item{BUGS is not vectorized. Definitions of vectors of probabilities must be specified using a \code{for} loop.}
\item{\R\ uses the \code{=} symbol to assign values. BUGS uses \verb+<-+ (a stylized left-arrow: e.g. \verb|a <- b+1| instead of \code{a=b+1}).}
\item{BUGS uses a tilde (\verb+~+) to mean ``is distributed as''. For example, to say that $x$ comes from a standard normal distribution (with mean 0 and variance 1: $x \sim N(0,1)$), tell BUGS \verb+x~dnorm(0,1)+.}
\item{While many statistical distributions have the same names as in \R\ (e.g. normal=\code{dnorm}, gamma=\code{dgamma}), watch out! BUGS often uses a different parameterization. For example, where \R\ uses \code{dnorm(x,mean,sd)}, BUGS uses \verb+x~dnorm(mean,prec)+ where \code{prec} is the \emph{precision} --- the reciprocal of the variance. Also note that \code{x} is included in the \code{dnorm} in \R, whereas in BUGS it is on the left side of the \verb+~+ operator.
Read the BUGS documentation (included in WinBUGS) to make sure you understand BUGS's definitions.}
\end{itemize}
The model definition for BUGS should include the priors as well as the likelihoods. Here's a very simple input file, which defines a model for the posterior of the myxomatosis titer data:
\verbatiminput{myxogamma.bug}
After making sure that this file is present in your working directory (use Wordpad or Notepad to edit it; if you use Word, be sure to save the file as text), you can run this model in BUGS by way of R2WinBUGS as follows:
<<>>=
library(R2WinBUGS)
@
<>=
titer <- myxdat$titer
n <- length(titer)
inits <- list(list(shape=100,rate=3),list(shape=20,rate=10))
testmyxo.bugs <- bugs(data=list("titer","n"),
                      inits,parameters.to.save=c("shape","rate"),
                      model.file="myxogamma.bug",
                      n.chains=length(inits),n.iter=5000)
@
Printing out the value of \code{testmyxo.bugs} gives a summary including the mean, standard deviation, quantiles, and the Gelman-Rubin statistic (\code{Rhat}) for each variable. It also gives a DIC estimate for the model. By default this summary only uses a precision of 0.1, but you can use the \code{digits} argument to get more precision, e.g. \code{print(testmyxo.bugs,digits=2)}.
<<>>=
testmyxo.bugs
@
The standard diagnostic plot for a WinBUGS run (\code{plot.bugs(testmyxo.bugs)}) shows the mean and credible intervals for each variable in each chain, as well as the Gelman-Rubin statistics for each variable. You can get slightly different information by turning the result into a \code{coda} object:
<<>>=
testmyxo.coda <- as.mcmc(testmyxo.bugs)
@
\code{summary(testmyxo.coda)} gives information similar to printing \code{testmyxo.bugs}. \code{HPDinterval} gives the credible (\textbf{h}ighest \textbf{p}osterior \textbf{d}ensity) interval for each variable computed from MCMC output. Plotting \code{testmyxo.coda} gives \emph{trace plots} (similar to Figure~\ref{fig:msb2}) and \emph{density plots} of the posterior density (Figure~\ref{fig:codaplot}). Other diagnostic plots are available: see especially \code{?densityplot.mcmc}.
\begin{figure}
<>=
palette(c("black","gray"))
plot(testmyxo.coda)
palette("default")
@
\caption{WinBUGS output plot: default \code{coda} plot, showing trace plots (left) and density plots (right).}
\label{fig:codaplot}
\end{figure}
<>=
x = testmyxo.coda[[1]][,"shape"]
plot(density(x))
h1 = HPDinterval(testmyxo.coda)[[1]]["shape",]
h2 = qcredint(x,verbose=TRUE,tol=0.001)
sum(x>h1[1] & x<h1[2])
sum(x>h2[1] & x<h2[2])
@
\section{Fitting challenges}
\subsection{Slow function evaluations}
If each evaluation of your negative log-likelihood function is slow --- because the data set is large or the model requires expensive computations --- then fitting will be slow too, since every method described in this chapter requires anywhere from dozens to hundreds of thousands of function evaluations.
<>=
nr=1000
nc=2000
m <- matrix(nrow=nr,ncol=nc)
v <- numeric(nc)
t1a <- system.time({
  for (i in 1:nr) {
    for (j in 1:nc) {
      m[i,j] = rnorm(1)
    }
  }})
t1b <- system.time(m <- matrix(rnorm(nr*nc),nrow=nr,ncol=nc))
t2a <- system.time({
  for (j in 1:nc) {
    v[j] = 0
    for (i in 1:nr) {
      v[j] = v[j]+m[i,j]
    }
  }})
t2b <- system.time({
  v = numeric(nc)
  for (i in 1:nc) {
    v[i] = sum(m[,i])
  }})
t2c <- system.time(v <- apply(m,2,sum))
t2d <- system.time(v <- colSums(m))
tmpf <- function(t,d=0) { round(t["elapsed"],d)}
@
Some possible solutions or partial solutions to this problem:
\begin{itemize}
\item{Use more efficient optimization algorithms, such as derivative-based algorithms instead of Nelder-Mead, if you can.}
\item{Derive an analytical expression for the derivatives and write a function to compute it. \code{optim} and \code{mle2} can use this function (via the \code{gr} argument) instead of computing finite differences.}
\item{Rewrite the code that computes the objective function more efficiently in \R. Vectorized operations are almost always faster than \code{for} loops.
For example, filling a \Sexpr{nr} $\times$ \Sexpr{nc} matrix with Normally distributed values one at a time takes \Sexpr{tmpf(t1a)} seconds, while picking two million values and then reformatting them into a matrix takes only \Sexpr{tmpf(t1b,2)} seconds. Calculating the column sums of the matrix by looping over rows and columns takes \Sexpr{tmpf(t2a,1)} seconds; using \code{apply(m,2,sum)} takes \Sexpr{tmpf(t2c,2)} seconds; and using \code{colSums(m)} takes \Sexpr{tmpf(t2d,3)} seconds.
}
\item{If you can program in C or FORTRAN, or have a friend who can, write your objective function in one of these faster, lower-level languages and link it to \R\ (see the R Extensions Manual for details).}
\item{For really big problems, you may need to use tools beyond \R. One such tool is AD Model Builder, which uses \emph{automatic differentiation} --- a very sophisticated algorithm for computing derivatives efficiently --- to speed up computation considerably (\R\ has a very simple form of automatic differentiation built into its \code{deriv} function).}
\item{Compromise by allowing a lower precision for your fits, increasing the \code{reltol} parameter in \code{optim}. Do you really need to know the parameters within a factor of $10^{-8}$, or would $10^{-3}$ do, especially if you know your confidence limits are likely to be much larger? (Be careful: increasing the tolerance in this way may also allow algorithms to stop prematurely at a flat spot on the way to the true minimum.)}
\item{Find a faster computer, or wait longer for the answers.}
\end{itemize}
\subsection{Discontinuities and thresholds}
Models with sudden changes in the log-likelihood (discontinuities) or derivatives of the log-likelihood, or perfectly flat regions, can cause real trouble for general-purpose optimization algorithms\footnote{Specialized algorithms, such as those included in the \code{segmented} package on CRAN, can handle certain classes of piecewise models \citep{Muggeo2003}.}. Discontinuities in the log-likelihood or its derivative can make derivative-based extrapolations wildly wrong. Almost-flat regions can make most methods (including Nelder-Mead) falsely conclude that they've reached a minimum.
<>=
thrfun <- function(x,a,b,thresh) {
  ifelse(x<thresh,a,b)
}
@
\begin{figure}
<>=
op <- par(mfrow=c(2,1),cex=1.5,mgp=c(2.5,1,0),las=1,lwd=2,bty="l")
par(mar=c(1,4,1,1)+0.1)
plot(data.x,data.y,ylab="y",xlab="",axes=FALSE)
axis(side=2)
box()
curve(ifelse(x<3,O1$par[1],O1$par[2]),add=TRUE,lty=1,type="s")
curve(O2$par[1]+(O2$par[2]-O2$par[1])/(1+exp(-O2$par[4]*(x-O2$par[3]))),add=TRUE,lty=2)
par(mar=c(4,4,3,1)+0.1)
plot(thrvec,thrprof,
     xlab="x",ylab="Negative log likelihood",type="s")
lines(thrvec,logistprof0[,"value"],lty=2)
text(c(0,0),c(thrprof[1],
              logistprof0[1,"value"])+5,
     c("threshold","logistic"),adj=0,xpd=NA)
par(op)
@
\caption{Threshold and logistic models. Top: data, showing the data (generated from a threshold model) and the best threshold and logistic fits to the data. Bottom: likelihood profiles.}
\label{fig:thresh1}
\end{figure}
% why does logistic do so much better when the threshold is way off?
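One robust approach for models like this is to treat the troublesome threshold parameter by brute force: fix it at each value on a grid and compute the conditional MLEs of the remaining, well-behaved parameters, which is how the threshold profile in Figure~\ref{fig:thresh1} is constructed. A minimal sketch, assuming the simulated \code{data.x}/\code{data.y} vectors used in the figure (for a Normal threshold model, the conditional MLEs of the means are just group means):
<<eval=FALSE>>=
threshNLL = function(thresh) {
  a = mean(data.y[data.x<thresh])   ## conditional MLE of the lower mean
  b = mean(data.y[data.x>=thresh])  ## conditional MLE of the upper mean
  pred = ifelse(data.x<thresh,a,b)
  s = sqrt(mean((data.y-pred)^2))   ## MLE of the residual SD
  -sum(dnorm(data.y,mean=pred,sd=s,log=TRUE))
}
threshvec = seq(min(data.x)+0.1,max(data.x)-0.1,length=100)
threshprof = sapply(threshvec,threshNLL)
threshvec[which.min(threshprof)]    ## profile estimate of the threshold
@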
<>=
plot(data.x,data.y,xlab="x",ylab="y")
with(as.list(logistprof0[10,]),
     curve(par.a+(par.b-par.a)/(1+exp(-par.slope*(x-thrvec[10]))),add=TRUE))
with(as.list(thrprof0[10,]),
     curve(ifelse(x<thrvec[10],par.a,par.b),add=TRUE,type="s"))
@
\subsection{Multiple minima}
<>=
data(ReedfrogSizepred)
attach(ReedfrogSizepred,warn=FALSE)
logist2 <- function(x,sizep1,sizep2,sizep3) {
  exp(sizep1*(sizep3-x))/(1+exp(sizep2*sizep1*(sizep3-x)))
}
lik.logist2 <- function(sizep1,sizep2,sizep3) {
  p.pred <- logist2(TBL,sizep1,sizep2,sizep3)
  -sum(dbinom(Kill,size=10,prob=p.pred,log=TRUE))
}
mle.logist2 <- mle2(lik.logist2,
   start=list(sizep1=0,sizep2=1,sizep3=12),method="Nelder-Mead")
mle.logist2B <- mle2(lik.logist2,start=list(sizep1=0,sizep2=1,sizep3=12),
   method="BFGS",control=list(parscale=c(0.3,30,10)))
mle.logist2C <- mle2(lik.logist2,start=list(sizep1=0,sizep2=1,sizep3=12),
   method="BFGS",control=list(maxit=1000))
mle.logist2D <- mle2(lik.logist2,start=as.list(coef(mle.logist2C)),
   control=list(maxit=1000))
slice2 = calcslice(mle.logist2,mle.logist2C,n=1000)
detach(ReedfrogSizepred)
@
\begin{figure}
<>=
op <- par(cex=1.5,mgp=c(2.5,1,0),las=1,lwd=2,bty="l")
plot(slice2,type="l",ylab="Negative log-likelihood",xlab="")
points(0,-logLik(mle.logist2))
points(1,-logLik(mle.logist2C))
abline(h=-logLik(mle.logist2C))
abline(h=-logLik(mle.logist2C)+qchisq(0.95,1)/2,lty=2)
par(mar=c(2,2,0,0),bty="l",cex=0.7,lwd=1,
    mgp=c(1,0,0))
par(new=TRUE,fig=c(0.22,0.50,0.55,0.85))
sizeplot(ReedfrogSizepred$TBL,ReedfrogSizepred$Kill,
   xlim=c(5,30),ylim=c(-0.4,10),axes=FALSE,
   xlab="size",ylab="killed",cex.lab=1.5)
with(as.list(coef(mle.logist2)),curve(logist2(x,sizep1,sizep2,sizep3)*10,add=TRUE,lwd=2))
box()
par(new=TRUE,fig=c(0.7,0.98,0.55,0.85))
sizeplot(ReedfrogSizepred$TBL,ReedfrogSizepred$Kill,
   xlim=c(5,30),ylim=c(-0.4,10),axes=FALSE,
   xlab="size",ylab="killed",cex.lab=1.5)
with(as.list(coef(mle.logist2C)),curve(logist2(x,sizep1,sizep2,sizep3)*10,add=TRUE,lwd=2))
box()
par(op)
@
\caption{Likelihood slice connecting two negative log-likelihood minima for the modified logistic model of \cite{VoneshBolker2005}. The $x$ axis is on an arbitrary scale where $x=0$ and $x=1$ represent the locations of the two minima. Subplots show the fits of the curves to the frog predation data at each minimum; the right-hand minimum is a slightly better fit ($-L=\Sexpr{round(as.numeric(-logLik(mle.logist2C)),2)}$, vs. \Sexpr{round(as.numeric(-logLik(mle.logist2)),2)} for the left). The horizontal solid and dashed lines show the minimum negative log-likelihood and the 95\% confidence cutoff ($-L+\chi^2_1(0.95)/2$). The 95\% confidence region includes small regions around both $x=0$ and $x=1$.}
\label{fig:voneshmult}
\end{figure}
Multiple minima are a challenging problem, and are particularly scary because they're not always obvious --- especially in high-dimensional problems. Figure~\ref{fig:voneshmult} shows a slice through parameter space connecting two minima that occur in the negative log-likelihood surface of the modified logistic function that \cite{VoneshBolker2005} used to fit data on tadpole predation as a function of size (the function \code{calcslice} in the \code{emdbook} package will compute such a slice). Such a pattern strongly suggests, although it does not guarantee, that the two points really are local minima. When we wrote the paper, we were aware only of the left-hand minimum, which seemed to fit the data reasonably well.
In preparing this chapter, I re-analyzed the data using BFGS instead of Nelder-Mead optimization and discovered the right-hand fit, which is actually slightly better ($-L=\Sexpr{round(as.numeric(-logLik(mle.logist2C)),2)}$, compared to \Sexpr{round(as.numeric(-logLik(mle.logist2)),2)} for the original fit). Since they use different rules, the Nelder-Mead and BFGS algorithms found their way to different minima despite starting at the same point. This is alarming. While the log-likelihood difference (\Sexpr{round(as.numeric(-logLik(mle.logist2)+logLik(mle.logist2C)),2)}) is not large enough to reject the first set of parameters, and while the fit corresponding to those parameters still seems more biologically plausible (a gradual increase in predation risk followed by a slightly slower decrease, rather than a very sharp increase and gradual decrease), we had no idea that the second minimum existed. \cite{Etienne+2006b} pointed out a similar issue affecting a paper by \cite{Latimer+2005} about diversification patterns in the South African fynbos: some estimates of extremely high speciation rates turned out to be spurious minima in the model's likelihood surface (although the basic conclusions of the original paper still held).

No algorithm can promise to deal with the pathological case of a very narrow, isolated minimum, as in Figure~\ref{fig:direct}. To guard against multiple-minimum problems, try to fit your model with several different reasonable starting points, and check that your answers are reasonable. If your results suggest that you have multiple minima --- that is, you get different answers from different starting points --- check the following:
\begin{itemize}
\item{Did both fits really converge properly? Fits returned by \code{mle2} from the \code{bbmle} package will warn you if the optimization did not converge; for \code{optim} results you need to check the \verb+$convergence+ code of the result (it will be zero if there were no problems). Try restarting the optimizations from the points where each optimization ended up, possibly resetting \code{parscale} to the absolute value of the fitted parameters. (If \code{O1} is your first \code{optim} fit, run the second fit with \verb+control=list(parscale=abs(O1$par))+; if \code{O1} is an \code{mle2} fit, use \verb+control=list(parscale=abs(coef(O1)))+.) Try different optimization methods (BFGS if you used Nelder-Mead, and \emph{vice versa}). Calculate slices or profiles around the optima to make sure they really look like local minima.}
\item{Use \code{calcslice} to compute a likelihood slice between the two putative fits to make sure that the surface is really higher between them.}
\end{itemize}
If your surface really does contain several minima, the simplest solution may be to use a simple, fast method (like BFGS) but start it from many different places. This works well when the surface is essentially smooth but has two (or many) valleys of approximately the same depth\footnote{The many-valley case, or rather its inverse the many-peaks case (if we are maximizing rather than minimizing), is sometimes known as a ``fakir's bed'' problem, after the practice of sitting on a board full of nails \citep{Swartz2003}.}. You will need to decide how to assign starting values (randomly or on a grid? along some transect?) and how many starting values you can afford to try, and you may need to tune the optimization parameters so that each individual optimization runs as fast and smoothly as possible; a minimal sketch of the strategy is shown below.
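Here is one way such a multi-start strategy might look (a sketch only; the helper \code{multistart} and the two-minimum test function \code{f2} are hypothetical illustrations, not part of the analyses above):
<<>>=
## run BFGS from n random starting points and keep the best converged fit
multistart <- function(fn,lower,upper,n=25) {
  best <- NULL
  for (i in 1:n) {
    start <- runif(length(lower),min=lower,max=upper)
    fit <- try(optim(par=start,fn=fn,method="BFGS"),silent=TRUE)
    ok <- !inherits(fit,"try-error") && fit$convergence==0
    if (ok && (is.null(best) || fit$value<best$value)) best <- fit
  }
  best
}
## a function with two local minima near -1 and +1; the one near -1 is deeper
f2 <- function(p) { (p^2-1)^2 + 0.5*p }
multistart(f2,lower=-3,upper=3)$par
@
With enough random starts, the chance of missing the deeper valley shrinks rapidly, provided the valleys are not pathologically narrow.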
Researchers have also developed hybrid approaches based on multiple starts \citep{Tucci2002}.

When multiple minima occur it is possible, although unusual, for the 95\% confidence limits to be discontinuous --- that is, for there to be separate regions around each minimum that are supported by the data. This does happen in the case shown in Figure~\ref{fig:voneshmult}, although on the scale of that figure the confidence intervals in the regions around $x=0$ and $x=1$ would be almost too small to see. More commonly, either one minimum will be a lot deeper than the other, so that only the region around one minimum is included in the confidence region, or the minima will be about the same height but the two valleys will join below the height of the 95\% cutoff, so that the 95\% confidence interval is continuous.

If the surface is jagged instead of smooth, or if you have a sort of fractal surface --- valleys within valleys, of many different depths --- a stochastic global method such as simulated annealing is probably your best bet. Markov chain Monte Carlo can in principle deal with multiple modes, but convergence can be slow: you need to start chains at different modes and allow enough time for each chain to wander to all of the different modes \citep[see][for a related example in phylogenetics]{MosselVigoda2006,Ronquist+2006}.
\subsection{Constraints}
The last technical detail covered here is the problem of constraining parameter values within a particular range. Constraints occur for many reasons, but the most common constraints in ecological models are that some parameters make sense only when they have positive values (e.g. predation or growth rates) or values between 0 and 1 (e.g. probabilities). The three important characteristics of constraints are:
\begin{enumerate}
\item{\emph{Equality} vs. \emph{inequality} constraints: must a parameter or set of parameters be exactly equal to some value, or just stay within boundaries? Constraints on individual parameters are always inequality constraints (e.g. $0<p<1$).}
\item{Constraints on \emph{individual parameters} vs. constraints on \emph{combinations} of parameters: simple bounds on single parameters (``box constraints'') are much easier for optimizers to handle than joint constraints such as $p_1+p_2<1$.}
\item{\emph{Linear} vs. \emph{nonlinear} constraints: constraints that are linear functions of the parameters are easier to enforce than nonlinear ones.}
\end{enumerate}
There are several standard ways to deal with constraints in \R. For simple bounds on individual parameters, the easiest approach is often to transform the parameters so that the optimization is unconstrained, for example fitting the log of a positive parameter or the logit of a probability. \code{optim} and \code{mle2} also offer built-in box constraints (\code{method="L-BFGS-B"}), and \code{constrOptim} handles linear inequality constraints. For more general nonlinear or equality constraints, you can add a \emph{penalty} to the negative log-likelihood that grows as the constraint is violated; the \R\ supplement works through an example.
\section{Estimating confidence limits of functions of parameters}
We often want confidence limits on a quantity that is a function of a model's parameters but is not itself a parameter. As an example, we will fit a Weibull distribution to a subset of the reef goby survival data and find confidence limits on the mean survival time, which is a function of the Weibull shape and scale parameters. Because the fish were only checked at intervals, the data are interval-censored: each fish was last seen alive on day \code{d1} and first noted missing on day \code{d2}\footnote{A more precise likelihood for interval-censored data uses the probability that a fish died somewhere between day \code{d1} and day \code{d2}:
<>=
weiblikfun <- function(shape,scale) {
  -sum(log(pweibull(dat$d2,shape,scale)-
           pweibull(dat$d1,shape,scale)))
}
@
For this example, I've used the cruder, simpler approach of averaging \code{d1} and \code{d2}.}
<<>>=
library(emdbookx)
data(GobySurvival)
dat = subset(GobySurvival,exper==1 & density==9 & qual>median(qual))
time = (dat$d1+dat$d2)/2
@
Set up a simple likelihood function:
<<>>=
weiblikfun = function(shape,scale) {
  -sum(dweibull(time,shape=shape,scale=scale,log=TRUE))
}
@
Fit the model, starting from an exponential distribution (if the shape $a=1$, the Weibull reduces to an exponential with rate $1/b$ and mean $b$):
<<>>=
w1 <- mle2(weiblikfun,start=list(shape=1,scale=mean(time)))
@
<>=
meanfun = function(shape,scale) { scale*gamma(1+1/shape) }
@
The parameter estimates (\code{coef(w1)}) are shape=\Sexpr{round(coef(w1)["shape"],3)} and scale=\Sexpr{round(coef(w1)["scale"],3)}, and the estimate of the mean survival time (obtained by plugging the parameter estimates into \code{meanfun}, defined below) is \Sexpr{round(meanfun(coef(w1)["shape"],coef(w1)["scale"]),3)}.
\subsection{Profile likelihood}
Now we'd like confidence intervals for the mean that take the variability of both shape and scale into account. The most rigorous way to estimate confidence limits on a quantity that is not itself a parameter is to calculate the profile likelihood for that quantity and find the 95\% confidence limits, using almost the same procedure as for the univariate confidence limits of a single parameter.
\begin{figure}
<>=
weiblikfun2 <- function(shape,mu) {
  scale <- mu/gamma(1+1/shape)
  -sum(dweibull(time,shape=shape,scale=scale,log=TRUE))
}
w2 <- mle2(weiblikfun2,start=list(shape=1,mu=mean(time)))
@
<>=
## compute the negative log-likelihood surface
smat <- curve3d(weiblikfun,
  from=c(0.5,5),
  to=c(1.2,25),
  n=c(40,40),sys3d="none")
## mean survival time for all combinations of shape and scale
## (outer is a quick way to generate matrices, but it only works
##  for vectorizable functions)
mmat <- outer(smat$x,smat$y,meanfun)
@
<>=
op=par(lwd=2,bty="l",las=1,cex=1.5,mgp=c(2.5,1,0))
alevels = c(0.8,0.9,0.95,0.99)
mlevels = seq(6,38,by=4)
contour(smat$x,smat$y,smat$z,levels=qchisq(alevels,1)/2-logLik(w1),
        labels=alevels,
        xlab="Shape",ylab="Scale",col="darkgray",labcex=1)
points(coef(w1)[1],coef(w1)[2])
contour(smat$x,smat$y,mmat,add=TRUE,col=1,levels=mlevels,labcex=1,method="edge")
p2 = profile(w2)
p2B = p2@profile$mu$par.vals
shape <- p2B[,1]
mu <- p2B[,2]
scale <- mu/gamma(1+1/shape)
lines(shape,scale,lty=3)
shvals <- approx(p2@profile$mu$z,p2B[,"shape"],xout=c(-1.96,1.96))$y
muvals <- approx(p2@profile$mu$z,p2B[,"mu"],xout=c(-1.96,1.96))$y
contour(smat$x,smat$y,mmat,add=TRUE,col=1,levels=round(muvals,1),lty=2,drawlabels=FALSE)
scvals <- muvals/gamma(1+1/shvals)
points(shvals,scvals,pch=3)
@
\caption{Geometry of confidence intervals on mean survival time. Gray contours: univariate (80\%, 90\%, 95\%, 99\%) confidence contours for shape and scale. Black contours: mean survival time. Dotted line: likelihood profile for mean survival time.}
\label{fig:weibsurf1}
\end{figure}
Figure~\ref{fig:weibsurf1} illustrates the basic geometry of this problem: the underlying contours of the height of the surface (at the 80\%, 90\%, 95\%, and 99\% \emph{univariate} confidence levels) are shown in gray. The black contours show the lines on the plot that correspond to different constant values of the mean survival time. The dotted line is the likelihood profile for the mean, which passes through the minimum negative log-likelihood point on each mean contour, the point where the mean contour is tangent to a likelihood contour line. To find the confidence limits, we look for the intersections of the likelihood ratio test contours with the likelihood profile for the mean: reading off the 95\% contour, the confidence interval of the mean is approximately \Sexpr{round(muvals[1])} to \Sexpr{round(muvals[2])}.
\subsubsection{The value can be expressed in terms of other parameters}
When the value for which you want to estimate confidence limits has a formula that you can solve in terms of one of the parameters, calculating its confidence limits is easy. For the Weibull distribution the mean $\mu$ is given by
\begin{equation}
\mu = \mbox{scale} \cdot \Gamma(1+1/\mbox{shape}),
\label{eq:weibmean}
\end{equation}
or, translating to \R:
<<>>=
meanfun = function(shape,scale) {
  scale*gamma(1+1/shape)
}
@
How do we actually calculate the profile for the mean? We can solve equation~\ref{eq:weibmean} for one of the parameters:
\begin{equation}
\mbox{scale} = \frac{\mu}{\Gamma(1+1/\mbox{shape})}.
\end{equation}
Therefore we can find the likelihood profile for the mean in almost the same way we would for one of the parameters. Fix the value of $\mu$; then, for each value of the shape that \R\ tries on its way to finding the minimum, it will calculate the value of the scale that must apply if the mean is to be fixed at $\mu$.
The constraint means that, even though the model has two parameters (shape and scale), we are really doing a one-dimensional search; it just happens to be a search along a specified constant-mean contour. In order to calculate the confidence interval on the mean, we have to rewrite the likelihood function in terms of the mean:
<<>>=
weiblikfun2 <- function(shape,mu) {
  scale <- mu/gamma(1+1/shape)
  -sum(dweibull(time,shape=shape,scale=scale,log=TRUE))
}
@
Find the maximum likelihood estimate again, and calculate the confidence intervals --- this time for the shape and the mean.
<<>>=
w2 <- mle2(weiblikfun2,start=list(shape=1,mu=mean(time)))
confint(w2,quietly=TRUE)
@
<>=
ci.prof <- confint(w2,quietly=TRUE)["mu",]
@
% \begin{figure}
<>=
op=par(lwd=2,bty="l",las=1,cex=1.5,mgp=c(2.5,1,0))
source("sqrprofplot.R")
sqrprofplot(p2,which=2,sqrVal=TRUE,conf=c(0.9,0.95),
   col.conf="gray",plot.confstr=TRUE,
   col.prof="black",col.minval=NA,
   xlab="Mean",
   ylab="Deviance")
@
% \caption{Likelihood profile for Weibull mean}
% \label{fig:weibprof1}
% \end{figure}
We could also draw the univariate likelihood profile for $\mu$ (the minimum negative log-likelihood achievable for each value of the mean) and find the 95\% confidence limits in the same way as before. We would use 1~degree of freedom to establish the critical value for the LRT because we are only varying one value, even though it represents a combination of two parameters.
\subsubsection{Constrained/penalized likelihood}
What if we can't solve for one of the parameters (e.g. scale) in terms of the value we are interested in (e.g. mean), but still want to calculate a likelihood profile and profile confidence limits for that value? We can use a penalized likelihood function to constrain the mean to a particular value, as described above in the section on constraints.
<>=
weiblikfun3 <- function(shape,scale,target.mu,penalty=1000) {
  mu <- meanfun(shape,scale)
  NLL = -sum(dweibull(time,shape=shape,scale=scale,log=TRUE))
  pen = penalty*(mu-target.mu)^2
  NLL+pen
}
w3 <- mle2(weiblikfun3,start=list(shape=0.9,scale=13),fixed=list(target.mu=13))
penlikfun3 <- function(p,meanval,penalty=1000) {
  v <- -sum(dweibull(time,shape=p[1],scale=p[2],log=TRUE))
  pen <- penalty*(meanfun(p[1],p[2])-meanval)^2
  v+pen
}
penlikfun2 <- function(m,penalty=1000) {
  o <- optim(fn=penlikfun3,
             par=coef(w1),
             method="BFGS",
             meanval=m,
             penalty=penalty)
  r <- c(o$value,o$par)
  names(r) <- c("NLL","shape","scale")
  r
}
critval <- c(-logLik(w1)+qchisq(0.95,1)/2)
pcritfun <- function(m,penalty=1000) {
  penlikfun2(m,penalty=penalty)[1]-critval
}
lowx <- c(5,13)
upx <- c(14,30)
penlower <- uniroot(pcritfun,lowx)$root
penupper <- uniroot(pcritfun,upx)$root
ci.pen1 = c(penlower,penupper)
penvec <- 1:5
penresults <- matrix(nrow=length(penvec),ncol=3)
for (i in 2:length(penvec)) {
  penlower <- uniroot(pcritfun,lowx,penalty=10^penvec[i])$root
  penupper <- uniroot(pcritfun,upx,penalty=10^penvec[i])$root
  penresults[i,] <- c(penvec[i],penlower,penupper)
}
@
While this approach is conceptually the same as the one we took in the previous section --- we are calculating the profile by sliding along each mean contour to find the minimum negative log-likelihood on that contour, then finding the values of the mean for which the minimum negative log-likelihood equals the LRT cutoff --- the problem is much fussier numerically. (The code is presented on p.~\pageref{constrlik}.) To use penalties effectively we usually have to play around with the strength of the penalty. Too strong, and our optimizations will get stuck somewhere far away from the real minimum. Too weak, and our optimizations will wander off the contour we are trying to constrain them to. I tried a variety of penalty coefficients in this case (penalty $= C \times$ (deviation of mean survival from target value)$^2$) from 0.1 to $10^6$. The results were essentially identical for penalties ranging from 1 to $10^4$, but varied for weaker or stronger penalties. One might be able to tweak the optimization settings to do better, but there's no really simple recipe --- you just have to keep returning to the pictures to see if your answers make sense.
\subsection{The delta method}
The delta method provides an easy approximation for the confidence limits on values that are not parameters of the model. To use it you must have a formula for $\mu=f(a,b)$ that you can differentiate with respect to $a$ and $b$. Unlike the first likelihood profile method, you don't have to be able to solve the equation for one of the parameters. The formula for the delta method comes from a Taylor expansion of the formula for $\mu$, combined with the definitions of the variance ($V(a) = E[(a-\bar a)^2]$) and covariance ($C(a,b)= E[(a-\bar a) (b - \bar b)]$):
\begin{equation}
V(f(a,b)) \approx V(a) \left( \frac{\partial f}{\partial a} \right)^2 + V(b) \left( \frac{\partial f}{\partial b} \right)^2 + 2 C(a,b) \frac{\partial f}{\partial a} \frac{\partial f}{\partial b}.
\end{equation}
See the Appendix, or \cite{Lyons1991}, for a derivation and details. We can obtain approximate variances and covariances of the parameters by taking the inverse of the information matrix: \code{vcov} does this automatically for \code{mle2} fits. We also need the derivatives of the function with respect to the parameters: in this example, the derivatives of $\mu = b \Gamma(1+1/a)$ with respect to shape~$a$ and scale~$b$. The derivative with respect to $b$ is easy --- $\partial \mu/\partial b =\Gamma(1+1/a)$ --- but $\partial \mu/\partial a$ is harder. By the chain rule
\begin{equation}
\frac{\partial (\Gamma(1+1/a))}{\partial a} =
\frac{\partial (\Gamma(1+1/a))}{\partial (1+1/a)} \cdot
\frac{\partial (1+1/a)}{\partial a} =
\frac{\partial (\Gamma(1+1/a))}{\partial (1+1/a)} \cdot -\frac{1}{a^2},
\end{equation}
but in order to finish this calculation you need to know that $d \Gamma(x)/dx = \Gamma(x) \cdot \mbox{digamma}(x)$, where digamma is a special function (defined as the derivative of the log-gamma function). The good news is that \R\ knows how to compute this function, so a command like
<>=
shape.deriv <- -scale/shape^2*gamma(1+1/shape)*digamma(1+1/shape)
@
will give you the right numeric answer for $\partial \mu/\partial a$.
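Putting the pieces together by hand (a sketch of the computation, using the variances and covariance from \code{vcov(w1)} and the two derivatives above; the names \code{dmu.da} etc.\ are ours, not standard):
<<>>=
V <- vcov(w1)
a <- coef(w1)["shape"]
b <- coef(w1)["scale"]
dmu.db <- gamma(1+1/a)                        ## d(mu)/d(scale)
dmu.da <- -b/a^2*gamma(1+1/a)*digamma(1+1/a)  ## d(mu)/d(shape)
## delta-method variance of the mean survival time
dvar.byhand <- dmu.da^2*V["shape","shape"] + dmu.db^2*V["scale","scale"] +
    2*dmu.da*dmu.db*V["shape","scale"]
@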
The \code{emdbook} package has a built-in \code{deltavar} function that uses the delta method to compute the variance of a function:
<<>>=
dvar <- deltavar(fun=scale*gamma(1+1/shape),meanval=coef(w1),Sigma=vcov(w1))
@
Once you have the variance of the mean survival time, you can take the square root to get the standard deviation $\sigma$ and calculate the approximate confidence limits $\mu \pm 1.96 \sigma$.
<<>>=
sdapprox <- sqrt(dvar)
mlmean <- meanfun(coef(w1)["shape"],coef(w1)["scale"])
ci.delta <- mlmean+c(-1.96,1.96)*sdapprox
@
If you can't compute the derivatives manually, \R's \code{numericDeriv} function will compute them numerically (p.~\pageref{sec:numderiv}).
<>=
pvars <- diag(vcov(w1))
pcov <- vcov(w1)["shape","scale"]
mlshape <- coef(w1)["shape"]
mlscale <- coef(w1)["scale"]
scderiv <- gamma(1+1/mlshape)
shderiv <- -mlscale/mlshape^2*gamma(1+1/mlshape)*digamma(1+1/mlshape)
vapprox1 <- pvars["scale"]*scderiv^2
vapprox2 <- pvars["shape"]*shderiv^2
vapprox3 <- 2*pcov*shderiv*scderiv
vapprox <- vapprox1+vapprox2+vapprox3
@
\subsection{Population prediction intervals (PPI)}
\label{sec:ppi}
Another simple procedure for calculating confidence limits is to draw random samples from the estimated sampling distribution of the parameters (approximated by the information matrix). In the large-sample limit where the information-matrix approach is valid, the sampling distribution of the parameter estimates is multivariate normal, with a variance-covariance matrix given by the inverse of the information matrix. The \code{MASS} package in \R\ has a function, \code{mvrnorm}\footnote{\code{mvrnorm} should really be called \code{rmvnorm} for consistency with \R's other distribution functions, but \code{S-PLUS} already has a built-in function called \code{rmvnorm}, so the MASS package had to use a different name.}, for picking multivariate normal random deviates. With the \code{mle2} fit \code{w1} from above,
<<>>=
vmat=mvrnorm(1000,mu=coef(w1),Sigma=vcov(w1))
@
will select 1000 sets of parameters drawn from the appropriate distribution (if there are $n$ parameters, the answer is a $1000\times n$ matrix). (If you have used \code{optim} instead of \code{mle2} --- suppose \code{opt1} is your result, fitted with \code{hessian=TRUE} --- then use \verb+opt1$par+ for the mean and \verb+solve(opt1$hessian)+ for the variance.) You can then use this matrix to calculate the estimated value of the mean for each set of parameters, treat the result as a distribution of means, and find its lower and upper 95\% quantiles (Figure~\ref{fig:ppibayes}). In the context of population viability analysis, \cite{Lande+2003} refer to confidence intervals computed this way as ``population prediction intervals''. This procedure is easy to implement in \R, as follows:
<<>>=
dist = numeric(1000)
for (i in 1:1000) {
  dist[i] = meanfun(vmat[i,1],vmat[i,2])
}
quantile(dist,c(0.025,0.975))
@
<>=
ci.ppi <- quantile(dist,c(0.025,0.975))
@
Calculating population prediction intervals in this way has two disadvantages:
\begin{itemize}
\item{It blurs the line between frequentist and Bayesian approaches. Several papers (including some of mine, e.g. \cite{VoneshBolker2005}) have used this approach, but I have yet to see a solidly grounded justification for propagating the sampling distributions of the parameters in this way.}
\item{Since it uses the asymptotic estimate of the parameter variance-covariance matrix, it inherits whatever inaccuracies that approximation introduces.
It makes one fewer assumption than the delta method (it doesn't assume the variance is so small that the function is close to linear), but it may not be much more accurate.}
\end{itemize}
<>=
lval <- coef(w1)["scale"]^(-coef(w1)["shape"])
n <- length(time)
inits <- list(list(shape=0.8,lambda=lval),list(shape=0.4,lambda=lval*2),
              list(shape=1.2,lambda=lval/2))
reefgoby.bugs <- bugs(data=list("time","n"),
    inits,parameters.to.save=c("shape","scale","lambda","mean"),
    model.file="reefgobysurv.bug",
    n.chains=length(inits),n.iter=5000)
reefgoby.coda <- as.mcmc(reefgoby.bugs)
reefgoby.coda <- lump.mcmc.list(reefgoby.coda)
ci.bayes <- HPDinterval(reefgoby.coda)["mean",]
m <- as.matrix(reefgoby.coda)[,"mean"]
ci.bayesq <- quantile(m,c(0.025,0.975))
@
\begin{figure}
<>=
op=par(lwd=2,bty="l",las=1,cex=1.5,mgp=c(3,1,0))
vals = mvrnorm(1000,mu=coef(w1),Sigma=vcov(w1))
dist = apply(vals,1,function(x) meanfun(x[1],x[2]))
textsize = 0.5
ang = 15
par(yaxs="i")
conf.ppi <- quantile(dist,c(0.025,0.975))
par(mar=c(4,4,2,2))
d1 = density(dist)
d2 = density(lump.mcmc.list(as.mcmc(reefgoby.bugs)[,"mean"]))
plot(d1,ylim=c(0,0.12),main="",
     xlab="",axes=FALSE,xlim=c(0,30))
axis(side=2,at=c(0,0.05,0.1,0.15))
lines(d2,col="gray")
axis(side=1)
box()
mtext(side=1,line=2.5,at=15,"Mean survival time",cex=1.5)
segments(rep(conf.ppi,2),
         rep(0,4),
         rep(conf.ppi,2),
         approx(d1$x,d1$y,xout=conf.ppi)$y)
mpos = 13
labsize = 3
ht1 <- 0.006
arrows(c(mpos-labsize,mpos+labsize),
       rep(ht1,2),
       conf.ppi,
       rep(ht1,2),ang=ang)
text(mpos,ht1,"PP interval",cex=textsize)
segments(rep(ci.bayes,2),
         rep(0,4),
         rep(ci.bayes,2),
         approx(d2$x,d2$y,xout=ci.bayes)$y,
         col="gray")
ht2 <- 0.012
mpos2 <- 15
labsize2 <- 3.5
arrows(c(mpos2-labsize2,mpos2+labsize2),
       rep(ht2,2),
       ci.bayes,
       rep(ht2,2),col="gray",ang=ang)
text(mpos2,ht2,"Bayes credible",col="darkgray",cex=textsize)
par(op)
@
\caption{Population prediction distribution and Bayesian posterior distribution of mean survival time, with confidence and credible intervals.}
\label{fig:ppibayes}
\end{figure}
\subsection{Bayesian analysis}
Finally, you can use a real Bayesian method: construct either an exact Bayesian model or, more likely, a Markov chain Monte Carlo analysis for the parameters. You can then calculate the posterior distribution of any function of the parameters (such as the mean survival time) from the posterior samples of the parameters, and find its 95\% quantiles or credible interval.
% R: shape a, scale b: (a/b) (x/b)^(a-1) exp(- (x/b)^a)
%  = a b^(-a) x^(a-1) exp(-(b^(-a)) x^a)
% BUGS: shape nu, lambda: nu lambda x^(nu-1) exp(-lambda x^nu)
% lambda = b^(-a)
% b = lambda^(-1/a)
% mean = b*gamma(1+1/a) = lambda^(-1/nu)*gamma(1+1/nu)
The hardest part of this analysis turns out to be converting between the \R\ and WinBUGS versions of the Weibull distribution: where \R\ uses $f(t)=(a/b)(t/b)^{a-1} \exp(-(t/b)^a)$, WinBUGS uses $f(t)=\nu \lambda t^{\nu-1} \exp(-\lambda t^\nu)$. Matching up terms and doing some algebra shows that $\nu=a$ and $\lambda=b^{-a}$, or $b=\lambda^{-1/a}$. The BUGS model is:
\verbatiminput{reefgobysurv.bug}
Other differences between \R\ and WinBUGS are that BUGS uses \code{pow(x,y)} instead of \verb+x^y+ and has only a log-gamma function \code{loggam} instead of \R's \code{gamma} and \code{lgamma} functions. The model includes code to convert from WinBUGS to \R\ parameters (i.e., calculating \code{scale} as a function of \code{lambda}) and to calculate the mean survival time, but you could also calculate these values in \R. Set up three chains that start from different, overdispersed values of shape and $\lambda$:
<>=
lval <- coef(w1)["scale"]^(-coef(w1)["shape"])
n <- length(time)
inits <- list(list(shape=0.8,lambda=lval),list(shape=0.4,lambda=lval*2),
              list(shape=1.2,lambda=lval/2))
@
\label{winbugsrun}
Run the chains:
<>=
reefgoby.bugs <- bugs(data=list("time","n"),
    inits,parameters.to.save=c("shape","scale","lambda","mean"),
    model.file="reefgobysurv.bug",
    n.chains=length(inits),n.iter=5000)
@
Finally, use \code{HPDinterval} or \code{summary} to extract credible intervals or quantiles from the MCMC output. Figure~\ref{fig:ppibayes} compares the marginal posterior density of the mean, and the credible intervals computed from it, with the distribution of the mean derived from the sampling distribution of the parameters and the population prediction intervals (Section~\ref{sec:ppi}).
\subsection{Confidence interval comparison}
Here's a head-to-head comparison of all the methods we've applied so far:
<>=
citab = rbind(ci.prof,ci.pen1,ci.delta,ci.ppi,ci.bayes)
cinames = c("exact profile","profile:penalty",
            "delta method","PPI",
            "Bayes credible")
dimnames(citab)=list(cinames,c("lower","upper"))
@
<>=
latex(round(citab,3),file="",table.env=FALSE,title="method")
@
<>=
tabpaste <- function(x,sep=" & ",eol="\\") {
  x2 <- rbind(c("",colnames(x)),cbind(rownames(x),
              matrix(as.character(x),nrow=nrow(x))))
  x3 <- apply(x2,1,paste,collapse=sep)
  x4 <- paste(x3,collapse=eol)
  return(x4)
}
@
All methods give approximately the same answers. Despite answering a different question, the Bayes credible interval is in the same range as the other confidence intervals. The point to take away from this comparison is that \emph{all} methods for estimating confidence limits use approximations, some cruder than others. Use the most accurate feasible approach, but don't expect estimates of confidence limits to be very precise. To paraphrase a comment of \cite{Press+1994}: if the difference between confidence-interval approximations ever matters to you, ``then you are probably up to no good anyway --- e.g., trying to substantiate a questionable hypothesis with marginal data''\footnote{Their original statement referred to whether to divide by $n$ or $n-1$ when estimating a variance.}.
\section*{Appendix: troubleshooting optimization}
\begin{itemize}
\item{\textbf{make sure you understand the model you're fitting}}
\item{check starting conditions}
\item{check convergence conditions}
\item{adjust \code{parscale}/restart from previous best fit}
\item{switch from constraints to transformed parameters}
\item{adjust finite-difference tolerances (\code{ndeps})}
\item{switch to more robust methods (Nelder-Mead, SANN), or even just alternative methods}
\item{stops with NAs: debug the objective function, constrain parameters, or put if clauses in the objective function}
\item{results depend on starting conditions: check slices between/around the answers (multiple minima or just convergence problems?)}
\item{convergence problems: try restarting from the previous stopping point, resetting \code{parscale}}
\item{examine profile likelihoods}
\end{itemize}
\section*{\R\ supplement}
\subsection{Testing hypotheses on boundaries by simulating the null hypothesis}
\label{sec:nullhypsim}
Suppose you want to test the hypothesis that the data set
<<>>=
x=c(0,0,0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,4,5)
@
comes from a negative binomial distribution, against the null hypothesis that it is Poisson distributed with $\lambda=\bar x=\Sexpr{mean(x)}$.
<>=
fit.nb=fitdistr(x,"negative binomial")
lnb = c(-logLik(fit.nb))
fit.pois=fitdistr(x,"Poisson")
lpois = c(-logLik(fit.pois))
dev = c(-2*(lnb-lpois))
pval.lrt = pchisq(dev,df=1,lower.tail=FALSE)
pval.corr = pchibarsq(dev,df=1,lower.tail=FALSE)
@
A negative binomial fit (\code{fit.nb=fitdistr(x,"negative binomial")}) gives a negative log-likelihood (\code{-logLik(fit.nb)}) of \Sexpr{round(lnb,2)}, while a Poisson fit (\code{fit.pois=fitdistr(x,"Poisson")}) gives a negative log-likelihood of \Sexpr{round(lpois,2)}. The Likelihood Ratio Test
<>=
devdiff = 2*(logLik(fit.nb)-logLik(fit.pois))
pchisq(devdiff,df=1,lower.tail=FALSE)
@
says that the $p$-value is \Sexpr{round(pval.lrt,2)}, but the corrected ($\bar \chi^2_1$) test (\code{pchibarsq(devdiff,df=1,lower.tail=FALSE)}) says that $p$ is only \Sexpr{round(pval.corr,2)} --- still not significant, but stronger evidence.

To evaluate the hypothesis more thoroughly by simulation, we will set up a function that (1) simulates Poisson-distributed values with the appropriate mean; (2) fits negative binomial and Poisson distributions (returning \code{NA} if the negative binomial fit should happen to crash); and (3) returns the deviance (twice the log-likelihood ratio):
<>=
simulated.dev = function() {
  simx = rpois(length(x),lambda=mean(x))
  simfitnb = try(fitdistr(simx,"negative binomial"))
  if (inherits(simfitnb,"try-error")) return(NA)
  simfitpois = fitdistr(simx,"Poisson")
  c(2*(logLik(simfitnb)-logLik(simfitpois)))
}
@
Now simulate 3000 such values, throw out the \code{NA}s, and count the number of replicates remaining:
<>=
set.seed(1001)
devdist = replicate(3000,simulated.dev())
devdist = na.omit(devdist)
nreps = length(devdist)
@
Calculate the proportion of simulated values that exceed the observed deviance: this is the best estimate of the ``true'' $p$ value we can get.
<<>>=
obs.dev=2*(logLik(fit.nb)-logLik(fit.pois))
sum(devdist>=obs.dev)/nreps
@
So, in this case, where we have two reasons --- small sample size and a boundary condition --- to doubt the assumptions of the Likelihood Ratio Test, the classical LRT turns out to be nearly four times too conservative, while the boundary-corrected version ($\bar \chi^2$) is only twice as conservative as it should be.
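As a quick graphical check (a sketch; \code{dchibarsq} is the $\bar \chi^2$ density from the \code{emdbook} package), you can overlay the two candidate null distributions on the simulated deviances:
<<>>=
hist(devdist,breaks=50,freq=FALSE,main="",xlab="Deviance")
curve(dchisq(x,df=1),add=TRUE,lty=2)     ## classical chi-squared(1) null
curve(dchibarsq(x,df=1),add=TRUE,lty=1)  ## boundary-corrected null
@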
\subsection{Nonlinear constraints by penalization}
\label{constrlik}
Using penalties to implement an equality constraint or a nonlinear constraint (neither of which can be done with the built-in optimization functions in \R) is reasonably straightforward: just add a penalty term to the negative log-likelihood. For best results, the penalty should start small and increase with increasing violation of the constraint (to avoid a discontinuity in the negative log-likelihood surface). For example, to find the best shape and scale parameters for the fish survival data while constraining the mean to equal a particular value \code{target.mu} (use the \code{fixed=} argument in \code{mle2} to specify the target value):
<>=
weiblikfun3 <- function(shape,scale,target.mu,penalty=1000) {
  mu <- meanfun(shape,scale)
  NLL = -sum(dweibull(time,shape=shape,scale=scale,log=TRUE))
  pen = penalty*(mu-target.mu)^2
  NLL+pen
}
w3 <- mle2(weiblikfun3,start=list(shape=0.9,scale=13),fixed=list(target.mu=13))
@
If you have a problem where the function behaves badly (generating infinite or \code{NaN} values) when the constraint is violated, then you don't want to calculate the likelihood for values outside the constraints. For example, if we had to restrict the shape to be greater than zero, we could use the following code snippet:
<>=
if (shape>0) {
  NLL = -sum(dweibull(time,shape=shape,scale=scale,log=TRUE))
  pen = 0
} else {
  NLL = -sum(dweibull(time,shape=0.0001,scale=scale,log=TRUE))
  pen = penalty*shape^2
}
NLL+pen
@
In other words, if the shape parameter is beyond the constraints, use the likelihood value at the boundary of the feasible region and then add the penalty. To use this constrained likelihood function to calculate confidence limits on the mean, first calculate the critical value of the negative log-likelihood:
<<>>=
critval <- -logLik(w1)+qchisq(0.95,1)/2
@
Second, define a function that finds the best fit for a specified value of the mean and returns the distance above the critical value (use the \code{data=} argument in \code{mle2} so that you can try out different values of the penalty):
<<>>=
pcritfun <- function(target.mu,penalty=1000) {
  mfit <- mle2(weiblikfun3,
     start=list(shape=0.85,scale=12.4),
     fixed=list(target.mu=target.mu),
     data=list(penalty=penalty))
  lval <- -logLik(mfit)
  lval - critval
}
@
Third, define the range of mean values in which you think the lower confidence limit lies and use \code{uniroot} to search within this range for the point where the negative log-likelihood is exactly equal to the critical value:
<>=
lowx <- c(5,13)
penlower <- uniroot(pcritfun,lowx)$root
@
Do the same for the upper confidence limit:
<>=
upx <- c(14,30)
penupper <- uniroot(pcritfun,upx)$root
@
Try with a different value of the penalty:
<>=
uniroot(pcritfun,lowx,penalty=1e6)$root
@
\subsection{Numeric derivatives}
\label{sec:numderiv}
Analytical derivatives are generally faster and more numerically stable, but \R\ can compute numerical derivatives for you. For example, to compute the derivatives of the mean survival time at the maximum likelihood estimate:
<<>>=
shape <- coef(w1)["shape"]
scale <- coef(w1)["scale"]
numericDeriv(quote(scale*gamma(1+1/shape)),c("scale","shape"))
@
(The \code{quote} inside the \code{numericDeriv} command prevents \R\ from evaluating the expression prematurely.) Of course, you can always do the same thing yourself by hand:
<<>>=
dshape = 0.0001
x2 = scale*gamma(1+1/(shape+dshape))
x1 = scale*gamma(1+1/shape)
(x2-x1)/dshape
@
which agrees to two decimal places with the \code{numericDeriv} calculation.
\subsection{Extracting information from BUGS and CODA output}
R2WinBUGS returns its results as a \code{bugs} object, which can be plotted or printed. The \code{as.mcmc} function in the \code{emdbook} package will turn this object into an \code{mcmc.list} object for a multi-chain run, or an \code{mcmc} object for a single-chain run. (\code{read.bugs} in the \code{R2WinBUGS} package also works, but requires an extra step.) The \code{mcmc} and \code{mcmc.list} objects are more flexible --- they can be plotted and summarized in a variety of ways (\code{summary}, \code{HPDinterval}, \code{densityplot}, \ldots; see the help for the \code{coda} package). Once you have ensured that the chains in a multi-chain R2WinBUGS run have converged, you can use \code{lump.mcmc.list} in the \code{emdbook} package to collapse the \code{mcmc.list} object so that you can draw inferences from the combined chains.

Using the \code{reefgoby.bugs} object derived from the WinBUGS run on p.~\pageref{winbugsrun}, calculate the Bayesian credible interval:
<<>>=
reefgoby.coda <- as.mcmc(reefgoby.bugs)
reefgoby.coda <- lump.mcmc.list(reefgoby.coda)
ci.bayes <- HPDinterval(reefgoby.coda)["mean",]
@
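Because the lumped chains contain posterior samples of \code{shape} and \code{scale}, you can also compute the posterior of any derived quantity directly in \R\ instead of inside BUGS; for example, recomputing the mean survival time from the samples (a sketch using the objects just created):
<<>>=
post <- as.matrix(reefgoby.coda)
## posterior sample of the mean, computed sample by sample
post.mean <- post[,"scale"]*gamma(1+1/post[,"shape"])
quantile(post.mean,c(0.025,0.975))
@
This should closely match the interval for the \code{mean} node monitored within WinBUGS itself.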