<>= library(emdbook) library(bbmle) library(MASS) library(nlme) source("chapskel.R") @ \section*{Summary} This chapter rapidly reviews much of classical statistics, discussing the underlying likelihood models for procedures such as ANOVA, linear regression, and generalized linear models. It also gives brief pointers to the built-in procedures in \R\ that implement these standard techniques. This summary connects maximum likelihood approaches with more familiar classical techniques. If you're already familiar with classical techniques, it may help you understood maximum likelihood better. It also provides a starting point for using efficient, ``canned'' approaches when they are appropriate for your data. It does not, and cannot, provide full coverage of all these topics. For more details, see \cite{Dalgaard2003}, \cite{Crawley2005}, or \cite{MASS}. \section{Introduction} So far this book has covered maximum likelihood and Bayesian estimation in some detail. In the course of the discussion I have sometimes mentioned that maximum likelihood analyses give answers equivalent to those provided by familiar, ``old-fashioned'' statistical procedures. For example, the statistical model $Y \sim \mbox{Normal}(a+bx,\sigma^2)$ --- specifying that $Y$ is a normally distributed random variable whose mean depends linearly on $x$ --- underlies ordinary least-squares linear regression. This chapter will briefly review special cases where our general recipe for finding MLEs for statistical models reduces to standard procedures that are built into \R\ and other statistics packages. In the best case, your data will match a classical technique like linear regression exactly, and the answers provided by classical statistical models will agree with the results from your likelihood model. Other models you build may be formally equivalent to a classical model that is parameterized in a different way. Most often, the customized model you build will not be exactly equivalent to any existing classical model, but a similar classical model may be close enough that you wouldn't mind changing your model slightly in order to gain the convenience of using a standard procedure. For example, in Chapter~6 we used the model \begin{equation} Y \sim \mbox{NegBinom}(\mu=a \cdot \mbox{DBH}^b,k) \end{equation} to represent cone production by fir trees as a function of diameter at breast height. If we approximated the discrete distribution of cones by a continuous log-normal distribution instead, \begin{equation} Y \sim \mbox{LogNormal}(\mu=a \cdot \mbox{DBH}^b,\sigma^2), \end{equation} we could log-transform both sides and fit the linear regression model \begin{equation} \log Y \sim \mbox{Normal}(\log a+ b \cdot \log(\mbox{DBH}),\sigma^2). \end{equation} \begin{figure} <>= data(FirDBHFec) firdata = na.omit(FirDBHFec[,c("TOTCONES","DBH","WAVE_NON")]) firdata$TOTCONES = round(firdata$TOTCONES) cones = 1+firdata$TOTCONES lm1 = lm(log(cones)~log(DBH),data=firdata) lm2 = lm(cones~DBH,data=firdata) nbNLL.0 = function(a,b,k) { predcones = a*firdata$DBH^b -sum(dnbinom(firdata$TOTCONES,mu=predcones,size=k,log=TRUE)) } @ <>= nbfit.0 = mle2(nbNLL.0,start=list(a=1,b=1,k=1)) cones2 = cones[firdata$DBH>6 & firdata$DBH<8] ndb = fitdistr(cones2,"negative binomial") nln = fitdistr(cones2,"lognormal") ngm = fitdistr(cones2,"gamma") @ <>= op=par(mfrow=c(1,2),lwd=2,bty="l",las=1,cex=1.5) plot(firdata$DBH,cones,log="xy", xlab="DBH", ylab="cones+1",col="gray",cex=0.7,axes=FALSE) axis(side=2) axis(side=1,at=seq(5,20,by=5)) box() abline(v=c(6,8),col="darkgray") curve(exp(coef(lm1)[1])*x^coef(lm1)[2],add=TRUE) with(as.list(coef(nbfit.0)),curve(a*(x-1)^b,add=TRUE,lty=2)) curve(coef(lm2)[1]+coef(lm2)[2]*x,add=TRUE,lty=3) par(xpd=NA) legend(10.5,6,lty=1:3, c("power/LN", "power/NB", "lin/normal"), cex=0.9,bty="n") par(xpd=FALSE) plot(density(cones2,adjust=0.8,from=0),ylim=c(0,0.04), main="",xlab="cones+1",lty=2,xlim=c(-20,100), axes=FALSE) axis(side=2) axis(side=1,at=seq(0,100,by=25)) box() text(40,0.04,"6 < DBH < 8") points(cones2,runif(length(cones2),max=0.001),col="gray") with(as.list(ndb$estimate),lines(0:100,dnbinom(0:100,size=size,mu=mu),type="s"), lty=1) with(as.list(nln$estimate),curve(dlnorm(x,meanlog,sdlog),add=TRUE,lty=3)) curve(dnorm(x,mean(cones2),sd(cones2)),add=TRUE,lty=4) abline(v=0,col="darkgray") legend(25,0.038, lty=c(2,1,3,4), c("density","NB","LN","normal"),bty="n") ##with(as.list(ngm$estimate),curve(dgamma(x,shape=shape,rate=rate),add=TRUE,lty=4)) @ \caption{Comparing different functional forms for fir fecundity data: power-law with a lognormal (LN) distribution, power-law with a negative binomial (NB) distribution, and linear with a normal distribution. (The linear model appears as a curved line because the data are plotted on a log-log scale.)} \label{fig:formcomp} \end{figure} Figure~\ref{fig:formcomp}a shows all three models for the DBH--fecundity relationship --- % power-law with a negative binomial distribution (power/NB), power-law with a lognormal distribution (power/LN), and linear with a normal distribution --- fitted to the fir data; all are plausible. Figure~\ref{fig:formcomp}b shows various models for the distribution of cone production, fitted to the individuals with DBH between 6~and~8~cm: a nonparametric density estimate, the negative binomial, log-normal, and normal. The negative binomial is closest to the nonparametric density estimate of the distribution, while the lognormal is more peaked and the normal distribution has a significant (and unrealistic) negative tail. Although the power-law/negative binomial is the most realistic and has a plausible mechanistic interpretation (the data are discrete, positive, and overdispersed; we can imagine individual trees producing cones at an approximately constant rate with variation in fecundity among trees), the difference between the fit of negative binomial and lognormal distributions is small enough that the convenience of linear regression may be worthwhile. When the results of different models are similar on both biological and statistical grounds, you choose among them by balancing convenience, mechanistic arguments, and convention. Why might you want to use standard, special-case procedures rather than the general MLE approach? \begin{itemize} \item{\emph{Computational speed and stability}: the special-case procedures use special-case optimization algorithms that are faster (sometimes much faster) and less likely to encounter numerical problems. Many of these procedures relieve you of the responsibility of choosing starting parameters.} \item{\emph{Stable definitions}: the definitions of standard models have often been chosen to simplify parameter estimation. For example, to model a relatively sudden change between two states you could choose between a logistic equation or a threshold model. Both might be equally sensible in terms of the biology, but the logistic equation is easier to fit because it involves smoother changes as parameters change. Similarly, generalized linear models such as logistic or Poisson regression fit parameters on scales (logit- or log-transformed, respectively) that allow unconstrained optimization.} \item{\emph{Convention}: if you use a standard method, you can just say (for example) ``we used linear regression'' in your Methods section and no-one will think twice. If you use a non-standard method, you need to explain the method carefully and overcome readers' distrust of ``fancy'' statistics --- even if your model is actually simpler and more appropriate than any standard model. Similarly, it may minimize confusion to use the same models, and the same parameterizations, as previous studies of your system. } \item{\emph{Varying models and comparing hypotheses}: the machinery built into \R\ and other packages makes it easy to compare a variety of models. For example, when analyzing a factorial growth experiment that manipulates nitrogen (\code{N}) and phosphorus (\code{P}), you can easily switch between models incorporating the effects of nitrogen only (\verb+growth~N+), phosphorus only (\verb+growth~P+), additive effects of N and P (\verb!growth~N+P!), or the main effects plus interactions between nitrogen and phosphorus (\verb!growth~N*P!). You can carry out all of these comparisons by hand with your own models, and \code{mle2}'s formula interface is helpful, but \R's built-in functions make the process easy for classical models. } \end{itemize} This chapter discusses how a variety of different kinds of models fit together, and how they all represent special cases of a general likelihood framework. Figure~\ref{fig:allmodels} shows how many of these areas are connected. The chapter also gives \emph{brief} descriptions of how to use them in \R: if you want more details on any of these approaches, you'll need to check an introductory \citep{Dalgaard2003,Crawley2005,Verzani2005}, intermediate \citep{Crawley2002}, or advanced \citep{ChambersHastie1992,MASS} reference. \begin{figure} \includegraphics[width=5in]{models} \caption{All (or most) of statistics. The labels in parentheses (non-normal errors and nonlinearity) imply restricted cases: (non-normal errors) means exponential family (e.g. binomial or Poisson) distributions, while (nonlinearity) means nonlinearities with an invertible linearizing transformation. Models to the right of the gray dashed line involve multiple levels or types of variability; see Chapter~10.} \label{fig:allmodels} \end{figure} \section{General linear models} \emph{General linear models} include linear regression, one- and multi-way analysis of variance (ANOVA), and analysis of covariance (ANCOVA): \R\ uses the function \code{lm} for all of these procedures. SAS implements this with PROC GLM\footnote{This terminology is unfortunate since the rest of the world uses ``GLM'' to mean general\emph{ized} linear models, which correspond to SAS's PROC GENMOD.}. While regression, ANOVA, and ANCOVA are often handled differently, and they are usually taught differently in introductory statistics classes, they are all variants of the same basic model. The assumptions of the general linear model are that all observed values are independent and normally distributed with a constant variance (\emph{homoscedastic}), and that any continuous predictor variables (covariates) are measured without error. (Remember that the assumption of normality applies to the variation around the expected value --- the residuals --- not to the whole data set.) The ``linear'' part of ``general linear model'' means that the models are linear functions \emph{of the parameters}, not necessarily of the independent variables. For example, quadratic regression \begin{equation} Y \sim \mbox{Normal}(a + b x + c x^2,\sigma^2) \end{equation} is still linear in the parameters ($a$, $b$, $c$), and thus is a form of multiple linear regression. Another way to think about this is to say that $x^2$ is just another explanatory variables --- if you called it $w$ instead, it would be clear that this model is an example of multivariate linear regression. On the other hand, $Y \sim \mbox{Normal}(a x^b,\sigma^2)$ is nonlinear: it is linear with respect to $a$ (the second derivative of $a x^b$ with respect to $a$ is zero), but nonlinear with respect to $b$ ($d^2(a x^b)/db^2 = b\cdot (b-1)\cdot a x^{b-2} \neq 0$). \subsection{Simple linear regression} \label{sec:linreg} Simple, or ordinary, linear regression predicts $y$ as a function of a single continuous covariate $x$. The model is \begin{equation} Y \sim \mbox{Normal}(a + b x,\sigma^2), \end{equation} denoting the response variable as $Y$ rather than $y$ since it's a random variable. The equivalent \R\ code is <>= lm.reg = lm(y~x) @ The intercept term $a$ is implicit in the \R\ model. If you want to force the intercept to be equal to zero, fitting the model $Y \sim \mbox{Normal}(bx,\sigma^2)$, use \verb+lm(Y~X-1)+. Typing \code{lm.reg} by itself prints only the formula and the estimates of the coefficients; \code{summary(lm.reg)} also gives summary statistics (range and quartiles) of the residuals, standard errors and $p$-values for the coefficients, and $R^2$ and $F$ statistics for the full model; \code{coef(lm.reg)} gives the coefficients alone, and \code{coef(summary(lm.reg))} pulls out the table of estimates, standard errors, $t$ statistics, and $p$ values. \code{confint(lm.reg)} calculates confidence intervals. The function \code{plot(lm.reg)} displays various graphical diagnostics that show how well the assumptions of the model fit and whether particular points have a strong effect on the results: see \code{?plot.lm} for details. \code{anova(lm.reg)} prints an ANOVA table for the model% \footnote{\code{anova} gives so-called \emph{sequential sums of squares}, which SAS calls ``type I'' sums of squares. If you need SAS-style ``type III'' sums of squares, you can use the \code{Anova} function in the \code{car} package. However, be aware that type~III sums of squares are actually problematic, and indeed controversial \citep{Venables1998}.}. If you need to extract numeric values of, e.g., $R^2$ values or $F$ statistics for further analysis, wade through the output of \code{str(summary(lm.reg))} to find the pieces you need (e.g. \verb+summary(lm.reg)$r.squared+). To do linear regression by brute force with \code{mle2}, you could write this negative log-likelihood function <<>>= linregfun = function(a,b,sigma) { Y.pred = a+b*x -sum(dnorm(Y,mean=Y.pred,sd=sigma,log=TRUE)) } @ or use the formula interface: <>= mle2(Y~dnorm(mean=a+b*x,sd=sigma),start=...) @ When using \code{mle2} you must explicitly fit a standard deviation term $\sigma$ which is implicit in the \code{lm} approach. \subsection{Multiple linear regression} It's easy to extend the simple linear regression model to multiple continuous predictor variables (covariates). If the extra covariates are powers of the original variable ($x^2, x^3, \ldots$), the model is called \emph{polynomial} regression (\emph{quadratic} with just the $x^2$ term added): \begin{equation} Y \sim \mbox{Normal}(a+b_1 x + b_2 x^2,\sigma^2). \end{equation} Or you can use completely separate variables ($x_1, x_2, \ldots$): \begin{equation} Y \sim \mbox{Normal}(a+b_1 x_1 + b_2 x_2 + b_3 x_3,\sigma^2) \end{equation} As with simple regression, the intercept $a$ and the coefficients of the different covariates ($b_1, b_2$) are implicit in the \R\ formula: <>= lm.poly = lm(y~x+I(x^2)) @ (surround \verb!x^2! and other powers of \code{x} with \code{I()}, ``as is'') or <>= lm.mreg = lm(y~x1+x2+x3) @ You can add interactions among covariates, testing whether the slope with respect to one covariate changes linearly as a function of another covariate --- e.g. $Y \sim \mbox{Normal}(a + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2 ,\sigma^2)$: in \R, \verb!lm.intreg = lm(y~x1*x2)!. Use the \code{anova} function with \code{test="Chisq"} to perform likelihood ratio tests on a nested series of multivariate linear regression models (e.g. \code{anova(lm1,lm2,lm3,test="Chisq")}). If you wonder why \code{anova} is a test for regression models, remember that regression and analyses of variance are just different subsets of the general linear model. While multivariate regression is conceptually simple, models with many terms (e.g. models with many covariates or with multi-way interactions) can be difficult to interpret. Blind fitting of models with many covariates can get you in trouble \citep{Whittingham+2006}. If you absolutely must go on this kind of fishing expedition, you can use \code{step}, or \code{stepAIC} in the \code{MASS} package to do stepwise modeling, or \code{regsubsets} in the \code{leaps} package to search for the best model. \subsection{One-way analysis of variance (ANOVA)} If the predictor variables are discrete (factors) rather than continuous (covariate), the general linear model becomes an analysis of variance. The basic model is \begin{equation} Y_{i} \sim \mbox{Normal}(\alpha_i,\sigma^{2}); \end{equation} in \R\ it is <>= lm.1way = lm(y~f) @ where \code{f} is a factor. If your original data set has names for the factor levels (e.g. \{N,S,E,W\} or \{high,low\}) then \R\ will automatically transform the treatment variable into a factor when it reads in the data. However, if the factor levels look like numbers to \R\ (e.g. you have site designations 101, 227, and 359, or experiments numbered 1 to 5), \R\ will interpret them as continuous rather than discrete predictors, and will fit a linear regression rather than doing an ANOVA --- not what you want. Use \verb+v = factor(v)+ to turn a numeric variable \code{v} into a factor, and then fit the linear model. Executing \code{anova(lm.1way)} produces a basic ANOVA table; \code{summary(lm.1way)} gives a different view of the model, testing the significance of each parameter against the null hypothesis that it equals 0; for a factor with only two levels, these tests are statistically identical. When fitting regression models, the parameters of the model are easy to interpret --- they're just the intercept and the slopes with respect to the covariates. When you have factors in the model, however --- as in ANOVA --- the parameterization becomes trickier. By default, \R\ parameterizes the model in terms of the differences between the first group and subsequent groups (\emph{treatment contrasts}) rather than in terms of the mean of each group, although you can tell it to fit the means of each group by putting a \code{-1} in the formula (e.g. \verb+lm.1way = lm(y~f-1)+: see pp.~\pageref{contrasts}, \pageref{contrasts2}, and \pageref{contrasts3}). % If you were writing a negative log-likelihood model % for a one-way ANOVA with three treatment levels, % you'd probably write it like this: <>= aov1fun = function(m1,m2,m3,sigma) { Y.pred = c(m1,m2,m3)[f] -sum(dnorm(Y,mean=Y.pred,sd=sigma,log=TRUE)) } @ % The parameters \code{m1}, \code{m2}, and \code{m3} represent the means % of each group. However, \R\ parameterizes the model instead with a % parameter called \code{(Intercept)} that represents the mean of the % first group (i.e. the group associated with the first level of the % factor, which is the first name in an alphabetical list unless you % have changed the order by hand) and two parameters that give the % \emph{differences} between the baseline group mean and the means of % the other two two groups (in the old parameterization, these would be % equivalent to \code{m2-m1} and \code{m3-m1}). This parameterization % sets the parameters up so that each parameter represents a % \emph{contrast}; i.e., testing whether the value of a parameter is % zero is equivalent to testing a particular statistical hypothesis % \citep{Crawley2002}. %% reference earlier \subsection{Multi-way ANOVA} Multi-way ANOVA models $Y$ as a function of two or more different categorical variables (factors). For example, the full model for two-way ANOVA with interactions is \begin{equation} Y_{ij} \sim \mbox{Normal}(\alpha_i + \beta_j + \gamma_{ij},\sigma^2) \end{equation} where $i$ is the level of the first treatment/group, and $j$ is the level of the second. The \R\ code using \code{lm} is: <>= lm.2way = lm(Y~f1*f2) @ (\code{f1} and \code{f2} are factors). As before, \code{summary(lm.2way)} gives more information, testing whether the parameters differ significantly from zero; \code{confint(lm.2way)} computes confidence intervals; \code{anova(lm.2way)} generates a standard ANOVA table; \code{plot(lm.2way)} shows diagnostic plots. If you want to fit just the main effects without the interactions, use \code{lm(Y\~{}f1+f2)}; use \code{f1:f2} to specify an interaction between \code{f1} and \code{f2}. A negative log-likelihood function for \code{mle} could look like this: <<>>= aov2fun = function(m11,m12,m21,m22,sigma) { intval = interaction(f1,f2) Y.pred = c(m11,m12,m21,m22)[intval] -sum(dnorm(Y,mean=Y.pred,sd=sigma,log=TRUE)) } @ (\code{interaction(f1,f2)} defines a factor representing the interaction of \code{f1} and \code{f2} with levels in the order (\code{1.1}, \code{2.1}, \code{1.2}, \code{2.2})). Using the formula interface: <>= mle2(Y~dnorm(mean=m,sd=sigma),parameters=list(m~f1*f2)) @ For a multiway model, \R's parameters are again defined in terms of contrasts. If you construct a two-way ANOVA with factors \code{f1} (with levels \code{A} and \code{B}) and \code{f2} (with levels \code{I} and \code{II}), the first (``intercept'') parameter will be the mean of individuals in level \code{A} of the first factor and \code{I} of the second (\code{m11}); the second parameter is the difference between \code{A,II} and \code{A,I} (\code{m12-m11}); the third is the difference between \code{B,I} and \code{A,I} (\code{m21-m11}); and the fourth, the interaction term, is the difference between the mean of \code{B,II} and its expectation if the effects of the two factors were additive (\code{m22-(m11+(m12-m11)+(m21-m11))} = \code{m22-m12-m21+m11}). %%%% FIX ME: type III SS etc. In its \code{anova} tables, \R\ One difference between \R\ and other statistical packages to watch \subsection{Analysis of covariance (ANCOVA)} Analysis of covariance defines a statistical model that allows for different intercepts and slopes with respect to a covariate $x$ in different groups: \begin{equation} Y_i \sim \mbox{Normal}(\alpha_i + \beta_i x ,\sigma^2) \end{equation} In \R: <>= lm(Y~f*x) @ where $f$ is a factor and $x$ is a covariate (the formula \verb$Y~f+x$ would specify parallel slopes, \verb$Y~f$ would specify zero slopes but different intercepts, \verb$Y~x$ would specify a single slope). Figure~\ref{fig:firglm} shows the fit of the model \code{lm(log(TOTCONES+1) \~{} log(DBH)+WAVE\_NON)} to the fir data. As suggested by the figure, there is a strong effect of DBH but no significant effect of population (wave vs. non-wave). As with other general linear models, use \code{summary}, \code{confint}, \code{plot}, and \code{anova} to analyze the model. The parameters are now the intercept of the first factor level; the slope with respect to \code{x} for the first factor level; the differences in the intercepts for each factor level other than the first; and the differences in the slopes for each factor level other than the first. A negative log-likelihood function for ANCOVA: <<>>= ancovafun = function(i1,i2,slope1,slope2,sigma) { int = c(i1,i2)[f] slope = c(slope1,slope2)[f] Y.pred = int+ slope*x -sum(dnorm(Y,mean=Y.pred,sd=sigma,log=TRUE)) } @ \subsection{More complex general linear models} You can add factors (grouping variables) and interactions between factors in different ways to make multi-way ANOVA, covariates (continuous independent variables) to make multiple linear regression, and combinations to make different kinds of analysis of covariance. \R\ will automatically interpret formulas based on whether variables are factors or numeric variables. \begin{figure} <>= data(FirDBHFec) firdata = na.omit(FirDBHFec[,c("TOTCONES","DBH","WAVE_NON")]) firdata$TOTCONES = round(firdata$TOTCONES) cones = 1+firdata$TOTCONES logDBH=log(firdata$DBH) fir.dwi = lm(log(cones)~logDBH*WAVE_NON,data=firdata) fir.dw = lm(log(cones)~logDBH+WAVE_NON,data=firdata) fir.d = lm(log(cones)~logDBH,data=firdata) fir.w = lm(log(cones)~WAVE_NON,data=firdata) fir.0 = lm(log(cones)~1,data=firdata) @ <>= op=par(lwd=2,bty="l",las=1,cex=1.5,mar=c(5,4,2,4)+0.1,mgp=c(2.5,1,0)) plot(log(cones)~logDBH,pch=as.numeric(firdata$WAVE_NON), xlab="log(DBH)",ylab="log(cones+1)") abline(a=coef(fir.dw)[1],b=coef(fir.dw)[2]) abline(a=coef(fir.dw)[1]+coef(fir.dw)[3],b=coef(fir.dw)[2],lty=2) par(xpd=NA) legend(2.65,1.5,lty=1:2, pch=1:2,c("nonwave","wave"),bty="n") par(xpd=FALSE) par(op) @ \caption{General linear model fit to fir fecundity data (analysis of covariance): \code{lm(log(TOTCONES+1)\~{}log(DBH)+WAVE\_NON,data=firdata)}. (Lines are practically indistinguishable between groups.)} \label{fig:firglm} \end{figure} \section{Nonlinearity: nonlinear least squares} \emph{Nonlinear least squares} models relax the requirement of linearity, but keep the requirements of independence and normal errors. Two common examples are the power-law model with normal errors \begin{equation} Y \sim \mbox{Normal}(a x^b ,\sigma^2) \end{equation} and the Ricker model with normal errors \begin{equation} Y \sim \mbox{Normal}(a x e^{-r x},\sigma^2). \end{equation} Before computers were ubiquitous, the only practical way to solve these problems was to \emph{linearize} them by finding a transformation of the parameters (e.g. log-transforming $x$ and $y$ to do power-law regression). A lot of ingenuity went into developing transformation methods to linearize common functions. However, transforming variables changes the distribution of the error as well as the shape of the dependence of $y$ on $x$. Ideally we'd like to find a transformation that simultaneously produces a linear relationship and makes the errors normally distributed with constant variance, but these goals are often incompatible. If the errors are normal with constant variance, they won't be any longer after you transform the data to linearize $f(x)$. The modern way to solve these problems without distorting the error structure, or to solve other models that cannot be linearized by transforming them, is to minimize the sums of squares (equivalent to minimizing the negative log-likelihood) computationally, using quasi-Newton methods similar to those built into \code{optim}. Restricting the variance model to Normally distributed errors with constant variance allows the use of specific numeric methods that are more powerful and stable than the generalized algorithms that \code{optim} uses. In \R, use the \code{nls} command, specifying a nonlinear formula and the starting values (as a list): e.g., for the power model <>= n1 = nls(y~a*x^b,start=list(a=1,b=1)) @ \code{summary(n1)} shows values of parameters and standard errors; \code{anova(n1,...)} does likelihood ratio tests for nested sequences of nonlinear fits; and \code{confint(n1)} computes profile confidence limits which are more accurate than the confidence limits suggested by \code{summary(n1)}. (Unfortunately, \code{plot(n1)} does nothing.) Figure~\ref{fig:nls1} shows the fit of a nonlinear least-squares model (\verb!nls(TOTCONES~a*DBH^b)!) to the fir fecundity data set, along with the log-log fit (equivalent to a power-law fit with lognormal errors) calculated above. The power-lognormal model is probably better from a biological point of view, since the normal distribution can have negative values, but both models are reasonable. Fitting models with both nonlinear covariates and categorical variables (the nonlinear analogue of ANCOVA --- e.g., fitting different $a$ and $b$ parameters for wave and non-wave populations) is more difficult, but two functions from the \code{nlme} package, \code{nlsList} and \code{gnls} (generalized nonlinear least squares), can handle such models. \code{nlsList} does completely separate fits for separate groups --- for example, <>= nlsList(TOTCONES~a*DBH^b|WAVE_NON, data=firdata,start=list(a=0.1,b=2.7)) @ would fit separate $a$ and $b$ parameters for wave and non-wave populations --- but all parameters will vary among groups. <>= nlsList(TOTCONES~a*DBH^b|WAVE_NON,data=firdata,start=list(a=0.1,b=2.7)) gnls(TOTCONES~a*DBH^b,data=firdata,start=c(0.1,2.7,2.7), params=list(a~1,b~WAVE_NON)) @ The \code{gnls} command can fit models with only a subset of the parameters differing among groups: e.g. <>= gnls(TOTCONES~a*DBH^b,data=firdata,start=c(0.1,2.7,2.7), params=list(a~1,b~WAVE_NON)) @ will fit different $b$ parameters but the same $a$ parameter for wave and non-wave populations. While \code{nls} is more automated than \code{mle2} (for which you must specify the full negative log-likelihood function), the numerical methods it uses are similar to \code{mle2}'s in that (1) you must specify starting values and (2) if the starting values are unrealistic, or if the problem is otherwise difficult, the numerical optimization may get stuck. Errors such as \begin{verbatim} step factor [] reduced below 'minFactor' of ... \end{verbatim} \begin{verbatim} number of iterations exceeded maximum of ... \end{verbatim} or \begin{verbatim} Missing value or an infinity produced when evaluating the model \end{verbatim} indicate numerical problems. To solve these problems try to find better starting conditions, reparameterize your model, or play with the control options of \code{nls} (see \code{?nls.control}). As with ML models, you can often use simpler, more robust approaches like linear models to get a first estimate for the parameters (e.g. estimate the initial slope of a Michaelis-Menten function from the first 10\% of the data and the asymptote from the last 10\%, or estimate the parameters by linear regression based on a linearizing transform). \R\ includes some ``self-starting'' functions that do these steps automatically. The functions \code{SSlogis} and \code{SSmicmen}, for example, provide self-starting logistic and Michaelis-Menten functions. To fit a self-starting Michaelis-Menten model to the tadpole data with asymptote \code{a} and half-maximum \code{b}: <>= data(ReedfrogFuncresp) nls(Killed ~ SSmicmen(Initial, a, b), data = ReedfrogFuncresp) @ Use \code{apropos("SS",ignore.case=FALSE)} to see a more complete list of self-starting models: the names are cryptic, so check the help system to see what each model is. \begin{figure} <>= op=par(lwd=2,bty="l",las=1,cex=1.5,mgp=c(2.5,1,0)) firnls = nls(TOTCONES~a*DBH^b,start=list(a=0.1,b=2.7), data=firdata) plot(cones~DBH,data=firdata,ylab="Cones") with(as.list(coef(firnls)),curve(a*x^b,add=TRUE)) curve(exp(coef(fir.d)[1])*x^(coef(fir.d)[2]),add=TRUE,lty=2) legend("topleft",c("power/normal","power/LN"), lty=1:2,bty="n") par(op) @ \caption{A nonlinear least-squares fit to the fir fecundity data (\code{nls(TOTCONES\~{}a*DBH\^{}b,start=list(a=0.1,b=2.7,data=firdata))}); the linear model fit to the log-log data (equivalent to a power-law fit with lognormal errors) is also shown.} \label{fig:nls1} \end{figure} \textbf{Further reading:} \cite{BatesWatts1988}. \section{Non-normal errors: generalized linear models} \label{sec:GLM} Generalized linear models (not to be confused with general linear models) allow you to analyze models that have a particular kind of nonlinearity and particular kinds of non-normally distributed (but still independent) errors. Generalized linear models allow any nonlinear relationship that has a \emph{linearizing transformation}. That is, if $y=f(x)$, there must be some function $F$ such that $F(f(x))$ is a linear function of $x$. The procedure for fitting generalized linear models uses the function $F$ fit the data on the linearized scale ($F(y) = F(f(x))$) while calculating the expected variance on the untransformed scale in order to correct for the distortions that linearization would otherwise induce. In generalized-linear-model jargon $F$ is called the \emph{link} function. For example, when $f$ is the logistic curve ($y=f(x)=e^x/(1+e^x)$), the link function $F$ is a the logit function ($F(y)=\log(y/(1-y))=x$: see p.~\pageref{eq:logit} for the proof that the logit is really the inverse of the logistic). % \footnote{ % Proof that the logit transform linearizes the % logistic curve: % \begin{equation*} % \begin{split} % y & = e^x/(1+e^x) \\ % (1+e^x) y & = e^x \\ % y & = e^x(1-y) \\ % y/(1-y) & = e^x \\ % \log (y/(1-y)) & = x % \end{split} % \end{equation*} % }; \R\ knows about a variety of link functions including the log ($x=\log(y)$, which linearizes $y=e^x$); square-root ($x=\sqrt{y}$, which linearizes $y=x^2$); and inverse ($x=1/y$, which linearizes $y=1/x$): see \code{?family} for more possibilities. The class of non-normal errors that generalized linear models can handle is called the \emph{exponential family}. It includes Poisson, binomial, gamma and normal distributions, but not negative binomial or beta-binomial distributions. Each distribution has a standard link function: for example, the log link is standard for a Poisson and a logit link is standard for a binomial distribution. The standard link functions make sense for typical applications: for example, the logit transformation turns unconstrained values into values between 0 and 1, which are appropriate as probabilities in a binomial model. However, \R\ allows you some flexibility to change these associations for specific problems. GLMs are fit by a process called \emph{iteratively reweighted least squares}, which overcomes the basic problem that transforming the data to make them linear also changes the variance. The key is that given an estimate of the regression parameters, and knowing the relationship between the variance and the mean for a particular distribution, one can calculate the variance associated with each point. With this variance estimate, one re-estimates the regression parameters weighting each data point by the inverse of its variance; the new estimate gives new estimates of the variance; and so on. This procedure quickly and reliably fits the models, without the user needing to specify starting points. Generalized linear models combine a range of non-normal error distributions with the ability to work with some reasonable nonlinear functions. They also use the same simple model specification framework as \code{lm}, allowing us to explore combinations of factors, covariates, and interactions among variables. GLMs include logistic and binomial regression and log-linear models. They use terminology that should now be familiar to you; they estimate log-likelihoods and test the differences between models using the likelihood ratio test. The \code{glm} function implements generalized linear models in \R. By far the two most common GLMs are Poisson regression, for count data, and logistic regression, for survival/failure data. \begin{figure} <>= op=par(lwd=2,bty="l",las=1,cex=1.5,mgp=c(2.5,1,0)) data(ReedfrogFuncresp) glm1 = glm(cbind(Killed,Initial-Killed) ~ Initial, family=binomial,data=ReedfrogFuncresp) glm2 = glm(cbind(Killed,Initial-Killed) ~ Initial, family=binomial(link="log"),data=ReedfrogFuncresp) with(ReedfrogFuncresp,plot(Killed/Initial~Initial,xlab="Initial density", ylab="Fraction killed")) ivec = 1:100 lines(ivec,predict(glm1,newdata=list(Initial=ivec),type="response")) lines(ivec,predict(glm2,newdata=list(Initial=ivec),type="response"),lty=2) legend("topright",c("logistic regression","log-binomial model"), lty=1:2,bty="n") par(op) @ \caption[short]{Logistic (binomial) regression and log-binomial regression of fraction of tadpoles killed as a function of tadpole density. Logistic regression: \\ \code{glm(cbind(Killed,Initial-Killed)\~{}Initial, data=ReedfrogFuncresp,family="binomial")} \\ Log-binomial regression: \\ \code{glm(...,family=binomial(link="log"),...)} %% VERBATIM (and Schunk) doesn't work in captions? % \begin{verbatim} % glm(cbind(Killed,Initial-Killed)~Initial, % data=ReedfrogFuncresp,family="binomial") % \end{verbatim} % Log-binomial regression: % \begin{verbatim} % glm(cbind(Killed,Initial-Killed)~Initial, % data=ReedfrogFuncresp,family="binomial") % \end{verbatim} <>= glm(cbind(Killed,Initial-Killed)~Initial, data=ReedfrogFuncresp,family="binomial") glm(cbind(Killed,Initial-Killed)~Initial, data=ReedfrogFuncresp,family="binomial") @ } \label{fig:logistreg} \end{figure} \begin{itemize} \item{Poisson regression: log link, Poisson error ($Y \sim \mbox{Poisson}(a e^{bx})$); <>= glm1 = glm(y~x,family="poisson") @ The equivalent likelihood function is: <<>>= poisregfun = function(a,b) { Y.pred = exp(a+b*x) -sum(dpois(y,lambda=Y.pred,log=TRUE)) } @ } \item{Logistic regression: logit link, binomial error ($Y \sim \mbox{Binom}(p=\exp(a+bx)/(1+\exp(a+bx)),N)$): <>= glm2 = glm(cbind(y,N-y)~x,family="binomial") @ or <<>>= logistregfun = function(a,b) { p.pred = exp(a+b*x)/(1+exp(a+b*x)) -sum(dbinom(y,size=N,prob=p.pred,log=TRUE)) } @ (you could also say \code{p.pred=plogis(a+b*x)} in the first line of \code{logistregfun}). } \end{itemize} Another useful application of GLMs is fitting models of exponentially decreasing survival, $Y \sim \mbox{Binom}(p=\exp(a+bx),N)$. \cite{Strong+1999} modeled the survival probability of ghost moth caterpillars as a decreasing function of density (and as a function of the presence or absence of entomopathogenic nematodes); \cite{Tiwari+2006} modeled the probability that nesting sea turtles would \emph{not} dig up an existing nest as a decreasing function of nest density. You can fit such a model this way: <>= glm3 = glm(cbind(y,N-y)~x,family=binomial(link="log")) @ Use \code{family=binomial(link="log")} instead of \code{family="binomial"} to specify the log instead of the logit link function. The equivalent negative log-likelihood function is: <<>>= logregfun = function(a,b) { p.pred = exp(a+b*x) -sum(dbinom(y,size=N,prob=p.pred,log=TRUE)) } @ You can fit Vonesh's tadpole mortality data with either a logistic or a log-binomial model (Figure~\ref{fig:logistreg}), but the fact that expected survival decreases exponentially at high densities in both models causes problems of interpretation. If the probability of survival declines exponentially with density --- which is generally true for the log-binomial model and approximately true at high densities for the logistic --- then the expected \emph{number}. surviving is $p(x) \cdot x = e^{-(a+bx)} x = cxe^{-bx}$. This is a Ricker function, which decreases to zero at high density rather than reaching an asymptote. The standard type~II functional response model uses $p(x)=A/(1+Ahx)$, which has a weaker dependence on $x$ (exponentials are always stronger than powers of $x$), and so the limit of $p(x) x$ as $x$ becomes large is $1/h$. Thus, the GLM while convenient is not really appropriate in this case. <>= glm4 = glm(cbind(Killed,Initial-Killed) ~ Initial, family=quasi(link="inverse",variance="mu(1-mu)"), data=ReedfrogFuncresp) x = 10:20 p = 1/(1.2+5.5*x) y = rbinom(length(x),size=x,prob=p) testglm = glm(cbind(y,x-y)~x, family=quasi(link="inverse",variance="mu(1-mu)"), start=c(1.9,-0.4)) testglm = glm(cbind(y,x-y)~x, family=quasi(link="inverse",variance="constant"), start=c(1.2,5.5)) @ After you fit a GLM, you can use the same generic set of modeling functions --- % \code{summary}, \code{coef}, \code{confint}, \code{anova}, and \code{plot} --- to % examine the parameters, test hypotheses, and plot residuals. %%\todo{Check plot.glm -- sensible?} \code{anova(glm1,glm2,\ldots)} does an \emph{analysis of deviance} (likelihood ratio tests) on a nested sequence of models. As with \code{lm}, the default parameters represent (1) the intercept (the baseline value of the first treatment), (2) differences in the intercept between the first and subsequent treatments, (3) the slope(s) with respect to the covariate(s) for the first group, or (4) differences in the slope between the first and subsequent treatments. However, all of the parameters are given on the scale of the link function (e.g. log scale for Poisson models, logit scale for binomial models); you interpret them, you need to transform them with the inverse link function (exponential for Poisson, logistic (=\code{plogis}) for binomial). For example, the coefficients of the logistic regression shown in Figure~\ref{fig:logistreg} <>= a = round(coef(glm1)[1],3) b = round(coef(glm1)[2],4) @ are intercept=\Sexpr{a} slope=\Sexpr{b}. To find the probability of mortality at a tadpole density of 60, calculate $\exp(\Sexpr{a}+\Sexpr{b} \cdot 60)/(1+\exp(\Sexpr{a}+\Sexpr{b} \cdot 60)= \Sexpr{round(plogis(a+b*60),3)}$. \textbf{Further reading:} \cite{McCullaghNelder1989,Dobson1990,HastiePregibon1992,% Lindsay1997}. \R-specific: \cite{Crawley2002,Faraway2006}. \subsection{Models for overdispersion} To go beyond the exponential family of distributions (normal, binomial, Poisson, gamma) you may well need to roll your own ML estimator. \R\ has two built-in possibilities for the very common case of discrete data with \emph{overdispersion}, i.e. more variance than would be expected from the standard (Poisson and binomial) models for discrete data. \subsubsection{Quasilikelihood} \emph{Quasilikelihood} models ``inflate'' the expected variance of models to account for overdispersion \citep{McCullaghNelder1989}. For example, the expected variance of a binomial distribution with $N$ samples and probability $p$ is $Np(1-p)$. The \emph{quasibinomial} model adds another parameter, $\phi$, which inflates the variance to $\phi N p (1-p)$. The \emph{overdispersion parameter} $\phi$ (\cite{BurnhamAnderson2004} call it $\hat c$) is usually greater than 1 -- we usually find more variance than expected, rather than less. Quasi-Poisson models are defined similarly, with variance equal to $\phi \lambda$. This approach is called \emph{quasi}likelihood because we don't specify a real likelihood model with a probability distribution for the data. We just specify the relationship between the mean and the variance. Nevertheless, the quasilikelihood approach works well in practice. \R\ uses the \code{family} function to specify quasilikelihood models. Because the quasilikelihood is not a true likelihood, we cannot use likelihood ratio tests or other likelihood-based methods for inference, but the parameter estimates and $t$-statistics generated by \code{summary} should still work. However, various researchers have suggested that using an $F$ test based on the ratio of deviances should still be appropriate: use \code{anova(...,test="F")} \citep{Crawley2002,MASS}. \cite{BurnhamAnderson2004} suggest using differences in ``quasi-AIC'' (qAIC) in this case, where the $\Delta \mbox{qAIC}$ is the $\Delta \mbox{AIC}$ value divided by the estimate of $\phi$. Since the log is the default link function for the \code{quasipoisson} family, you can fit a quasi-Poisson log-log model for fecundity as follows: <>= glm(TOTCONES~log(DBH),data=firdata,family="quasipoisson") @ \subsubsection{Negative binomial models} Although the exponential family does not strictly include the negative binomial distribution, negative binomial models can be fit by a small extension of the GLM approach, iteratively fitting the $k$ (overdispersion) parameter and then fitting the rest of the model with a fixed $k$ parameter. The \code{glm.nb} function in the \code{MASS} package fits linear negative binomial models, although they restrict the model to a single $k$ parameter for all groups. (Use \verb!$theta! to extract the estimate of the negative binomial $k$ parameter from a negative binomial model.) Because we can use a log link, it turns out that we can exactly replicate our preferred log-likelihood model ($\mbox{cones} \sim \mbox{NegBinom}(a \cdot \mbox{DBH}^b,k)$) with the following command: <>= glm.nb(TOTCONES~log(DBH),data=firdata) @ The only difference from our earlier model is that the estimated intercept parameter is $\log(a)$ rather than $a$. \newpage \section*{\R\ supplement} Here's how to fit various linear models to the log-transformed fir data. Since the data (\code{TOTCONES}) contain some zero values, taking logarithms would give us negative infinite values. We either need to drop these values (\code{subset=TOTCONES>0}) or add an offset of 1, in order to avoid infinities. However, since there are few zeros in the data (\code{sum(firdata\$TOTCONES==0)} is \Sexpr{sum(firdata$TOTCONES==0)} out of a total of \Sexpr{nrow(firdata)} data points) and the mean number of cones is large, this adjustment shouldn't affect the results much. If zeros are frequent so that such an adjustment would be likely to affect your results significantly, or if the results vary depending on how big an offset you add, consider a different model (Section~\ref{sec:GLM}). <<>>= logcones = log(firdata$TOTCONES+1) lm.0 = lm(logcones~1,data=firdata) lm.d = lm(logcones~log(DBH),data=firdata) lm.w = lm(logcones~WAVE_NON,data=firdata) lm.dw = lm(logcones~log(DBH)+WAVE_NON,data=firdata) lm.dwi = lm(logcones~log(DBH)*WAVE_NON,data=firdata) @ Since \code{log(DBH)} is a covariate and \verb!WAVE_NON! is a factor, \code{lm.d} is a regression; \code{lm.w} is a one-way ANOVA; and \code{lm.dw} and \code{lm.dwi} are ANCOVA models with parallel and non-parallel slopes, respectively. A few different ways to analyze the data: <<>>= anova(lm.0,lm.d,lm.dw,lm.dwi) AIC(lm.0,lm.d,lm.w,lm.dw,lm.dwi) @ (I left \code{lm.w} out of the \code{anova} statement because it and \code{lm.d} cannot be nested.) \code{anova} compares the models sequentially, while \code{AIC} compares them simultaneously. \code{AICtab} in the \code{emdbook} offers a few more options such as sorting the table in order of increasing AIC or computing AIC weights. Try \code{coef}, \code{summary}, and \code{confint} on these models as well. The full ANCOVA model fit via \code{mle2}: <<>>= ancovafun = function(i1,i2,slope1,slope2,sigma) { int = c(i1,i2)[WAVE_NON] slope = c(slope1,slope2)[WAVE_NON] Y.pred = int+ slope*log(DBH) -sum(dnorm(logcones,mean=Y.pred,sd=sigma,log=TRUE)) } m1 = mle2(ancovafun,start=list(i1=-2,i2=-2,slope1=2.5,slope2=2.5,sigma=1), data=firdata) AIC(m1) @ The maximum likelihood fit gives the same AIC as the \code{lm} fit. You can't always take this equality for granted, since different models that are formally equivalent may include different constants in the likelihood, and different functions may count the number of parameters differently. As pointed out in the text, the models are parameterized differently: <<>>= coef(lm.dwi) coef(m1) @ You can check that the answers are equivalent: for example, the slope of the wave population is \code{slope2}=\Sexpr{round(coef(m1)["slope2"],3)}={} \code{logDBH}+\code{logDBH:WAVE\_NONw}. In order to do the full model comparison with \code{mle2}, you have to construct a series of nested models (analogous to \code{lm.dw}, \code{lm.d}, \code{lm.w}, \code{lm.0}). This is a bit tedious --- one reason for using built-in functions where possible. You may want to read about the \code{model.matrix} function, which can simplify model construction. \code{model.matrix} uses a user-specified formula to construct a \emph{design matrix} that, when multiplied by a vector of parameters, gives the expected value of each data point. By default the design matrix uses parameters that represent baseline levels and differences among groups, as in \code{lm} and \code{glm}. \code{mle2}'s formula interface uses \code{model.matrix} internally, so that (for example) you can easily fit the full ANCOVA model by specifying <>= mle2(log(TOTCONES+1)~dnorm(logDBH*WAVE_NON),data=firdata,start=...) @