In this exercise we’ll fit a simple model and apply a variety of model diagnostics to it. Some may be very familiar …
library(arm) ## for sim()
library(descr) ## for LogRegR2
require("reshape2")
## graphics prettiness
library("ggplot2")
theme_set(theme_bw())
library("grid")
zmargin <- theme(panel.margin=unit(0,"lines"))
Data on lizard perching behaviour, from the brglm
package (and before that from McCullagh and Nelder (McCullagh and Nelder 1989), ultimately from Schoener (1970)).
lizards <- read.csv("data/lizards.csv")
## adjust factor levels to a sensible order
lizards$time <- factor(lizards$time,
levels=c("early","midday","late"))
A quick look at the data: response is fraction of Anolis grahami lizards found on perches in particular conditions. Plot univariate responses:
A more conventional plot:
(g1 <- ggplot(lizards,
aes(x=time,y=gfrac,colour=height))+
geom_point(aes(size=N))+
geom_line(aes(group=height))+
facet_grid(diameter~light,labeller=label_both)+zmargin)
Fit a basic (additive: no interactions) binomial GLM:
m1 <- glm(gfrac~time+height+light+diameter,
weights=N,
family="binomial",
data=lizards)
Standard R diagnostic plots (fitted vs. residual, scale-location, Q-Q, influence):
op <- par(mfrow=c(2,2)) ## 2x2 subplot grid
plot(m1)
par(op) ## restore original parameters
An improved Q-Q plot, from Augustin et al. (2012) by way of the mgcv
package:
library(mgcv)
qq.gam(m1,pch=16)
Check for overdispersion:
resid.ssq <- sum(residuals(m1,type="pearson")^2) ## sum of squares of Pearson resids
resid.df <- nrow(lizards)-length(coef(m1)) ## estimated resid df (N-p)
resid.ssq/resid.df ## ratio should be approx 1
## [1] 0.7405512
Not overdispersed, apparently.
Compute various pseudo-\(R^2\) measures:
library(descr)
LogRegR2(m1)
## Chi2 55.89726
## Df 5
## Sig. 8.532219e-11
## Cox and Snell Index 0.911991
## Nagelkerke Index 0.9574288
## McFadden's R2 0.7973723
Use fortify(model_fit)
to add the standard diagnostics (fitted values, residuals, standardized residuals, …) to the data from a model
m1F <- fortify(m1)
ggplot(m1F,aes(x=time,y=.resid))+geom_boxplot()+
geom_point(size=3,alpha=0.5)+
facet_grid(diameter~light)+zmargin
Uh-oh …
Other tests of distribution are a bit harder.
Or we can plot predicted values.
lPred <- predict(m1,se.fit=TRUE) ## gives predictions on logit scale
## compute back-transformed predictions and confidence intervals
lizardsX <- transform(lizards,pred=plogis(lPred$fit),
lwr=plogis(lPred$fit-2*lPred$se.fit),
upr=plogis(lPred$fit+2*lPred$se.fit))
## add predictions/CIs to previous plot
g1 + geom_pointrange(data=lizardsX,shape=2,
aes(y=pred,ymin=lwr,ymax=upr))
When you have continuous predictors or more complicated/unbalanced situations you will often want to construct your own data frame for predictions. For example, to get predictions for all light:time combinations for a specified diameter and height category:
predframe <- with(lizards,
expand.grid(light=levels(light),
time=levels(time),
diameter="<=2in",
height="<5ft"))
predict(m1,newdata=predframe)
Warning signs of two problems:
logistf
or brglm
, or regularization via Bayesian (arm::bayesglm
) or other approaches)betasim <- sim(m1,n.sims=500)
X <- model.matrix(m1)
simpreds <- plogis(X %*% t(coef(betasim)))
subset(lizards,gfrac==1.0)
## X grahami opalinus height diameter light time N gfrac
## 4 4 13 0 >=5ft <=2in sunny early 13 1
## 5 5 8 0 >=5ft <=2in sunny midday 8 1
## 6 6 12 0 >=5ft <=2in sunny late 12 1
## 10 10 6 0 >=5ft >2in sunny early 6 1
dim(simpreds)
## [1] 23 500
hist(simpreds[4,],breaks=50,col="gray",
main="Posterior pred sim")
bootfun <- function() {
bsamp <- sample(nrow(lizards),
size=nrow(lizards),
replace=FALSE)
bmodel <- update(m1,data=lizards[bsamp,])
bpred <- predict(bmodel,type="response")
}
bootpred <- replicate(500,bootfun())
hist(bootpred[4,],breaks=20,col="gray")
library(boot)
Need to define a cost function cost(observed, fitted)
; default is avg squared error
cost <- function(r, pi = 0) mean(abs(r-pi)) ## use mean abs dev
cv1 <- cv.glm(lizards,m1)
str(cv1)
## List of 4
## $ call : language cv.glm(data = lizards, glmfit = m1)
## $ K : num 23
## $ delta: num [1:2] 0.016 0.0159
## $ seed : int [1:626] 403 258 -1470087149 1555212249 -1039318305 -543890869 903448728 -1765165091 -882645339 -753600403 ...
cv1$delta
## [1] 0.01601797 0.01594105
Or do it by hand:
cverr <- numeric(nrow(lizards))
for (i in 1:nrow(lizards)) {
cvdata <- lizards[-i,]
cvmodel <- update(m1,data=cvdata)
predval <- predict(cvmodel,newdata=lizards[i,],
type="response")
cverr[i] <- cost(lizards$gfrac,predval)
}
hist(cverr,breaks=10,col="gray")
mean(cverr)
## [1] 0.1843507
This is leave-one-out cross-validation: \(K\)-fold is usually better (but maybe worth using cv.glm
instead)
library("robust")
)m2 <- update(m1,.~.^2)
) and see if that seems to fix any problems we found in the model. Compare this with the statistical significance of the added terms (summary(m2)
or drop1(m2,test="Chisq")
)Augustin, Nicole H., Erik-André Sauleau, and Simon N. Wood. 2012. “On Quantile Quantile Plots for Generalized Linear Models.” Computational Statistics & Data Analysis 56 (8): 2404–9. doi:10.1016/j.csda.2012.01.026.
Hauck, Walter W., and Allan Donner. 1977. “Wald’s Test as Applied to Hypotheses in Logit Analysis.” Journal of the American Statistical Association 72 (360): 851–53. doi:10.2307/2286473.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. London: Chapman; Hall.
Schoener, Thomas W. 1970. “Nonsynchronous Spatial Overlap of Lizards in Patchy Habitats.” Ecology 51 (3): 408–18. doi:10.2307/1935376.
Wickham, H., D. Cook, H. Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79. doi:10.1109/TVCG.2010.161.