Model diagnostics example

In this exercise we’ll fit a simple model and apply a variety of model diagnostics to it. Some may be very familiar …

library(arm) ## for sim()
library(descr)  ## for LogRegR2
require("reshape2")
## graphics prettiness
library("ggplot2")
theme_set(theme_bw()) 
library("grid")
zmargin <- theme(panel.margin=unit(0,"lines"))

Data on lizard perching behaviour, from the brglm package (and before that from McCullagh and Nelder (McCullagh and Nelder 1989), ultimately from Schoener (1970)).

lizards <- read.csv("data/lizards.csv")
## adjust factor levels to a sensible order
lizards$time <- factor(lizards$time,
                       levels=c("early","midday","late"))

A quick look at the data: response is fraction of Anolis grahami lizards found on perches in particular conditions. Plot univariate responses:

A more conventional plot:

(g1 <- ggplot(lizards,
    aes(x=time,y=gfrac,colour=height))+
  geom_point(aes(size=N))+
  geom_line(aes(group=height))+
  facet_grid(diameter~light,labeller=label_both)+zmargin)

Fit a basic (additive: no interactions) binomial GLM:

m1 <- glm(gfrac~time+height+light+diameter,
    weights=N,
    family="binomial",
    data=lizards)

Standard R diagnostic plots (fitted vs. residual, scale-location, Q-Q, influence):

op <- par(mfrow=c(2,2))  ## 2x2 subplot grid
plot(m1)

par(op) ## restore original parameters

An improved Q-Q plot, from Augustin et al. (2012) by way of the mgcv package:

library(mgcv)
qq.gam(m1,pch=16)

Check for overdispersion:

resid.ssq <- sum(residuals(m1,type="pearson")^2)  ## sum of squares of Pearson resids
resid.df <- nrow(lizards)-length(coef(m1))        ## estimated resid df (N-p)
resid.ssq/resid.df                                ## ratio should be approx 1

## [1] 0.7405512

Not overdispersed, apparently.

Compute various pseudo-\(R^2\) measures:

library(descr)
LogRegR2(m1)

## Chi2                 55.89726 
## Df                   5 
## Sig.                 8.532219e-11 
## Cox and Snell Index  0.911991 
## Nagelkerke Index     0.9574288 
## McFadden's R2        0.7973723

Use fortify(model_fit) to add the standard diagnostics (fitted values, residuals, standardized residuals, …) to the data from a model

m1F <- fortify(m1)
ggplot(m1F,aes(x=time,y=.resid))+geom_boxplot()+
    geom_point(size=3,alpha=0.5)+
    facet_grid(diameter~light)+zmargin

Uh-oh …

Other tests of distribution are a bit harder.

Or we can plot predicted values.

lPred <- predict(m1,se.fit=TRUE)  ## gives predictions on logit scale
## compute back-transformed predictions and confidence intervals
lizardsX <- transform(lizards,pred=plogis(lPred$fit),
              lwr=plogis(lPred$fit-2*lPred$se.fit),
              upr=plogis(lPred$fit+2*lPred$se.fit))
## add predictions/CIs to previous plot
g1 + geom_pointrange(data=lizardsX,shape=2,
            aes(y=pred,ymin=lwr,ymax=upr))

When you have continuous predictors or more complicated/unbalanced situations you will often want to construct your own data frame for predictions. For example, to get predictions for all light:time combinations for a specified diameter and height category:

predframe <- with(lizards,
                  expand.grid(light=levels(light),
                              time=levels(time),
                              diameter="<=2in",
                              height="<5ft"))
predict(m1,newdata=predframe)

Warning signs of two problems:

complete separation: all-zero or all-one in some categories (bias-reduced regression via logistf or brglm, or regularization via Bayesian (arm::bayesglm) or other approaches)
failure of the Wald approximation (Hauck-Donner effect, Hauck and Donner (1977))

Posterior predictive simulation

Pick some value (probe) of interest

betasim <- sim(m1,n.sims=500)
X <- model.matrix(m1)
simpreds <- plogis(X %*% t(coef(betasim)))
subset(lizards,gfrac==1.0)

##     X grahami opalinus height diameter light   time  N gfrac
## 4   4      13        0  >=5ft    <=2in sunny  early 13     1
## 5   5       8        0  >=5ft    <=2in sunny midday  8     1
## 6   6      12        0  >=5ft    <=2in sunny   late 12     1
## 10 10       6        0  >=5ft     >2in sunny  early  6     1

dim(simpreds)

## [1]  23 500

hist(simpreds[4,],breaks=50,col="gray",
     main="Posterior pred sim")

Bootstrapping

bootfun <- function() {
  
  bsamp <- sample(nrow(lizards),
                   size=nrow(lizards),
                 replace=FALSE)
  bmodel <- update(m1,data=lizards[bsamp,])
  bpred <- predict(bmodel,type="response")
}
bootpred <- replicate(500,bootfun())
hist(bootpred[4,],breaks=20,col="gray")

Cross-validation

library(boot)

Need to define a cost function cost(observed, fitted); default is avg squared error

cost <- function(r, pi = 0) mean(abs(r-pi)) ## use mean abs dev
cv1 <- cv.glm(lizards,m1)
str(cv1)

## List of 4
##  $ call : language cv.glm(data = lizards, glmfit = m1)
##  $ K    : num 23
##  $ delta: num [1:2] 0.016 0.0159
##  $ seed : int [1:626] 403 258 -1470087149 1555212249 -1039318305 -543890869 903448728 -1765165091 -882645339 -753600403 ...

cv1$delta

## [1] 0.01601797 0.01594105

Or do it by hand:

cverr <- numeric(nrow(lizards))
for (i in 1:nrow(lizards)) {
  cvdata <- lizards[-i,]
  cvmodel <- update(m1,data=cvdata)
  predval <- predict(cvmodel,newdata=lizards[i,],
                     type="response")
  cverr[i] <- cost(lizards$gfrac,predval)
}
hist(cverr,breaks=10,col="gray")

mean(cverr)

## [1] 0.1843507

This is leave-one-out cross-validation: \(K\)-fold is usually better (but maybe worth using cv.glm instead)

Notes

Problems in one area (outliers, non-linearity, etc.) can
Transforming data is generally the best solution for continuous data, isn’t really possible for non-continuous data; difficult for non-positive data
Transforming data helps with skew, not with fat tails – fat tails will give unbiased estimates, but poor confidence intervals; need robust statistics (library("robust"))
Can’t always fix everything, need to prioritize
See also: (Wickham et al. 2010)

Exercises

Change the model to incorporate two-way interactions (m2 <- update(m1,.~.^2)) and see if that seems to fix any problems we found in the model. Compare this with the statistical significance of the added terms (summary(m2) or drop1(m2,test="Chisq"))

References

Augustin, Nicole H., Erik-André Sauleau, and Simon N. Wood. 2012. “On Quantile Quantile Plots for Generalized Linear Models.” Computational Statistics & Data Analysis 56 (8): 2404–9. doi:10.1016/j.csda.2012.01.026.

Hauck, Walter W., and Allan Donner. 1977. “Wald’s Test as Applied to Hypotheses in Logit Analysis.” Journal of the American Statistical Association 72 (360): 851–53. doi:10.2307/2286473.

McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. London: Chapman; Hall.

Schoener, Thomas W. 1970. “Nonsynchronous Spatial Overlap of Lizards in Patchy Habitats.” Ecology 51 (3): 408–18. doi:10.2307/1935376.

Wickham, H., D. Cook, H. Hofmann, and Andreas Buja. 2010. “Graphical Inference for Infovis.” IEEE Transactions on Visualization and Computer Graphics 16 (6): 973–79. doi:10.1109/TVCG.2010.161.