Code
=gam(y~s(tt,fx=FALSE,k=6,sp=3.5))
out=gam(y~s(tt,fx=FALSE,k=6)) out
So far, we have considered the modelling of a response variable
In this chapter, we generalise the modelling to describe the dependence of the response variable
Just as with a generalised linear model, the general additive model relates a continuous or discrete response variable
Random part: The probability (mass or density) function of
Systematic part: This is a non-linear predictor equation:
Link function: This is a one-to-one function providing the link between the predictor equation
Here,
The spline theory in the previous chapters has assumed Gaussian (normally distributed) data and the identity link function. For non-Gaussian data and/or a non-identity link function, we need to replace the penalised least-squares criterion of Equation 4.2 with a penalised deviance:
When there are several smooth terms of order
Fitting smoothing splines is straightforward in practice using R. At the beginning of each R session, the first step is to load the package
The main command is
The syntax of the
=gam(y~s(tt,fx=FALSE,k=6,sp=3.5))
out=gam(y~s(tt,fx=FALSE,k=6)) out
fits a cubic smoothing spline to the dependent variable
In each of the above examples, output from
Thus, if
To plot a smoothing spline:
For example, suppose
# observation times
=1:20
tt# compute a dense set of numbers between 0 and 21: 0,0.1,...,20.9,21.0
= (0:210)/10
ttnew # compute predicted values at each of 0,0.1,...,20.9,21.0
=predict.gam(out,newdata=list(tt=ttnew))
pred# plot the original data
plot(tt,y)
# superimpose the smoothing spline
lines(ttnew,pred)
The textbook Elements of Statistical Learning, by Hastie, Tibshirani and Friedman (2nd Edn, 2011), refers to a case–control study of coronary heart disease (CHD) in South Africa.
<- read.table("https://richardpmann.com/MATH5824/Datasets/SAheart.txt", sep=",",head=T,row.names=1) hr
Variable | Description |
---|---|
cumulative tobacco consumption (kg) | |
family history of heart disease (Present, Absent) | |
age at onset of the disease (years) | |
case–control status (1 |
The dependent variable in our models will be
We can examine CHD in relation to tobacco consumption, age and family history of heart disease, with the following R commands:
# make variables in dataframe hr directly available
attach(hr)
# fit a generalised linear model with chd as dependent variable
= glm(formula='chd~tobacco+age+famhist',family='binomial')
glm1 # examine the results
summary(glm1)
Call:
glm(formula = "chd~tobacco+age+famhist", family = "binomial")
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.620593 0.444576 -8.144 3.83e-16 ***
tobacco 0.083004 0.025712 3.228 0.00125 **
age 0.048812 0.009452 5.164 2.42e-07 ***
famhistPresent 0.974791 0.220023 4.430 9.41e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 495.39 on 458 degrees of freedom
AIC: 503.39
Number of Fisher Scoring iterations: 4
All variables in this model are statistically significant: disease is positively related to the amount of tobacco consumed, age and family history of heart disease. However, this model assumes that the logit of the probability of disease is linearly related to both tobacco consumption and age (logit being the default link for Binomial). We can explore more flexible tobacco-consumption and age trends with the following generalised additive model:
library(mgcv)
Loading required package: nlme
This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
= gam(chd ~ s(tobacco,k=20)+s(age,k=20)+famhist, family='binomial')
gam1 summary.gam(gam1)
Family: binomial
Link function: logit
Formula:
chd ~ s(tobacco, k = 20) + s(age, k = 20) + famhist
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2379 0.1631 -7.592 3.15e-14 ***
famhistPresent 0.9628 0.2233 4.311 1.62e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(tobacco) 6.080 7.573 17.89 0.0179 *
s(age) 1.002 1.003 24.11 9.53e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.212 Deviance explained = 19.1%
UBRE = 0.083268 Scale est. = 1 n = 462
Here, the
These results show that the fitted smoothing spline of the dependence of CHD on tobacco consumption has an effective degrees of freedom (edf) of 6.080, while the dependence on age has edf of only 1.002, implying an almost linear age-dependence because a linear age term would have exactly 1 degree of freedom. See Section Section 5.5 for further details on the calculation of edf.
The significance of each of these smooth terms is given by the
The following R code was used to plot the fitted smooth functions of age and tobacco consumption:
# for larger axis labels
par(cex.lab=1.6)
# create synthetic data set for ages 15 to 65 having consumed
# no tobacco and with no family history of heart disease
= data.frame(age = seq(from=15,to=65,by=0.1),tobacco=0,
newdat1 famhist="Absent")
# predict logit probability of CHD in these synthetic individuals
= predict.gam(gam1,newdata=newdat1)
pred1 # plot the predicted logit probability of CHD by age in these
# synthetic individuals
plot(newdat1$age,pred1,xlab="Age",
ylab="Predicted logit probability of CHD", type="l")
# create synthetic data set for age 40 having consumed tobacco
# ranging from 0 to 32kg and with no family history of heart disease
= data.frame(age = 40,tobacco=seq(from=0,to=32,by=0.1),
newdat2 famhist="Absent")
# predict logit probability of CHD in these synthetic individuals
= predict.gam(gam1,newdata=newdat2)
pred2 # plot the predicted logit probability of CHD by tobacco in these imaginary individuals
plot(newdat2$tobacco,pred2,xlab="Tobacco consumption",
ylab="logit probability of CHD", type="l")
The predicted dependence of the logit probability of CHD on smooth functions of age and tobacco consumption is shown above. We see an almost linear predicted age-dependence, in agreement with its edf. The curious predicted dependence on tobacco consumption might be explained by other factors correlated with tobacco consumption.