12.5 Nonlinear Modeling

In practice the linearity assumption is almost never met and the regression function \[y=f(\mathbf{x})=\text{E}[Y\mid \mathbf{X}=\mathbf{x}]\] has to be approximated by some other technique. Or does it?

The linearity assumption is often “good enough” even when it is not strictly met; coupled with the convenience of the linear model and its many extensions, this means it is rarely a waste of time to give that approach a try.

When heavier machinery is required, it pays to consider the following OLS generalizations, which offer a lot of flexibility without sacrificing ease of interpretability, before jumping to so-called black box models (SVM, ANN, ensemble learning, etc.) of Module 13:

  • curve fitting (polynomial regression, step functions, splines, etc.);

  • local regression methods, or

  • generalized additive models.

12.5.1 Basis Function Models

If we have reason to suspect that the response \(Y\) is not a linear combination of the predictors, we might benefit from using a derived set of predictors (see [3, Sec. 7.3]).

Polynomial Regression

We can extend the simple linear model \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\), \(i=1,\ldots, N\), by allowing for polynomial basis terms in the regression function: \[y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\cdots+\beta_dx_i^d+\varepsilon_i,\quad i=1,\ldots, N.\] The regression function is non-linear in terms of the observations \(x_i\), but it is linear in terms of the coefficients \(\beta_j\) (or in terms of the predictors \(\{x,x^2,\ldots,x^d\}\)). We thus create new variables \(X_1=X\), \(X_2=X^2\), and so on, and estimate the regression function \(y=f(\mathbf{x})\) via \(\hat{f}(\mathbf{x})=\mathbf{x}^{\!\top}\mathbf{\hat{\beta}}\), where the coefficients \(\mathbf{\hat{\beta}}\) are learned using the training set \(\text{Tr}\).

Typically, the coefficient values are of little interest – it is the predictions \(\hat{f}(\mathbf{\tilde{x}})\) that are sought.

It is easy to obtain an estimate of \(\text{Var}(\hat{f}(\mathbf{\tilde{x}}))\) since \(\hat{f}(\mathbf{\tilde{x}})\) is linear in the coefficients \(\hat{\beta}_i\), \(i=0,\ldots, d\): \[\begin{aligned} \text{Var}(\hat{f}(\mathbf{\tilde{x}}))&=\text{Var}(\mathbf{\tilde{x}}^{\!\top}\mathbf{\hat{\beta}})=\sum_{i,j=0}^d\text{Cov}(\hat{\beta}_i\tilde{x}_i,\hat{\beta}_j\tilde{x}_j)\\&=\sum_{i,j=0}^d\tilde{x}_i\tilde{x}_j\text{Cov}(\hat{\beta}_i,\hat{\beta}_j)=\mathbf{\tilde{x}}^{\!\top}\text{Cov}(\mathbf{\hat{\beta}})\mathbf{\tilde{x}}\\&=\sigma^2\mathbf{\tilde{x}}^{\!\top}(\mathbf{\mathbf{X}^{\!\top}\mathbf{X}})^{-1}\mathbf{\tilde{x}}.\end{aligned}\] The estimated variance of the approximation at \(\mathbf{\tilde{x}}\) is thus \[\hat{\text{Var}}(\hat{f}(\mathbf{\tilde{x}}))= \frac{\text{SSRes}}{N-d-1}\mathbf{\tilde{x}}^{\!\top}(\mathbf{\mathbf{X}^{\!\top}\mathbf{X}})^{-1}\mathbf{\tilde{x}}= \frac{\|\mathbf{Y}-\mathbf{X}\mathbf{\hat{\beta}}\|_2^2}{N-d-1}\mathbf{\tilde{x}}^{\!\top}(\mathbf{\mathbf{X}^{\!\top}\mathbf{X}})^{-1}\mathbf{\tilde{x}},\] with \(\text{se}(\hat{f}(\mathbf{\tilde{x}}))=\sqrt{\hat{\text{Var}}(\hat{f}(\mathbf{\tilde{x}}))}\), so that \[\hat{f}(\mathbf{\tilde{x}})\pm 2 \text{se}(\hat{f}(\mathbf{\tilde{x}}))\] constitutes an approximate 95% \(\text{C.I.}\) for \(f(\mathbf{\tilde{x}})\), assuming normality of the error terms.
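The computation can be verified numerically; here is a minimal sketch (assuming the gapminder.2011 data frame used below is available), which builds a raw degree-4 polynomial fit of life expectancy against fertility and computes the standard error of the fit at the (arbitrary) new point fertility = 3, which should match the output of predict():

fit <- lm(life_expectancy ~ poly(fertility, 4, raw = TRUE), data = gapminder.2011)
X <- model.matrix(fit)                        # N x (d+1) design matrix
x.tilde <- c(1, 3, 3^2, 3^3, 3^4)             # new point: fertility = 3
sigma2.hat <- sum(residuals(fit)^2) / fit$df.residual   # SSRes/(N-d-1)
se.fit <- sqrt(sigma2.hat * t(x.tilde) %*% solve(t(X) %*% X) %*% x.tilde)
c(estimate = sum(coef(fit) * x.tilde), se = se.fit)
# compare with: predict(fit, data.frame(fertility = 3), se.fit = TRUE)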

Gapminder Example

The charts below show polynomial regressions (\(d=4\)) and confidence intervals for life expectancy against 4 different predictors in the 2011 Gapminder data (assuming that the training set \(\text{Tr}\) is the entire dataset).

plot1 <- ggplot2::ggplot(gapminder.2011, ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) +
  ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) + 
  ggplot2::ggtitle("Polynomial regression - d=4") + 
  ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011, ggplot2::aes(x=infant_mortality, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) +
  ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) + 
  ggplot2::ggtitle("Polynomial regression - d=4") + 
  ggplot2::theme_bw()

plot3 <- gapminder.2011 |> dplyr::mutate(lgdppc=log(gdp/population)) |> 
  ggplot2::ggplot(ggplot2::aes(x=lgdppc, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) +
  ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) + 
  ggplot2::ggtitle("Polynomial regression - d=4") + 
  ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011, ggplot2::aes(x=gdp, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) +
  ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) + 
  ggplot2::ggtitle("Polynomial regression - d=4") + 
  ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

In this example, we picked \(d=4\). How do we select \(d\), in general? We can either pick a reasonably small \(d\) (often below 4) or use cross-validation to select the \(d\) that minimizes the estimated \(\text{MSE}_\text{Te}\).
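Below is a minimal sketch of such a cross-validated selection (assuming the gapminder.2011 data frame is available); we keep only the complete cases for the two variables involved and estimate \(\text{MSE}_\text{Te}\) by 10-fold CV for \(d=1,\ldots,8\):

d2011 <- na.omit(gapminder.2011[, c("fertility", "life_expectancy")])
set.seed(0)   # for replicability
K.folds <- 10
folds <- sample(rep(1:K.folds, length.out = nrow(d2011)))
cv.mse <- sapply(1:8, function(d) {          # candidate degrees d = 1,...,8
  mean(sapply(1:K.folds, function(k) {
    fit <- lm(life_expectancy ~ poly(fertility, d), data = d2011[folds != k, ])
    pred <- predict(fit, newdata = d2011[folds == k, ])
    mean((d2011$life_expectancy[folds == k] - pred)^2)
  }))
})
which.min(cv.mse)    # degree minimizing the estimated MSE_Te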

Note that it is easy to incorporate more than one predictor and interaction terms into the model.

The nature of polynomials (\(|f(\mathbf{x})|\to \infty\) when \(\|\mathbf{x}\|\to\infty\)) is such that tail behaviour is usually quite horrible (look at the bottom-right example above). Consequently, polynomial regression should be used very carefully, staying within the domain and making sure to centre the predictors to reduce variance inflation.
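The effect of centring can be illustrated quickly (a sketch, assuming gapminder.2011 is available): the correlation between the linear and quadratic terms drops once the predictor is centred, which is what tames the variance inflation.

fert <- gapminder.2011$fertility
fert.c <- fert - mean(fert, na.rm = TRUE)     # centred predictor
c(raw = cor(fert, fert^2, use = "complete.obs"),
  centred = cor(fert.c, fert.c^2, use = "complete.obs"))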

Step Functions

Polynomial regression is an attractive approach because of the ease with which we can use the apparatus of OLS, but the elephant in the room is that we are imposing a global structure on the non-linear function \(y=f(\mathbf{x})\), and that cannot always be justified.

Step functions can be used to keep things “local”. Let \(c_i\), \(i=1,\ldots, K\) lie in \(\text{range}(X)\) and consider the following \(K+1\) new predictors: \[\begin{aligned} C_0(X)&=\mathcal{I}(X<c_1) \\ C_i(X)&=\mathcal{I}(c_i\leq X<c_{i+1}),\ i=1,\ldots, K-1 \\ C_K(X)&=\mathcal{I}(c_K\leq X),\end{aligned}\] where \(\mathcal{I}\) is the indicator function \[\mathcal{I}(\alpha)=\begin{cases}0, & \text{$\alpha$ is false} \\ 1, & \text{$\alpha$ is true.} \end{cases}\] For any \(X\), \(C_0(X)+C_1(X)+\cdots+C_K(X)=1\), since \(X\) lies in exactly one of the intervals \[(-\infty,c_1),\ [c_1,c_2),\ \cdots,\ [c_{K-1},c_K) ,\ [c_K,\infty).\] The step function regression model is \[Y_i=\beta_0+\beta_1C_1(X_i)+\cdots+\beta_KC_K(X_i)+\varepsilon_i,\quad i=1,\ldots,N;\] it can also be obtained using the OLS framework.

For a given \(X\), at most one of \(C_1(X),\ldots,C_K(X)\) is \(\neq 0\); thus, when \(X<c_1\), \(C_j(X)=0\) for all \(j=1,\ldots, K\), and so \[\beta_0=\text{Avg}[Y\mid X<c_1].\] For \(X\in [c_j,c_{j+1})\), \(\hat{y}=\beta_0+\beta_j\), so \(\beta_j\) represents the average increase in \(Y\) for \([c_j,c_{j+1})\) relative to \((-\infty,c_1)\).

The only major challenge with step function regression is that there is no easy way to find the number \(K\) and select the position of the breakpoints \(c_1,\ldots,c_K\), unless there are natural gaps in the predictors. We will discuss a strategy to determine the number and location of knots when we discuss classification and regression trees in Module 13.

We did not discuss how step function regression or polynomial regression could be achieved in higher dimensions, but the principle remains the same (except that the number of parameters increases drastically, which can create some overfitting issues).
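In R, the interval indicators do not have to be built by hand: cut() creates them automatically. The following sketch (assuming gapminder.2011 is available) fits the same type of model as the first Gapminder example below, up to the choice of baseline interval:

model.cut <- lm(life_expectancy ~ cut(fertility, breaks = c(-Inf, 2, 4, 6, Inf),
                                      right = FALSE),   # [c_i, c_{i+1}) intervals
                data = gapminder.2011)
summary(model.cut)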

Gapminder Example

The charts below show step function regressions and confidence intervals for life expectancy against 4 different predictors in the 2011 Gapminder data (assuming that the training set \(\text{Tr}\) is the entire dataset).

We start by building a step function model with \(K=3\) knots for life expectancy against fertility, using the (arbitrary) knot values at 2, 4, and 6:

gapminder.2011 <- gapminder.2011 |> 
  dplyr::mutate(fert0=I(fertility<2),
         fert1=I(2<=fertility & fertility<4),
         fert2=I(4<=fertility & fertility<6),
         fert3=I(6<=fertility))
model.sf.1 = lm(life_expectancy~fert0+fert1+fert2,data=gapminder.2011)
summary(model.sf.1)

Call:
lm(formula = life_expectancy ~ fert0 + fert1 + fert2, data = gapminder.2011)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.0485  -2.8300   0.2515   3.9669  12.1515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60.5444     1.9554  30.963  < 2e-16 ***
fert0TRUE    16.8106     2.0969   8.017 2.04e-13 ***
fert1TRUE    10.2040     2.0845   4.895 2.36e-06 ***
fert2TRUE     0.7814     2.2212   0.352    0.725    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.866 on 162 degrees of freedom
Multiple R-squared:  0.5308,    Adjusted R-squared:  0.5221 
F-statistic: 61.09 on 3 and 162 DF,  p-value: < 2.2e-16

The corresponding step function is defined with:

g.1 <- function(x){
    model.sf.1$coefficients[1] +
    model.sf.1$coefficients[2]*I(x<2) +
    model.sf.1$coefficients[3]*I(2<=x & x<4) +
    model.sf.1$coefficients[4]*I(4<=x & x<6)
  }

We next build a step function model with \(K=4\) knots for life expectancy against infant mortality, using the (arbitrary) knot values 10, 20, 40, 70:

gapminder.2011 <- gapminder.2011 |> 
  dplyr::mutate(inf0=I(infant_mortality<10),
         inf1=I(10<=infant_mortality & infant_mortality<20),
         inf2=I(20<=infant_mortality & infant_mortality<40),
         inf3=I(40<=infant_mortality & infant_mortality<70),
         inf4=I(70<=infant_mortality))
model.sf.2 = lm(life_expectancy~inf0+inf1+inf2+inf3,data=gapminder.2011)
summary(model.sf.2)

g.2 <- function(x){
    model.sf.2$coefficients[1] +
    model.sf.2$coefficients[2]*I(x<10) +
    model.sf.2$coefficients[3]*I(10<=x & x<20) +
    model.sf.2$coefficients[4]*I(20<=x & x<40) +
    model.sf.2$coefficients[5]*I(40<=x & x<70)
  }

Call:
lm(formula = life_expectancy ~ inf0 + inf1 + inf2 + inf3, data = gapminder.2011)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.9800  -2.3800   0.3725   2.7622   9.3200 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   56.675      1.207  46.939  < 2e-16 ***
inf0TRUE      22.017      1.340  16.437  < 2e-16 ***
inf1TRUE      17.963      1.390  12.928  < 2e-16 ***
inf2TRUE      11.782      1.429   8.247 5.46e-14 ***
inf3TRUE       5.305      1.399   3.791 0.000211 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.183 on 161 degrees of freedom
Multiple R-squared:  0.763, Adjusted R-squared:  0.7571 
F-statistic: 129.5 on 4 and 161 DF,  p-value: < 2.2e-16

We next build a step function model with \(K=3\) knots for life expectancy against the log of gdp per capita, using the (arbitrary) knot values at 6, 8, 10:

gapminder.2011 <- gapminder.2011 |> dplyr::mutate(lgdppc=log(gdp/population)) |> 
  dplyr::mutate(lgdppc0=I(lgdppc<6),
         lgdppc1=I(6<=lgdppc & lgdppc<8),
         lgdppc2=I(8<=lgdppc & lgdppc<10),
         lgdppc3=I(10<=lgdppc))
model.sf.3 = lm(life_expectancy~lgdppc0+lgdppc1+lgdppc2,data=gapminder.2011)
summary(model.sf.3)

g.3 <- function(x){
    model.sf.3$coefficients[1] +
    model.sf.3$coefficients[2]*I(x<6) +
    model.sf.3$coefficients[3]*I(6<=x & x<8) +
    model.sf.3$coefficients[4]*I(8<=x & x<10)
  }

Call:
lm(formula = life_expectancy ~ lgdppc0 + lgdppc1 + lgdppc2, data = gapminder.2011)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.771  -1.831   0.550   3.691   9.789 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   81.100      1.270  63.857  < 2e-16 ***
lgdppc0TRUE  -20.788      1.708 -12.174  < 2e-16 ***
lgdppc1TRUE  -12.629      1.453  -8.692 3.75e-15 ***
lgdppc2TRUE   -6.012      1.509  -3.984 0.000102 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.82 on 162 degrees of freedom
Multiple R-squared:  0.5382,    Adjusted R-squared:  0.5296 
F-statistic: 62.93 on 3 and 162 DF,  p-value: < 2.2e-16

Finally, we build a step function model with \(K=6\) knots for life expectancy against the log of gdp per capita, using the (arbitrary) knot values at 5, 6, 7, 8, 9, 10:

gapminder.2011 <- gapminder.2011 |> dplyr::mutate(lgdppc=log(gdp/population)) |> 
  dplyr::mutate(lgdppc0=I(lgdppc<5),
         lgdppc1=I(5<=lgdppc & lgdppc<6),
         lgdppc2=I(6<=lgdppc & lgdppc<7),
         lgdppc3=I(7<=lgdppc & lgdppc<8),
         lgdppc4=I(8<=lgdppc & lgdppc<9), 
         lgdppc5=I(9<=lgdppc & lgdppc<10), 
         lgdppc6=I(10<=lgdppc))
model.sf.4 = lm(life_expectancy~lgdppc0+lgdppc1+lgdppc2+lgdppc3+lgdppc4+lgdppc5,data=gapminder.2011)
summary(model.sf.4)

g.4 <- function(x){
    model.sf.4$coefficients[1] +
    model.sf.4$coefficients[2]*I(x<5) + 
    model.sf.4$coefficients[3]*I(5<=x & x<6) + 
    model.sf.4$coefficients[4]*I(6<=x & x<7) + 
    model.sf.4$coefficients[5]*I(7<=x & x<8) +
    model.sf.4$coefficients[6]*I(8<=x & x<9) +
    model.sf.4$coefficients[7]*I(9<=x & x<10)
  }

Call:
lm(formula = life_expectancy ~ lgdppc0 + lgdppc1 + lgdppc2 + 
    lgdppc3 + lgdppc4 + lgdppc5, data = gapminder.2011)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.8250  -1.3500   0.5964   3.1841  12.2929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   81.100      1.204  67.367  < 2e-16 ***
lgdppc0TRUE  -21.300      4.082  -5.217 5.59e-07 ***
lgdppc1TRUE  -20.746      1.648 -12.585  < 2e-16 ***
lgdppc2TRUE  -15.993      1.593 -10.042  < 2e-16 ***
lgdppc3TRUE  -10.275      1.487  -6.912 1.10e-10 ***
lgdppc4TRUE   -7.187      1.559  -4.610 8.24e-06 ***
lgdppc5TRUE   -4.190      1.724  -2.431   0.0162 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.517 on 159 degrees of freedom
Multiple R-squared:  0.5927,    Adjusted R-squared:  0.5774 
F-statistic: 38.57 on 6 and 159 DF,  p-value: < 2.2e-16

The step functions in each of the 4 cases are displayed below:

plot1 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::stat_function(fun=g.1) + 
  ggplot2::ggtitle("Step Function Regression - K=3") + 
  ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=infant_mortality, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::stat_function(fun=g.2) + 
  ggplot2::ggtitle("Step Function Regression - K=4") + 
  ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=lgdppc, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::stat_function(fun=g.3) + 
  ggplot2::ggtitle("Step Function Regression - K=3") + 
  ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=lgdppc, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::stat_function(fun=g.4) + 
  ggplot2::ggtitle("Step Function Regression - K=6") + 
  ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

12.5.2 Splines

We can combine polynomial regression and step functions to obtain a more flexible curve fitting approach.

Regression Splines

Instead of fitting a polynomial over the entire range of the predictor \(X\), we use different polynomials (of degree up to \(3\), usually) in regions \(R_k\) (defined by knots in the 1-dimensional case), such as: \[Y_i=\begin{cases} \beta_{0,1}+\beta_{1,1}X_{i}+\beta_{2,1}X_i^2+\beta_{3,1}X_i^3 +\varepsilon_i, \quad\text{if $X_i\in R_1$} \\ \beta_{0,2}+\beta_{1,2}X_{i}+\beta_{2,2}X_i^2+\beta_{3,2}X_i^3 +\delta_i, \quad\text{if $X_i\in R_2$} \end{cases}\] Various constraints can be imposed on the polynomials:

  • none;

  • continuity at each region’s borders;

  • \(C^1\) (continuously differentiable) at each region’s borders;

  • etc.

 
In a sense to be defined shortly, splines have the “maximum” amount of continuity. Note that using more regions leads to a more flexible fit.

In what follows, we assume that the domain is split into \(K+1\) regions, bounded by knots (there are thus \(K\) such knots). If we impose no restriction on the functions, we are trying to fit \(K+1\) piecewise cubic functions to the data; each polynomial has 4 parameters to be estimated, leading to \(4(K+1)\) effective parameters.

If we impose a continuous fit (the polynomials must agree at the knots), we reduce the number of effective parameters. We can also require a continuously differentiable fit (the derivatives must also agree at the knots), further reducing the number of effective parameters.

A cubic spline (with only \(K+4\) parameters to fit) is a regression spline which is \(C^2\) on its domain.

Let \(\xi\) be a knot and \(x\) be a predictor value. The positive part function is defined by \[w_+=\begin{cases}w & \text{if $w>0$} \\ 0 & \text{else}\end{cases}\] Formally, the linear spline requires \(\xi_1,\ldots,\xi_K\) knots and has \(K+1\) effective parameters. The model can be expressed simply using positive parts: \[\begin{aligned} y_i&=\beta_0+\beta_1x_i+\beta_2(x_i-\xi_1)_++\cdots+\beta_{K+1}(x_i-\xi_K)_++\varepsilon_i;\end{aligned}\] the cubic spline is: \[\begin{aligned} y_i&=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3 +\beta_4(x_i-\xi_1)_+^3+\cdots+\beta_{K+3}(x_i-\xi_K)_+^3+\varepsilon_i,\end{aligned}\] and the natural cubic spline is a cubic spline between \(\xi_1\) and \(\xi_K\), with linear extrapolation beyond \(\xi_1\) and \(\xi_K\); this adds \(4\) extra constraints to the cubic spline and allows for more knots while keeping the number of effective parameters identical to that of the linear spline.
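The truncated-power form above can be checked directly against splines::bs(); a minimal sketch (assuming gapminder.2011 is available) fits a cubic spline with a single knot at \(\xi=3\) both ways; the two bases span the same space, so the fitted values should coincide up to numerical tolerance:

pos <- function(w) pmax(w, 0)                 # positive part w_+
xi <- 3
fit.tp <- lm(life_expectancy ~ fertility + I(fertility^2) + I(fertility^3) +
               I(pos(fertility - xi)^3), data = gapminder.2011)
fit.bs <- lm(life_expectancy ~ splines::bs(fertility, knots = xi),
             data = gapminder.2011)
all.equal(unname(fitted(fit.tp)), unname(fitted(fit.bs)))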

In all instances, the machinery of OLS remains available: predictions, diagnostics, remedial measures, confidence intervals, and extension to logistic regression, as needed.


Figure 12.12: Various splines with one knot on a dataset. [3]

Gapminder Example

The charts below show cubic splines for life expectancy against fertility in the 2011 Gapminder data (assuming that the training set \(\text{Tr}\) is the entire dataset).

Cubic splines are modeled using the splines package bs() function. In theory, we place more knots where the underlying function is believed to vary most rapidly, and fewer where it is more stable. In practice, the knots are placed uniformly at quantiles of the predictor variable \(X\), based on their number.

The syntax for the OLS model formula in R follows the form

response ~ splines::bs(predictor, df)

where the degrees of freedom df are linked to the number of parameters to estimate (in the case of cubic spline, \(\text{df}=K+3\)).

We start by building a cubic spline with \(K=0\) knots (so \(K+3=3\) degrees of freedom).

(fm0 <- lm(life_expectancy ~ splines::bs(fertility, df = 3)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 3))

Coefficients:
                    (Intercept)  splines::bs(fertility, df = 3)1  
                          79.28                           -11.47  
splines::bs(fertility, df = 3)2  splines::bs(fertility, df = 3)3  
                         -27.41                           -16.65  

We can also build cubic splines with \(K=1\) knot:

(fm1 <- lm(life_expectancy ~ splines::bs(fertility, df = 4)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 4))

Coefficients:
                    (Intercept)  splines::bs(fertility, df = 4)1  
                         76.772                            3.042  
splines::bs(fertility, df = 4)2  splines::bs(fertility, df = 4)3  
                        -17.886                          -17.515  
splines::bs(fertility, df = 4)4  
                        -16.842  

Similarly, for \(K=2\) knots:

(fm2 <- lm(life_expectancy ~ splines::bs(fertility, df = 5)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 5))

Coefficients:
                    (Intercept)  splines::bs(fertility, df = 5)1  
                         78.395                           -1.110  
splines::bs(fertility, df = 5)2  splines::bs(fertility, df = 5)3  
                         -1.318                          -22.715  
splines::bs(fertility, df = 5)4  splines::bs(fertility, df = 5)5  
                        -16.188                          -19.721  

And for \(K=10\) knots:

(fm10 <- lm(life_expectancy ~ splines::bs(fertility, df = 13)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 13))

Coefficients:
                      (Intercept)   splines::bs(fertility, df = 13)1  
                          79.9713                             0.2178  
 splines::bs(fertility, df = 13)2   splines::bs(fertility, df = 13)3  
                          -4.1261                            -2.5263  
 splines::bs(fertility, df = 13)4   splines::bs(fertility, df = 13)5  
                          -3.3066                            -0.2613  
 splines::bs(fertility, df = 13)6   splines::bs(fertility, df = 13)7  
                          -9.6851                            -5.9818  
 splines::bs(fertility, df = 13)8   splines::bs(fertility, df = 13)9  
                         -12.6963                           -17.6809  
splines::bs(fertility, df = 13)10  splines::bs(fertility, df = 13)11  
                         -14.3569                           -28.2957  
splines::bs(fertility, df = 13)12  splines::bs(fertility, df = 13)13  
                         -10.4471                           -21.6721  

The knot location for \(K=1\) is given by:

test1<-eval(attr(fm1$terms, "predvars"))
(g1<-as.numeric(attr(test1[[2]],"knots")))
[1] 2.42

For \(K=2\) knots:

test2<-eval(attr(fm2$terms, "predvars"))
(g2<-as.numeric(attr(test2[[2]],"knots")))
[1] 1.90 3.15

And for \(K=10\) knots:

test10<-eval(attr(fm10$terms, "predvars"))
(g10<-as.numeric(attr(test10[[2]],"knots")))
 [1] 1.45 1.53 1.83 2.00 2.31 2.53 2.93 3.64 4.73 5.09

The cubic splines are displayed below:

plot1 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::bs(x, df = 3)) +
  ggplot2::ggtitle("Cubic Spline - K=0") + 
  ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::bs(x, df = 4)) +
  ggplot2::geom_vline(xintercept = g1, colour = "deepskyblue") +
  ggplot2::ggtitle("Cubic Spline - K=1") + 
  ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::bs(x, df = 5)) +
  ggplot2::geom_vline(xintercept = g2, colour = "deepskyblue") +
  ggplot2::ggtitle("Cubic Spline - K=2") + 
  ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::bs(x, df = 13)) +
  ggplot2::geom_vline(xintercept = g10, colour = "deepskyblue") +
  ggplot2::ggtitle("Cubic Spline - K=10") + 
  ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Natural cubic splines are modeled using the splines package ns() function; the knots are, again, placed uniformly at quantiles of the predictor variable \(X\), based on their number (the knot locations are thus the same as in the cubic spline case).

The syntax for the OLS model formula in R follows the form

response ~ splines::ns(predictor, df)

where the degrees of freedom df are linked to the number of parameters to estimate (in the case of natural cubic spline, \(\text{df}=K+1\)).

We start by building a natural cubic spline with \(K=0\) knots (so \(K+1=1\) degree of freedom).

(fm0 <- lm(life_expectancy ~ splines::ns(fertility, df = 1)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 1))

Coefficients:
                   (Intercept)  splines::ns(fertility, df = 1)  
                         78.09                          -34.27  

We can also build natural cubic splines with \(K=1\) knot:

(fm1 <- lm(life_expectancy ~ splines::ns(fertility, df = 2)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 2))

Coefficients:
                    (Intercept)  splines::ns(fertility, df = 2)1  
                          80.05                           -31.54  
splines::ns(fertility, df = 2)2  
                         -16.94  

Similarly, for \(K=2\) knots:

(fm2 <- lm(life_expectancy ~ splines::ns(fertility, df = 3)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 3))

Coefficients:
                    (Intercept)  splines::ns(fertility, df = 3)1  
                          78.11                           -19.86  
splines::ns(fertility, df = 3)2  splines::ns(fertility, df = 3)3  
                         -20.04                           -16.41  

And for \(K=10\) knots:

(fm10 <- lm(life_expectancy ~ splines::ns(fertility, df = 11)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 11))

Coefficients:
                      (Intercept)   splines::ns(fertility, df = 11)1  
                          80.5276                            -3.5824  
 splines::ns(fertility, df = 11)2   splines::ns(fertility, df = 11)3  
                          -3.7199                            -0.8336  
 splines::ns(fertility, df = 11)4   splines::ns(fertility, df = 11)5  
                         -10.3832                            -6.2447  
 splines::ns(fertility, df = 11)6   splines::ns(fertility, df = 11)7  
                         -14.0084                           -16.4080  
 splines::ns(fertility, df = 11)8   splines::ns(fertility, df = 11)9  
                         -17.3403                           -21.2666  
splines::ns(fertility, df = 11)10  splines::ns(fertility, df = 11)11  
                         -24.4461                           -17.1167  

The natural cubic splines are displayed below:

plot1 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::ns(x, df = 1)) +
  ggplot2::ggtitle("Natural Cubic Spline - K=0") + 
  ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::ns(x, knots = g1, df = 2)) +
  ggplot2::geom_vline(xintercept = g1, colour = "deepskyblue") +
  ggplot2::ggtitle("Natural Cubic Spline - K=1") + 
  ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::ns(x, knots = g2, df = 3)) +
  ggplot2::geom_vline(xintercept = g2, colour = "deepskyblue") +
  ggplot2::ggtitle("Natural Cubic Spline - K=2") + 
  ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) + 
  ggplot2::geom_point(color='red', alpha=0.3) + 
  ggplot2::geom_smooth(method="lm",
   formula=  y ~ splines::ns(x, knots = g10, df = 11)) +
  ggplot2::geom_vline(xintercept = g10, colour = "deepskyblue") +
  ggplot2::ggtitle("Natural Cubic Spline - K=10") + 
  ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Do you notice any difference in the shape of the cubic splines vs. that of the natural cubic splines?

Cross-validation (again!) can be used to determine the optimal \(K\): compute the estimated prediction error for various values of \(K\) (via 10-fold CV, say), and pick the \(K^*\) that minimizes it.
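Here is a sketch of that selection for the cubic splines above, reusing the complete-case data frame d2011 (fertility and life expectancy) built in the polynomial CV sketch; note that bs() may warn when a test-fold value falls outside the training fold's boundary knots:

set.seed(0)
folds <- sample(rep(1:10, length.out = nrow(d2011)))
cv.err <- sapply(0:10, function(K) {          # K knots <=> df = K + 3
  mean(sapply(1:10, function(k) {
    fit <- lm(life_expectancy ~ splines::bs(fertility, df = K + 3),
              data = d2011[folds != k, ])
    pred <- predict(fit, newdata = d2011[folds == k, ])
    mean((d2011$life_expectancy[folds == k] - pred)^2)
  }))
})
(K.star <- (0:10)[which.min(cv.err)])         # knot count minimizing the CV error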

Regression splines often give better results than polynomial regression because they induce flexibility via a large number of knots \(K\) while keeping the polynomial degree low (\(d\leq 3\)), rather than through a high degree \(d\) (and the wild variability that such polynomials display, especially near the boundaries of the predictor’s range, as can be observed in the polynomial regression examples above).

Multivariate Adaptive Regression Splines

We can reduce the polynomial degree to \(d\leq 2\) without losing too much curve fitting accuracy by considering bases consisting of functions of the forms: \[1,(x-\xi_k)_{\pm},(x-\xi_{k_1})_{\pm}(x-\xi_{k_2})_{\pm},\] where \((x-t)_{\pm}\) is one of the two hinge functions \[\begin{aligned} (x-t)_+=\begin{cases}x-t & \text{if $x>t$} \\ 0 & \text{else}\end{cases} \\ (x-t)_-=\begin{cases}t-x & \text{if $x<t$} \\ 0 & \text{else}\end{cases} \end{aligned}\] For instance, \((x-1)_{\pm}\), \((x-1)_{+}(x-5)_+\), and \((x-1)_+(x-8)_-\) are shown below:


Figure 12.13: A few hinge functions
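These hinge functions are easily reproduced in R (a quick sketch):

hinge.p <- function(x, t) pmax(x - t, 0)   # (x - t)_+
hinge.m <- function(x, t) pmax(t - x, 0)   # (x - t)_-
curve(hinge.p(x, 1) * hinge.p(x, 5), 0, 10, lty = 3, ylab = "basis value")
curve(hinge.p(x, 1), 0, 10, add = TRUE)
curve(hinge.m(x, 1), 0, 10, add = TRUE, lty = 2)
curve(hinge.p(x, 1) * hinge.m(x, 8), 0, 10, add = TRUE, lty = 4)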

A multivariate adaptive regression spline (MARS) model is expressed as \[y_i=\sum_{k=1}^K\beta_kh_k(x_i)+\varepsilon_i,\quad i=1,\ldots,N,\] where \(h_k\) is either a constant function, a hinge function, or a product of hinge functions.

MARS iteratively adds terms to its model; once a stopping criterion is met, unwanted terms are removed. The model’s growth parallels that of the tree-based models we will discuss in Module 13, and it has the same advantage that the knots are selected automatically.

Artificial Dataset Example

Let us take a look at a synthetic dataset generated by \[y=f(x)=\frac{\sin(\pi x)}{10}-\sqrt{x}+\exp(x/10)+\varepsilon,\] where \(\varepsilon\sim \mathcal{N}(0,0.04^2)\).

set.seed(1234)  # for replicability
fx=function(x){ # true regression function
  sin(pi*x)/10-sqrt(x)+exp(x/10)
  }

x=sort(runif(50, 0, 5))   # 50 observations, uniform on [0,5]
noise=rnorm(50, 0, 0.04)
y=fx(x)+noise

plot(x, y, col=4)
x.vec=seq(0,6, length.out=100)
lines(x.vec, fx(x.vec), col="grey", lty=2)  # underlying signal

In R, we can fit the data using the mda package’s mars() function (a licensed implementation of MARS); another implementation, “enhanced adaptive regression through hinges” (EARTH), is available in the earth package.

Let us use only functions of degree 1 (linear functions and linear hinges, but no interaction terms) for the time being:

MARS.1=mda::mars(x, y, degree=1)

The output is rather lengthy and is suppressed for readability:

  • $call provides the model;

  • $fitted.values contains the estimated values \(\hat{y}_i\);

  • $residuals contains the residuals \(y_i-\hat{y}_i\), and

  • $x gives the hinge functions used in the final model.

Let’s see how good a job MARS did:

plot(x, y, col=4, main="MARS with no interaction terms")
x.vec=seq(0,6, length.out=100)
lines(x.vec, fx(x.vec), col="grey", lty=2)
points(x, MARS.1$fitted.values, col=2, pch=16) 
abline(v=MARS.1$cuts[MARS.1$selected.terms[-1]], col="light grey")

Not bad, all things considered.

The EARTH output is identical, and would be obtained thus:

EARTH.1=earth::earth(x, y, degree=1) 
summary(EARTH.1)
Call: earth(x=x, y=y, degree=1)

             coefficients
(Intercept)   -0.16254461
h(1.3341-x)    0.70785002
h(x-1.3341)   -0.05502561
h(x-2.53653)  -0.27853205
h(x-3.38547)   0.37209809

Selected 5 of 6 terms, and 1 of 1 predictors
Termination condition: RSq changed by less than 0.001 at 6 terms
Importance: x
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 0.002341396    RSS 0.07871774    GRSq 0.9759754    RSq 0.9831798

What about interaction terms? In order for MARS or EARTH to consider such terms, we must first provide a second predictor.

xnew = x*x
data = data.frame(x,xnew,y)
EARTH.2=earth::earth(y ~ x + xnew , data=data, degree=2) 
summary(EARTH.2)
Call: earth(formula=y~x+xnew, data=data, degree=2)

               coefficients
(Intercept)     -0.21273261
h(1.43112-x)     0.68806424
h(x-2.62849)    -0.43057541
h(xnew-10.446)   0.05531043

Selected 4 of 6 terms, and 2 of 2 predictors
Termination condition: RSq changed by less than 0.001 at 6 terms
Importance: x, xnew
Number of terms at each degree of interaction: 1 3 (additive model)
GCV 0.00273995    RSS 0.09437758    GRSq 0.971886    RSq 0.9798336

What does the plot look like?

plot(x, y, col=4, main="MARS with interaction terms")
x.vec=seq(0,6, length.out=100)
points(x, EARTH.2$fitted.values, col=2, pch=16) 
abline(v=EARTH.2$cuts[EARTH.2$selected.terms[-1]], col="light grey") 

EARTH.2$cuts
                      x     xnew
(Intercept)    0.000000  0.00000
h(x-1.43112)   1.431116  0.00000
h(1.43112-x)   1.431116  0.00000
h(xnew-10.446) 0.000000 10.44602
h(10.446-xnew) 0.000000 10.44602
h(x-2.62849)   2.628488  0.00000

Can you spot the non-linear components?

Housing Dataset Example

In this section, we will analyze a housing dataset related to house selling prices in Ames, Iowa (modified from [236]).

We start by reading in the data:

dat.Housing=read.csv("Data/VE_Housing.csv", header=TRUE, stringsAsFactors = TRUE)
dim(dat.Housing)
[1] 1460   81

Next, we count the number of missing values for each variable, excluding the variables that have none.

missing=attributes(which(apply(is.na(dat.Housing), 2, sum)>0))$names
apply(is.na(dat.Housing[,missing]), 2, sum)
 LotFrontage        Alley   MasVnrType   MasVnrArea     BsmtQual     BsmtCond 
         259         1369            8            8           37           37 
BsmtExposure BsmtFinType1 BsmtFinType2   Electrical  FireplaceQu   GarageType 
          38           37           38            1          690           81 
 GarageYrBlt GarageFinish   GarageQual   GarageCond       PoolQC        Fence 
          81           81           81           81         1453         1179 
 MiscFeature 
        1406 

The housing dataset thus consists of \(n=1460\) observations with \(p=79\) predictors. There are two other variables: Id and SalePrice, representing the index variable and the response variable, respectively.

Furthermore, the variables

  • LotFrontage

  • Alley

  • FireplaceQu

  • PoolQC

  • Fence, and

  • MiscFeature

all have anywhere from 259 to 1406 missing observations. The proportions of missing values in these variables are probably too high for imputation (see Module 8 for details), so we elect to remove them from further analyses.

Also, note that the remaining major missing variables are all related to Garage and Basement, with corresponding variables missing for the same houses. Given that there are other variables associated with Garage and Basement, we suspect these variables will not play a crucial role in model building, and we make the decision to remove them from further analyses.

For the remaining three variables with missing values (MasVnrType, MasVnrArea, and Electrical), the number of missing observations is so small that we could easily

  • impute these values, or

  • perform list-wise deletion.

 
For the purposes of this example, we go one step further and simply delete all columns with missing values (column-wise deletion).

dat.Housing.new = dat.Housing[,!colnames(dat.Housing)%in%missing]
dim(dat.Housing.new)
[1] 1460   62

We also remove the index variable ID:

dat.Housing.new = subset(dat.Housing.new, select = -c(Id))
dim(dat.Housing.new)
[1] 1460   61

In order to evaluate the effectiveness of the eventual model (i.e., to have good predictive power without overfitting the data), we split the Housing dataset into training and testing sets. The model is then developed using the training set (i.e., optimized using a subset of data), and then later tested for its prediction power using the testing set.

We select roughly 80% of the observations (1160) to be part of the training set:

set.seed(1234) # for replicability
n.train=1160   
ind.train=sample(1:nrow(dat.Housing.new), n.train)

The training and testing sets are thus:

dat.train=dat.Housing.new[ind.train,]
dat.test=dat.Housing.new[-ind.train,]

We train EARTH (with interactions) on the training data:

EARTH.3 <- earth::earth(SalePrice~., data=dat.train, degree=2)  
summary(EARTH.3)
Call: earth(formula=SalePrice~., data=dat.train, degree=2)

                                        coefficients
(Intercept)                                317.95604
Exterior1stBrkFace                          18.17930
FoundationPConc                             34.63490
h(14442-LotArea)                            -0.00198
h(LotArea-14442)                             0.00048
h(7-OverallQual)                           -19.21820
h(OverallQual-7)                            35.79700
h(7-OverallCond)                            -5.03664
h(2004-YearBuilt)                           -0.69748
h(YearBuilt-2004)                            6.36449
h(1056-BsmtFinSF1)                          -0.02232
h(BsmtFinSF1-1056)                           0.02788
h(2121-TotalBsmtSF)                         -0.01953
h(TotalBsmtSF-2121)                         -0.05503
h(2898-GrLivArea)                           -0.04853
h(GrLivArea-2898)                           -0.31464
h(1-Fireplaces)                             -4.60511
h(Fireplaces-1)                              5.78404
h(642-GarageArea)                           -0.02169
NeighborhoodCrawfor * h(2004-YearBuilt)      0.36780
Condition1Norm * h(GrLivArea-2898)           0.15117
h(YearBuilt-2004) * SaleTypeNew              8.47065
ExterQualGd * h(GarageArea-642)              0.09463
FoundationPConc * h(2898-GrLivArea)         -0.01734
h(GrLivArea-2898) * SaleTypeWD               0.26916
h(OverallQual-7) * h(TotalBsmtSF-1777)       0.03112
h(OverallQual-7) * h(1777-TotalBsmtSF)      -0.03789
h(7-OverallQual) * h(2898-GrLivArea)         0.00738
h(OverallQual-7) * h(2898-GrLivArea)        -0.01228
h(OverallQual-7) * h(TotRmsAbvGrd-9)        27.01137
h(OverallQual-7) * h(9-TotRmsAbvGrd)         7.13483
h(7-OverallCond) * h(X1stFlrSF-1269)        -0.00974
h(7-OverallCond) * h(WoodDeckSF-316)        -0.03691
h(7-OverallCond) * h(316-WoodDeckSF)        -0.01602
h(2005-YearBuilt) * h(1056-BsmtFinSF1)       0.00030
h(YearBuilt-2005) * h(1056-BsmtFinSF1)      -0.00928

Selected 36 of 39 terms, and 19 of 188 predictors
Termination condition: RSq changed by less than 0.001 at 39 terms
Importance: OverallQual, GrLivArea, YearBuilt, SaleTypeWD, BsmtFinSF1, ...
Number of terms at each degree of interaction: 1 18 17
GCV 396.2527    RSS 392191.8    GRSq 0.9377484    RSq 0.9467931

and predict SalePrice on the testing data:

yhat.EARTH.3=predict(EARTH.3, dat.test)

We can evaluate the quality of the predictions on the testing set by computing \(\text{MSE}_{\text{Te}}\) directly:

mean((yhat.EARTH.3- dat.test$SalePrice)^2)
[1] 628.3848

but this numerical value is more or less useless on its own. We get a better sense for the quality of the prediction on the testing set by comparing the actual SalePrice values to the EARTH predicted SalePrice values:

xlimit=ylimit=c(0,600)
plot(NA, col=2, xlim=xlimit, ylim=ylimit, 
     ylab="Predicted Price ($1,000)", 
     xlab="Actual Price ($1,000)", 
     main="MARS/EARTH SalePrice predictions (w column-wise deletion)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(dat.test$SalePrice, yhat.EARTH.3, col=2)

What do you think? Is the model likely to prove useful?
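One quick way to give \(\text{MSE}_{\text{Te}}\) some context (a sketch, using the objects defined above): put the error back on the original price scale (here in $1,000s) and compare it with the spread of the response and with the test error of a trivial model that always predicts the mean training price.

sqrt(mean((yhat.EARTH.3 - dat.test$SalePrice)^2))    # RMSE of the EARTH model
sd(dat.test$SalePrice)                               # spread of the response
sqrt(mean((mean(dat.train$SalePrice) - dat.test$SalePrice)^2))  # RMSE, mean-only model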

Smoothing Splines

Given a training set \(\text{Tr}\) with \(N\) observations, we have seen that regression splines use the following approach:

  1. identify \(K\) knots \(\xi_1, \ldots, \xi_K\);

  2. produce some basis functions \(\{b_1(x),\ldots,b_{K}(x)\}\), and

  3. use OLS to estimate the coefficients of \[Y_i=\beta_0+\beta_1b_1(X_i)+\cdots + \beta_Kb_K(X_i)+\varepsilon_i,\quad i=1,\ldots, N.\]

But we can use another mathematical approach in order to produce a spline. In general, we need to find a function \(g\) that provides a good fit for the available data; in other words, we are looking for a \(g\) for which \[\text{SSRes}=\sum_{i=1}^N(y_i-g(x_i))^2\quad\mbox{is "small".}\] But we also need \(g\) to be constrained, otherwise any smooth function interpolating \((x_i,y_i)\), \(i=1,\ldots,N\), would yield \(\text{SSRes}=0\), at the cost of severe overfitting and loss of interpretability, as in Figure 12.14.


Figure 12.14: A spline with few constraints overfits the data (right).

The flip side, of course, is that too many constraints can result in the data being underfit.

The smoothing spline approach seeks to solve the following problem: \[g_{\lambda}=\arg\min_h\Bigg\{\underbrace{\sum_{i=1}^N(y_i-h(x_i))^2}_{\text{SSRes loss}}+\lambda\underbrace{\int_{\Omega(X)}\!\!\!\!\left[h''(t)\right]^2\, dt}_{\text{penalty term}}\Bigg\},\] where \(\lambda\geq 0\) is a tuning parameter and \(\Omega(X)\) represents the range of the predictor \(X\).

The penalty term measures the roughness of the spline function \(h\); if \(h\) is quite “wiggly”, the penalty will be (relatively) large, and vice-versa (and similarly for \(g\)).

When \(\lambda\to 0\), the penalty term has little effect, so we would expect \(g_{\lambda}\) to be “jumpy” in such cases and that it would interpolate the observations exactly, leading to overfitting.

When \(\lambda\to\infty\), the penalty term dominates and \(g_{\lambda}\) is a function for which \(\int[g_{\lambda}''(t)]^2dt\to 0\) over \(\Omega(X)\), so \(g_{\lambda}\) approaches the linear OLS solution over \(\Omega(X)\), leading to underfitting.

As we have seen over and over again, the tuning parameter \(\lambda\) controls the bias-variance trade-off, expressed, in this case, as a battle between rigidity and model complexity.

The optimal smoothing spline \(g_{\lambda}\) is a natural cubic spline with a knot at every data point \(\xi_i=x_i\), \(i=1,\ldots, N\), with continuous \(0\)th, \(1\)st, \(2\)nd derivatives throughout the range \(\Omega(X)=[\min \xi_i,\max \xi_i]\), and is linear outside \(\Omega(X)\), but, importantly, it is not the one that would be obtained from building a regression spline, as it must also depend on the tuning parameter \(\lambda\).

What is the best choice for \(\lambda\)? At first glance, this would seem to be another job for cross-validation, but there is another option: we can specify the smoothing spline through the effective degrees of freedom, which decrease from \(N\) to \(2\) as \(\lambda\) goes from \(0\) to \(\infty\) (note, however, that R’s smooth.spline() uses a different parameterization).

Gapminder Example

The charts below show the smoothing spline for life expectancy against fertility in the 2011 Gapminder data, for 4 different smoothing parameter values, using the stats package’s smooth.spline() function. Note that the entire set is used as training data.

x=gapminder.2011$fertility
y=gapminder.2011$life_expectancy

ss00 = stats::smooth.spline(x, y, spar=0)
ss05 = stats::smooth.spline(x, y, spar=0.5)
ss10 = stats::smooth.spline(x, y, spar=1)
ss15 = stats::smooth.spline(x, y, spar=1.5)
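Rather than fixing spar by hand, smooth.spline() can also select the amount of smoothing itself; by default (cv = FALSE) it minimizes the generalized cross-validation criterion, while cv = TRUE requests leave-one-out CV (of dubious validity here, since fertility has tied values). A quick sketch:

ss.gcv = stats::smooth.spline(x, y)   # GCV-selected smoothing
ss.gcv$df                             # effective degrees of freedom of the selected fit
ss.gcv$spar                           # corresponding smoothing parameter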

In order to display the smoothing splines over the datapoints, we use the broom::augment() function, which provides the fitted value of each spline at the various fertility values in the dataset.

plot1 <- ggplot2::ggplot(broom::augment(ss00, gapminder.2011),
                ggplot2::aes(x=fertility, y=life_expectancy)) +
  ggplot2::geom_point(colour="red") +
  ggplot2::geom_line(ggplot2::aes(y = .fitted), 
                     colour="blue", size=1.1) + 
  ggplot2::ggtitle("Smoothing Spline - spar=0") +
  ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(broom::augment(ss05, gapminder.2011),
                ggplot2::aes(x=fertility, y=life_expectancy)) +
  ggplot2::geom_point(colour="red") +
  ggplot2::geom_line(ggplot2::aes(y = .fitted), 
                     colour="blue", size=1.1) + 
  ggplot2::ggtitle("Smoothing Spline - spar=0.5") +
  ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(broom::augment(ss10, gapminder.2011),
                ggplot2::aes(x=fertility, y=life_expectancy)) +
  ggplot2::geom_point(colour="red") +
  ggplot2::geom_line(ggplot2::aes(y = .fitted), 
                     colour="blue", size=1.1) + 
  ggplot2::ggtitle("Smoothing Spline - spar=1") +
  ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(broom::augment(ss15, gapminder.2011),
                ggplot2::aes(x=fertility, y=life_expectancy)) +
  ggplot2::geom_point(colour="red") +
  ggplot2::geom_line(ggplot2::aes(y = .fitted), 
                     colour="blue", size=1.1) + 
  ggplot2::ggtitle("Smoothing Spline - spar=1.5") +
  ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Note the evolution from a flexible but non-interpretable model (the wiggly curve associated with spar=0) to a rigid but highly interpretable model (the line associated with spar=1.5) as the spar values increase.

12.5.3 Generalized Additive Models

While polynomial regression and splines can be applied to predictor sets, they are best-suited to predicting a response \(Y\) on the basis of a single predictor \(X\) (the model complexity increases quickly if more than one predictor is present).

Generalized additive models (GAM) allow for flexible non-linearities in several variables while retaining the additive structure of linear models: \[y_i=\beta_0+f_1(x_{i,1})+\cdots+f_p(x_{i,p})+\varepsilon_i,\quad i=1,\ldots,N\] where each of the \(f_j\) can be derived using any of the methods previously discussed; if \[\begin{aligned} f_1(x_{i,1})&=\beta_{1,1}b_{1,1}(x_{i,1})+\cdots + \beta_{1,L_1}b_{1,L_1}(x_{i,1}) \\ &\ \vdots \\ f_p(x_{i,p})&=\beta_{p,1}b_{p,1}(x_{i,p})+\cdots + \beta_{p,L_p}b_{p,L_p}(x_{i,p}),\end{aligned}\] say, we would fit the data using OLS (but this cannot be done if one of the components is a smoothing spline, for instance, or if it is non-linear in some other way).

In practice, using natural cubic splines for the quantitative components seems to work as well as smoothing splines, when it comes to making predictions.

GAM are implemented in R in the gam and mgcv packages; a typical call to the gam package’s gam() function might look like:

library(gam)      # this syntax (s(x, df), lo()) is the gam package's, not mgcv's
library(splines)  # for bs() and ns()
gam(y~s(x1,df=5) + 
      lo(x2,span=0.5)+ 
      bs(x3,df=4) +
      ns(x4,df=5):ns(x5,df=5) +
      x6, data=dat)

which would indicate that the contribution of:

  • \(X_1\) is given by a smoothing spline with 5 degrees of freedom,

  • \(X_2\) is given by a local regression with span=0.5,

  • \(X_3\) is given by a cubic spline with 4 degrees of freedom,

  • the fourth component is an interaction term based on natural splines for \(X_4\) and \(X_5\) (each with 5 degrees of freedom), and

  • \(X_6\) is directly added to the model.

GAM provide a useful compromise between linear models and fully non-parametric models.

Advantages:

  • GAM can fit a non-linear \(f_j\) to each predictor \(X_j\), so that they could capture trends that linear regression would miss;

  • GAM can reduce the number of data transformations to try out manually on each predictor \(X_j\);

  • non-linear fits may improve accuracy of predictions for the response \(Y\);

  • GAM are useful for inference due to their additivity – the effect of \(X_j\) on \(Y\) (while keeping other predictors fixed) can be analyzed separately;

  • the overall smoothness of the model can be summarized via effective degrees of freedom/parameters.

Disadvantages:

  • GAM still suffer from the curse of dimensionality;

  • GAM are restricted to additive models – interaction terms can be added manually by introducing new predictors \(X_j\times X_k\), as can interaction functions \(f_{j,k}(X_j,X_k)\) (using local regression or MARS, say), but they quickly get out of hand (due to CoD issues); a sketch of one such smooth interaction term follows below.
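In mgcv, for instance, such a smooth interaction can be specified with a tensor-product interaction term; the following sketch (on the 2011 Gapminder data used in the example below) adds a ti() interaction on top of the main-effect smooths:

library(mgcv)
b.int <- gam(life_expectancy ~ s(fertility) + s(infant_mortality) +
               ti(fertility, infant_mortality), data = gapminder.2011)
summary(b.int)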

Example

The charts below show the individual contributions of fertility, infant mortality, GDP, and continental membership to life expectancy in the 2011 Gapminder data (using the entire set as training data).

library(mgcv)
b<-gam(gapminder.2011$life_expectancy ~  s(gapminder.2011$fertility) + 
               s(gapminder.2011$infant_mortality) +
               s(gapminder.2011$gdp) + 
               gapminder.2011$continent)
summary(b)

Family: gaussian 
Link function: identity 

Formula:
gapminder.2011$life_expectancy ~ s(gapminder.2011$fertility) + 
    s(gapminder.2011$infant_mortality) + s(gapminder.2011$gdp) + 
    gapminder.2011$continent

Parametric coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       68.1186     0.7470  91.190  < 2e-16 ***
gapminder.2011$continentAmericas   4.4787     1.1161   4.013 9.30e-05 ***
gapminder.2011$continentAsia       4.7110     0.9993   4.714 5.35e-06 ***
gapminder.2011$continentEurope     3.4179     1.3209   2.588   0.0106 *  
gapminder.2011$continentOceania   -0.2891     1.3798  -0.210   0.8343    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                     edf Ref.df      F p-value    
s(gapminder.2011$fertility)        1.000  1.000  4.474   0.036 *  
s(gapminder.2011$infant_mortality) 3.027  3.800 40.541  <2e-16 ***
s(gapminder.2011$gdp)              1.478  1.779  0.367   0.575    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.828   Deviance explained = 83.8%
GCV = 13.199  Scale est. = 12.363    n = 166

We see in the output that the intercept is \(\beta_0=68.1186\) and that \[\beta_{\text{continent}}=\begin{cases} 0 & \text{Africa} \\ 4.4787 & \text{Americas} \\ 4.7110 & \text{Asia} \\ 3.4179 & \text{Europe} \\ -0.2891 & \text{Oceania} \\ \end{cases} \] so that predictions take the form \[\begin{aligned} \text{life expectancy}&\approx \beta_0 + f_1(\text{fertility})+f_2(\text{infant mortality}) \\ & \ \ \ \ + f_3(\text{gdp}) + \beta_{\text{continent}}.\end{aligned}\]

plot.gam(b)

For instance, the life expectancy for an American country with fertility\(=3\), infant mortality\(=1\), GDP\(=6\times 10^{12}\) would be approximately \[68.1+0+10+2+4.5=84.6.\]

Take the time to read the mgcv and the gam documentation to better understand how these work in practice (in particular, how to make predictions on test/new observations).

References

[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[236]
B. Boehmke and B. Greenwell, Hands-On Machine Learning with R. CRC Press.