12.5 Nonlinear Modeling

In practice, the linearity assumption is almost never met, and the regression function $y=f(\mathbf{x})=\text{E}[Y\mid \mathbf{X}=\mathbf{x}]$ has to be approximated by some other technique. Or does it?

The linearity assumption is often “good enough” even when it is not strictly met; coupled with its convenience of use and its multiple extensions, it is rarely a waste of time to give the linear approach a try.

When heavier machinery is required, it pays to consider the following OLS generalizations, which offer a lot of flexibility without sacrificing interpretability, before jumping to the so-called black box models (SVM, ANN, ensemble learning, etc.) of Module 13:

• curve fitting (polynomial regression, step functions, splines, etc.);

• local regression methods; or

• generalized additive models.

12.5.1 Basis Function Models

If we have reason to suspect that the response $$Y$$ is not a linear combination of the predictors, we might benefit from using a derived set of predictors (see [3, Sec. 7.3]).

Polynomial Regression

We can extend the simple linear model $$y_i=\beta_0+\beta_1x_i+\varepsilon_i$$, $$i=1,\ldots, N$$, by allowing for polynomial basis terms in the regression function: $y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\cdots+\beta_dx_i^d+\varepsilon_i,\quad i=1,\ldots, N.$ The regression function is non-linear in terms of the observations $$x_i$$, but it is linear in terms of the coefficients $$\beta_j$$ (or in terms of the predictors $$\{x,x^2,\ldots,x^d\}$$). We thus create new variables $$X_1=X$$, $$X_2=X^2$$, and so on, and estimate the regression function $$y=f(\mathbf{x})$$ via $$\hat{f}(\mathbf{x})=\mathbf{x}^{\!\top}\mathbf{\hat{\beta}}$$, where the coefficients $$\mathbf{\hat{\beta}}$$ are learned using the training set $$\text{Tr}$$.

Typically, the coefficient values are of little interest – it is the predictions $$\hat{f}(\mathbf{\tilde{x}})$$ that are sought.

It is easy to obtain an estimate of $$\text{Var}(\hat{f}(\mathbf{\tilde{x}}))$$ since $$\hat{f}(\mathbf{\tilde{x}})$$ is linear in the coefficients $$\hat{\beta}_i$$, $$i=0,\ldots, d$$: \begin{aligned} \text{Var}(\hat{f}(\mathbf{\tilde{x}}))&=\text{Var}(\mathbf{\tilde{x}}^{\!\top}\mathbf{\hat{\beta}})=\sum_{i,j=0}^d\text{Cov}(\hat{\beta}_i\tilde{x}_i,\hat{\beta}_j\tilde{x}_j)\\&=\sum_{i,j=0}^d\tilde{x}_i\tilde{x}_j\text{Cov}(\hat{\beta}_i,\hat{\beta}_j)=\mathbf{\tilde{x}}^{\!\top}\text{Cov}(\mathbf{\hat{\beta}})\mathbf{\tilde{x}}\\&=\sigma^2\mathbf{\tilde{x}}^{\!\top}(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{\tilde{x}}.\end{aligned} The estimated variance of the approximation at $$\mathbf{\tilde{x}}$$ is thus $\hat{\text{Var}}(\hat{f}(\mathbf{\tilde{x}}))= \frac{\text{SSRes}}{N-d-1}\mathbf{\tilde{x}}^{\!\top}(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{\tilde{x}}= \frac{\|\mathbf{Y}-\mathbf{X}\mathbf{\hat{\beta}}\|_2^2}{N-d-1}\mathbf{\tilde{x}}^{\!\top}(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{\tilde{x}},$ with $$\text{se}(\hat{f}(\mathbf{\tilde{x}}))=\sqrt{\hat{\text{Var}}(\hat{f}(\mathbf{\tilde{x}}))}$$, so that $\hat{f}(\mathbf{\tilde{x}})\pm 2\,\text{se}(\hat{f}(\mathbf{\tilde{x}}))$ constitutes an approximate 95% $$\text{C.I.}$$ for $$f(\mathbf{\tilde{x}})$$, assuming normality of the error terms.
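In R, these pointwise standard errors do not need to be assembled by hand; predict() returns them via its se.fit argument. A minimal sketch on synthetic data (the objects x, y, fit, and pred below are ours, not part of any example in this module):

set.seed(0)
x <- runif(100, 0, 5)                       # hypothetical predictor
y <- sin(x) + rnorm(100, sd = 0.2)          # hypothetical response
fit  <- lm(y ~ poly(x, 4))                  # degree-4 polynomial regression
pred <- predict(fit, newdata = data.frame(x = seq(0, 5, length.out = 50)), se.fit = TRUE)
upper <- pred$fit + 2*pred$se.fit           # f-hat +/- 2 se(f-hat): approximate 95% C.I.
lower <- pred$fit - 2*pred$se.fit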

Gapminder Example

The charts below show polynomial regressions ($$d=4$$) and confidence intervals for life expectancy against 4 different predictors in the 2011 Gapminder data (assuming that the training set $$\text{Tr}$$ is the entire dataset).

par(mfrow=c(2,2))
plot1 <- ggplot2::ggplot(gapminder.2011, ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) +
ggplot2::ggtitle("Polynomial regression - d=4") +
ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011, ggplot2::aes(x=infant_mortality, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) +
ggplot2::ggtitle("Polynomial regression - d=4") +
ggplot2::theme_bw()

plot3 <- gapminder.2011 |> dplyr::mutate(lgdppc=log(gdp/population)) |>
ggplot2::ggplot(ggplot2::aes(x=lgdppc, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) +
ggplot2::ggtitle("Polynomial regression - d=4") +
ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011, ggplot2::aes(x=gdp, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_smooth(method='lm', formula = y~poly(x,4)) +
ggplot2::ggtitle("Polynomial regression - d=4") +
ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

In this example, we picked $$d=4$$. How do we select $$d$$, in general? We can either pick a reasonably small $$d$$ (often below 4) or use cross-validation to select the $$d$$ that minimizes the estimated $$\text{MSE}_\text{Te}$$.
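A possible cross-validation sketch for the fertility model (assuming that gapminder.2011 has no missing values in life_expectancy and fertility):

set.seed(0)
folds <- sample(rep(1:10, length.out = nrow(gapminder.2011)))   # assign each observation to one of 10 folds
cv.mse <- sapply(1:8, function(d){
  mean(sapply(1:10, function(k){
    train <- gapminder.2011[folds != k, ]
    test  <- gapminder.2011[folds == k, ]
    fit   <- lm(life_expectancy ~ poly(fertility, d), data = train)
    mean((test$life_expectancy - predict(fit, newdata = test))^2)  # fold-k MSE estimate
  }))
})
which.min(cv.mse)   # degree with the smallest estimated MSE_Te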

Note that it is easy to incorporate more than one predictor and interaction terms into the model.

The nature of polynomials ($$|f(\mathbf{x})|\to \infty$$ when $$\|\mathbf{x}\|\to\infty$$) is such that tail behaviour is usually quite horrible (look at the bottom-right example above). Consequently, polynomial regression should be used very carefully, staying within the domain and making sure to centre the predictors to reduce variance inflation.
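Incidentally, the poly() calls used in the charts above already mitigate part of the variance inflation problem: by default, poly() builds orthogonal (centred and scaled) polynomial basis columns rather than raw powers, which are nearly collinear. A quick illustration on an arbitrary predictor (sketch):

x <- 1:100
cor(x, x^2)                 # raw powers are highly correlated (close to 1)
round(cor(poly(x, 4)), 2)   # orthogonal polynomial columns are essentially uncorrelated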

Step Functions

Polynomial regression is an attractive approach because of the ease with which we can use the apparatus of OLS, but the elephant in the room is that we are imposing a global structure on the non-linear function $$y=f(\mathbf{x})$$, and that cannot always be justified.

Step functions can be used to keep things “local”. Let $$c_i$$, $$i=1,\ldots, K$$ lie in $$\text{range}(X)$$ and consider the following $$K+1$$ new predictors: \begin{aligned} C_0(X)&=\mathcal{I}(X<c_1) \\ C_i(X)&=\mathcal{I}(c_i\leq X<c_{i+1}),\ i=1,\ldots, K-1 \\ C_K(X)&=\mathcal{I}(c_K\leq X),\end{aligned} where $$\mathcal{I}$$ is the indicator function $\mathcal{I}(\alpha)=\begin{cases}0, & \text{$\alpha$ is false} \\ 1, & \text{$\alpha$ is true.} \end{cases}$ For any $$X$$, $$C_0(X)+C_1(X)+\cdots+C_K(X)=1$$, since $$X$$ lies in exactly one of the intervals $(-\infty,c_1),\ [c_1,c_2),\ \ldots,\ [c_{K-1},c_K),\ [c_K,\infty).$ The step function regression model is $Y_i=\beta_0+\beta_1C_1(X_i)+\cdots+\beta_KC_K(X_i)+\varepsilon_i,\quad i=1,\ldots,N;$ it can also be obtained using the OLS framework.216

For a given $$X$$, at most one of $$C_1(X),\ldots,C_K(X)$$ is $$\neq 0$$; thus, when $$X<c_1$$, $$C_j(X)=0$$ for all $$j=1,\ldots, K$$, and so $\beta_0=\text{Avg}[Y\mid X<c_1].$ For $$X\in [c_j,c_{j+1})$$, $$\hat{y}=\beta_0+\beta_j$$, so $$\beta_j$$ represents the average increase in $$Y$$ for $$[c_j,c_{j+1})$$ relative to $$(-\infty,c_1)$$.

The only major challenge with step function regression is that there is no easy way to find the number $$K$$ and select the position of the breakpoints $$c_1,\ldots,c_K$$, unless there are natural gaps in the predictors. We will discuss a strategy to determine the number and location of knots when we discuss classification and regression trees in Module 13.

We did not discuss how step function regression or polynomial regression could be achieved in higher dimensions, but the principle remains the same, except that the number of parameters increases drastically, which can create overfitting issues (see the sketch below).
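For instance, here is a sketch of step function regression with one and then two predictors, using cut() with quantile-based breakpoints (the choice of quartiles is arbitrary, for illustration only):

brk.f <- quantile(gapminder.2011$fertility, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
brk.i <- quantile(gapminder.2011$infant_mortality, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# one predictor: 3 interior breakpoints, 4 parameters
m1 <- lm(life_expectancy ~ cut(fertility, brk.f, include.lowest=TRUE), data = gapminder.2011)

# two predictors with interactions: the parameter count grows multiplicatively (4 x 4 cells)
m2 <- lm(life_expectancy ~ cut(fertility, brk.f, include.lowest=TRUE) *
           cut(infant_mortality, brk.i, include.lowest=TRUE), data = gapminder.2011)

length(coef(m1)); length(coef(m2))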

Gapminder Example

The charts below show step function regressions and confidence intervals for life expectancy against 4 different predictors in the 2011 Gapminder data (assuming that the training set $$\text{Tr}$$ is the entire dataset).

We start by building a step function model with $$K=3$$ knots for life expectancy against fertility, using the (arbitrary) knot values 2, 4, and 6:

gapminder.2011 <- gapminder.2011 |>
dplyr::mutate(fert0=I(fertility<2),
fert1=I(2<=fertility & fertility<4),
fert2=I(4<=fertility & fertility<6),
fert3=I(6<=fertility))
model.sf.1 = lm(life_expectancy~fert0+fert1+fert2,data=gapminder.2011)
summary(model.sf.1)

Call:
lm(formula = life_expectancy ~ fert0 + fert1 + fert2, data = gapminder.2011)

Residuals:
Min       1Q   Median       3Q      Max
-24.0485  -2.8300   0.2515   3.9669  12.1515

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  60.5444     1.9554  30.963  < 2e-16 ***
fert0TRUE    16.8106     2.0969   8.017 2.04e-13 ***
fert1TRUE    10.2040     2.0845   4.895 2.36e-06 ***
fert2TRUE     0.7814     2.2212   0.352    0.725
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.866 on 162 degrees of freedom
Multiple R-squared:  0.5308,    Adjusted R-squared:  0.5221
F-statistic: 61.09 on 3 and 162 DF,  p-value: < 2.2e-16

The corresponding step function is defined with:

g.1 <- function(x){
model.sf.1$coefficients[1] + model.sf.1$coefficients[2]*I(x<2) +
model.sf.1$coefficients[3]*I(2<=x & x<4) + model.sf.1$coefficients[4]*I(4<=x & x<6)
}
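Equivalently, the fitted step function can be evaluated with predict(), which avoids copying the coefficients by hand; a sketch on a grid of hypothetical fertility values:

new.fert <- data.frame(fertility = seq(1, 7, by = 0.5)) |>
dplyr::mutate(fert0=I(fertility<2),
fert1=I(2<=fertility & fertility<4),
fert2=I(4<=fertility & fertility<6))
predict(model.sf.1, newdata = new.fert)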

We next build a step function model with $$K=4$$ knots for life expectancy against infant mortality, using the (arbitrary) knot values 10, 20, 40, and 70:

gapminder.2011 <- gapminder.2011 |>
dplyr::mutate(inf0=I(infant_mortality<10),
inf1=I(10<=infant_mortality & infant_mortality<20),
inf2=I(20<=infant_mortality & infant_mortality<40),
inf3=I(40<=infant_mortality & infant_mortality<70),
inf4=I(70<=infant_mortality))
model.sf.2 = lm(life_expectancy~inf0+inf1+inf2+inf3,data=gapminder.2011)
summary(model.sf.2)

g.2 <- function(x){
model.sf.2$coefficients[1] + model.sf.2$coefficients[2]*I(x<10) +
model.sf.2$coefficients[3]*I(10<=x & x<20) + model.sf.2$coefficients[4]*I(20<=x & x<40) +
model.sf.2$coefficients[5]*I(40<=x & x<70)
}

Call:
lm(formula = life_expectancy ~ inf0 + inf1 + inf2 + inf3, data = gapminder.2011)

Residuals:
Min       1Q   Median       3Q      Max
-13.9800  -2.3800   0.3725   2.7622   9.3200

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   56.675      1.207  46.939  < 2e-16 ***
inf0TRUE      22.017      1.340  16.437  < 2e-16 ***
inf1TRUE      17.963      1.390  12.928  < 2e-16 ***
inf2TRUE      11.782      1.429   8.247 5.46e-14 ***
inf3TRUE       5.305      1.399   3.791 0.000211 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.183 on 161 degrees of freedom
Multiple R-squared:  0.763,    Adjusted R-squared:  0.7571
F-statistic: 129.5 on 4 and 161 DF,  p-value: < 2.2e-16

We next build a step function model with $$K=3$$ knots for life expectancy against the log of GDP per capita, using the (arbitrary) knot values 6, 8, and 10:

gapminder.2011 <- gapminder.2011 |>
dplyr::mutate(lgdppc=log(gdp/population)) |>
dplyr::mutate(lgdppc0=I(lgdppc<6),
lgdppc1=I(6<=lgdppc & lgdppc<8),
lgdppc2=I(8<=lgdppc & lgdppc<10),
lgdppc3=I(10<=lgdppc))
model.sf.3 = lm(life_expectancy~lgdppc0+lgdppc1+lgdppc2,data=gapminder.2011)
summary(model.sf.3)

The corresponding step function is defined with:

g.3 <- function(x){
model.sf.3$coefficients[1] + model.sf.3$coefficients[2]*I(x<6) +
model.sf.3$coefficients[3]*I(6<=x & x<8) + model.sf.3$coefficients[4]*I(8<=x & x<10)
}

Call:
lm(formula = life_expectancy ~ lgdppc0 + lgdppc1 + lgdppc2, data = gapminder.2011)

Residuals:
Min      1Q  Median      3Q     Max
-21.771  -1.831   0.550   3.691   9.789

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   81.100      1.270  63.857  < 2e-16 ***
lgdppc0TRUE  -20.788      1.708 -12.174  < 2e-16 ***
lgdppc1TRUE  -12.629      1.453  -8.692 3.75e-15 ***
lgdppc2TRUE   -6.012      1.509  -3.984 0.000102 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.82 on 162 degrees of freedom
Multiple R-squared:  0.5382,    Adjusted R-squared:  0.5296
F-statistic: 62.93 on 3 and 162 DF,  p-value: < 2.2e-16

Finally, we build a step function model with $$K=6$$ knots for life expectancy against the log of GDP per capita, using the (arbitrary) knot values 5, 6, 7, 8, 9, and 10:

gapminder.2011 <- gapminder.2011 |>
dplyr::mutate(lgdppc=log(gdp/population)) |>
dplyr::mutate(lgdppc0=I(lgdppc<5),
lgdppc1=I(5<=lgdppc & lgdppc<6),
lgdppc2=I(6<=lgdppc & lgdppc<7),
lgdppc3=I(7<=lgdppc & lgdppc<8),
lgdppc4=I(8<=lgdppc & lgdppc<9),
lgdppc5=I(9<=lgdppc & lgdppc<10),
lgdppc6=I(10<=lgdppc))
model.sf.4 = lm(life_expectancy~lgdppc0+lgdppc1+lgdppc2+lgdppc3+lgdppc4+lgdppc5,data=gapminder.2011)
summary(model.sf.4)

The corresponding step function is defined with:

g.4 <- function(x){
model.sf.4$coefficients[1] + model.sf.4$coefficients[2]*I(x<5) +
model.sf.4$coefficients[3]*I(5<=x & x<6) + model.sf.4$coefficients[4]*I(6<=x & x<7) +
model.sf.4$coefficients[5]*I(7<=x & x<8) + model.sf.4$coefficients[6]*I(8<=x & x<9) +
model.sf.4$coefficients[7]*I(9<=x & x<10)
}

Call:
lm(formula = life_expectancy ~ lgdppc0 + lgdppc1 + lgdppc2 +
lgdppc3 + lgdppc4 + lgdppc5, data = gapminder.2011)

Residuals:
Min       1Q   Median       3Q      Max
-22.8250  -1.3500   0.5964   3.1841  12.2929

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   81.100      1.204  67.367  < 2e-16 ***
lgdppc0TRUE  -21.300      4.082  -5.217 5.59e-07 ***
lgdppc1TRUE  -20.746      1.648 -12.585  < 2e-16 ***
lgdppc2TRUE  -15.993      1.593 -10.042  < 2e-16 ***
lgdppc3TRUE  -10.275      1.487  -6.912 1.10e-10 ***
lgdppc4TRUE   -7.187      1.559  -4.610 8.24e-06 ***
lgdppc5TRUE   -4.190      1.724  -2.431   0.0162 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.517 on 159 degrees of freedom
Multiple R-squared:  0.5927,    Adjusted R-squared:  0.5774
F-statistic: 38.57 on 6 and 159 DF,  p-value: < 2.2e-16

The step functions in each of the 4 cases are displayed below:

plot1 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_function(fun=g.1) +
ggplot2::ggtitle("Step Function Regression - K=3") +
ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=infant_mortality, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_function(fun=g.2) +
ggplot2::ggtitle("Step Function Regression - K=4") +
ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=lgdppc, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_function(fun=g.3) +
ggplot2::ggtitle("Step Function Regression - K=3") +
ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=lgdppc, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::stat_function(fun=g.4) +
ggplot2::ggtitle("Step Function Regression - K=6") +
ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

12.5.2 Splines

We can combine polynomial regression and step functions to obtain a more flexible curve fitting approach.

Regression Splines

Instead of fitting a polynomial over the entire range of the predictor $$X$$, we use different polynomials (of degree up to $$3$$, usually) in regions $$R_k$$ (defined by knots in the 1-dimensional case), such as: $Y_i=\begin{cases} \beta_{0,1}+\beta_{1,1}X_{i}+\beta_{2,1}X_i^2+\beta_{3,1}X_i^3 +\varepsilon_i, & \text{if } X_i\in R_1 \\ \beta_{0,2}+\beta_{1,2}X_{i}+\beta_{2,2}X_i^2+\beta_{3,2}X_i^3 +\delta_i, & \text{if } X_i\in R_2. \end{cases}$ Various constraints can be imposed on the polynomials:

• none;

• continuity at each region’s borders;

• $$C^1$$ (continuously differentiable) at each region’s borders;

• etc.

In a sense to be defined shortly, splines have the “maximum” amount of continuity. Note that using more regions leads to a more flexible fit.

In what follows, we assume that the domain is split into $$K+1$$ regions, bounded by knots (there are thus $$K$$ such knots). If we impose no restriction on the functions, we are trying to fit $$K+1$$ piecewise cubic functions to the data; each polynomial has 4 parameters to be estimated, leading to $$4(K+1)$$ effective parameters.

If we impose a continuous fit (the polynomials must agree at the knots), we reduce the number of effective parameters. We can also require a continuously differentiable fit (the derivatives must also agree at the knots), further reducing the number of effective parameters.

A cubic spline (with only $$K+4$$ parameters to fit) is a regression spline which is $$C^2$$ on its domain.
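To see where that count comes from: the $$K+1$$ cubic pieces contribute $$4(K+1)$$ coefficients, while requiring the fit and its first two derivatives to be continuous at each of the $$K$$ knots imposes $$3K$$ constraints, leaving $4(K+1)-3K=K+4$ effective parameters.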

Let $$\xi$$ be a knot and $$x$$ be a predictor value. The positive part function is defined by $w_+=\begin{cases}w & \text{if } w>0 \\ 0 & \text{else.}\end{cases}$ Formally, the linear spline requires $$K$$ knots $$\xi_1,\ldots,\xi_K$$ and has $$K+1$$ effective parameters. The model can be expressed simply using positive parts: \begin{aligned} y_i&=\beta_0+\beta_1x_i+\beta_2(x_i-\xi_1)_++\cdots+\beta_{K+1}(x_i-\xi_K)_++\varepsilon_i;\end{aligned} the cubic spline is: \begin{aligned} y_i&=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3 +\beta_4(x_i-\xi_1)_+^3+\cdots+\beta_{K+3}(x_i-\xi_K)_+^3+\varepsilon_i,\end{aligned} and the natural cubic spline is a cubic spline between $$\xi_1$$ and $$\xi_K$$, with linear extrapolation beyond $$\xi_1$$ and $$\xi_K$$; this adds $$4$$ extra constraints to the cubic spline and allows for more knots while keeping the number of effective parameters identical to that of the linear spline.
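To make the link with OLS explicit, here is a sketch of a cubic spline for life expectancy against fertility built directly from the truncated power basis (the knots at fertility values 2 and 4 are arbitrary choices, for illustration only):

pos <- function(w) pmax(w, 0)   # positive part w_+
xi1 <- 2; xi2 <- 4              # assumed knot locations
fit.cs <- lm(life_expectancy ~ fertility + I(fertility^2) + I(fertility^3) +
I(pos(fertility - xi1)^3) + I(pos(fertility - xi2)^3), data = gapminder.2011)
length(coef(fit.cs))            # K + 4 = 6 effective parameters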

In all instances, the machinery of OLS remains available: predictions, diagnostics, remedial measures, confidence intervals, and extension to logistic regression, as needed.

Gapminder Example

The charts below show cubic splines for life expectancy against fertility in the 2011 Gapminder data (assuming that the training set $$\text{Tr}$$ is the entire dataset).

Cubic splines are modeled using the splines package’s bs() function. In theory, we place more knots in locations where the spline function is believed to vary more rapidly, and fewer knots where it is more stable. In practice, the knots are placed uniformly at quantiles of the predictor variable $$X$$, based on their number.

The syntax for the OLS model formula in R follows the form

response ~ splines::bs(predictor, df)

where the degrees of freedom df are linked to the number of parameters to estimate (in the case of cubic spline, $$\text{df}=K+3$$).
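If we prefer to control the knot locations ourselves, bs() also accepts an explicit knots argument. For instance, with df = 5, the two interior knots default to the terciles of the predictor, so the following sketch should produce an equivalent fit:

kn <- quantile(gapminder.2011$fertility, probs = c(1/3, 2/3), na.rm = TRUE)
fm.knots <- lm(life_expectancy ~ splines::bs(fertility, knots = kn), data = gapminder.2011)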

We start by building a cubic spline with $$K=0$$ knot (so $$K+3=3$$ degrees of freedom).

(fm0 <- lm(life_expectancy ~ splines::bs(fertility, df = 3)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 3))

Coefficients:
(Intercept)  splines::bs(fertility, df = 3)1
79.28                           -11.47
splines::bs(fertility, df = 3)2  splines::bs(fertility, df = 3)3
-27.41                           -16.65  

We can also build cubic splines with $$K=1$$ knot:

(fm1 <- lm(life_expectancy ~ splines::bs(fertility, df = 4)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 4))

Coefficients:
(Intercept)  splines::bs(fertility, df = 4)1
76.772                            3.042
splines::bs(fertility, df = 4)2  splines::bs(fertility, df = 4)3
-17.886                          -17.515
splines::bs(fertility, df = 4)4
-16.842  

Similarly, for $$K=2$$ knots:

(fm2 <- lm(life_expectancy ~ splines::bs(fertility, df = 5)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 5))

Coefficients:
(Intercept)  splines::bs(fertility, df = 5)1
78.395                           -1.110
splines::bs(fertility, df = 5)2  splines::bs(fertility, df = 5)3
-1.318                          -22.715
splines::bs(fertility, df = 5)4  splines::bs(fertility, df = 5)5
-16.188                          -19.721  

And for $$K=10$$ knots:

(fm10 <- lm(life_expectancy ~ splines::bs(fertility, df = 13)))

Call:
lm(formula = life_expectancy ~ splines::bs(fertility, df = 13))

Coefficients:
(Intercept)   splines::bs(fertility, df = 13)1
79.9713                             0.2178
splines::bs(fertility, df = 13)2   splines::bs(fertility, df = 13)3
-4.1261                            -2.5263
splines::bs(fertility, df = 13)4   splines::bs(fertility, df = 13)5
-3.3066                            -0.2613
splines::bs(fertility, df = 13)6   splines::bs(fertility, df = 13)7
-9.6851                            -5.9818
splines::bs(fertility, df = 13)8   splines::bs(fertility, df = 13)9
-12.6963                           -17.6809
splines::bs(fertility, df = 13)10  splines::bs(fertility, df = 13)11
-14.3569                           -28.2957
splines::bs(fertility, df = 13)12  splines::bs(fertility, df = 13)13
-10.4471                           -21.6721  

The knot location for $$K=1$$ is given by:

test1<-eval(attr(fm1$terms, "predvars"))
(g1<-as.numeric(attr(test1[[2]],"knots")))

[1] 2.42

For $$K=2$$ knots:

test2<-eval(attr(fm2$terms, "predvars"))
(g2<-as.numeric(attr(test2[[2]],"knots")))
[1] 1.90 3.15

And for $$K=10$$ knots:

test10<-eval(attr(fm10$terms, "predvars"))
(g10<-as.numeric(attr(test10[[2]],"knots")))

 [1] 1.45 1.53 1.83 2.00 2.31 2.53 2.93 3.64 4.73 5.09

The cubic splines are displayed below:

plot1 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::bs(x, df = 3)) +
ggplot2::ggtitle("Cubic Spline - K=0") +
ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::bs(x, df = 4)) +
ggplot2::geom_vline(xintercept = g1, colour = "deepskyblue") +
ggplot2::ggtitle("Cubic Spline - K=1") +
ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::bs(x, df = 5)) +
ggplot2::geom_vline(xintercept = g2, colour = "deepskyblue") +
ggplot2::ggtitle("Cubic Spline - K=2") +
ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::bs(x, df = 13)) +
ggplot2::geom_vline(xintercept = g10, colour = "deepskyblue") +
ggplot2::ggtitle("Cubic Spline - K=10") +
ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Natural cubic splines are modeled using the splines package ns() function; the knots are, again, placed uniformly at quantiles of the predictor variable $$X$$, based on their number (the knot locations are thus the same as in the cubic spline case). The syntax for the OLS model formula in R follows the form

response ~ splines::ns(predictor, df)

where the degrees of freedom df are linked to the number of parameters to estimate (in the case of a natural cubic spline, $$\text{df}=K+1$$).

We start by building a natural cubic spline with $$K=0$$ knots (so $$K+1=1$$ degree of freedom).
(fm0 <- lm(life_expectancy ~ splines::ns(fertility, df = 1)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 1))

Coefficients:
(Intercept)  splines::ns(fertility, df = 1)
78.09                           -34.27  

We can also build natural cubic splines with $$K=1$$ knot:

(fm1 <- lm(life_expectancy ~ splines::ns(fertility, df = 2)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 2))

Coefficients:
(Intercept)  splines::ns(fertility, df = 2)1
80.05                           -31.54
splines::ns(fertility, df = 2)2
-16.94  

Similarly, for $$K=2$$ knots:

(fm2 <- lm(life_expectancy ~ splines::ns(fertility, df = 3)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 3))

Coefficients:
(Intercept)  splines::ns(fertility, df = 3)1
78.11                           -19.86
splines::ns(fertility, df = 3)2  splines::ns(fertility, df = 3)3
-20.04                           -16.41  

And for $$K=10$$ knots:

(fm10 <- lm(life_expectancy ~ splines::ns(fertility, df = 11)))

Call:
lm(formula = life_expectancy ~ splines::ns(fertility, df = 11))

Coefficients:
(Intercept)   splines::ns(fertility, df = 11)1
80.5276                            -3.5824
splines::ns(fertility, df = 11)2   splines::ns(fertility, df = 11)3
-3.7199                            -0.8336
splines::ns(fertility, df = 11)4   splines::ns(fertility, df = 11)5
-10.3832                           -6.2447
splines::ns(fertility, df = 11)6   splines::ns(fertility, df = 11)7
-14.0084                          -16.4080
splines::ns(fertility, df = 11)8   splines::ns(fertility, df = 11)9
-17.3403                          -21.2666
splines::ns(fertility, df = 11)10  splines::ns(fertility, df = 11)11
-24.4461                          -17.1167  

The natural cubic splines are displayed below:

plot1 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::ns(x, df = 2)) +
ggplot2::ggtitle("Natural Cubic Spline - K=0") +
ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::ns(x, knots = g1, df = 2)) +
ggplot2::geom_vline(xintercept = g1, colour = "deepskyblue") +
ggplot2::ggtitle("Natural Cubic Spline - K=1") +
ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::ns(x, knots = g2, df = 3)) +
ggplot2::geom_vline(xintercept = g2, colour = "deepskyblue") +
ggplot2::ggtitle("Natural Cubic Spline - K=2") +
ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(gapminder.2011,ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(color='red', alpha=0.3) +
ggplot2::geom_smooth(method="lm", formula= y ~ splines::ns(x, knots = g10, df = 11)) +
ggplot2::geom_vline(xintercept = g10, colour = "deepskyblue") +
ggplot2::ggtitle("Natural Cubic Spline - K=10") +
ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Do you notice any difference in the shape of the cubic splines vs. that of the natural cubic splines?

Cross-validation (again!) can be used to determine the optimal $$K$$: compute the estimated $$\text{SSRes}$$ for various $$K$$ (10-fold CV, say), and pick the $$K^*$$ that minimizes the error.
Regression splines often give better results than polynomial regression because they induce flexibility via a large number of parameters $$K$$ with a low polynomial degree $$d\leq 3$$, rather than through a high degree $$d$$ (and the wild variability that such polynomials exhibit, especially near the boundaries of the predictor’s range, as can be observed in the polynomial regression examples above).

Multivariate Adaptive Regression Splines

We can reduce the polynomial degree to $$d\leq 2$$ without losing too much curve fitting accuracy by considering bases consisting of functions of the forms: $1,\quad (x-\xi_k)_{\pm},\quad (x-\xi_{k_1})_{\pm}(x-\xi_{k_2})_{\pm},$ where $$(x-t)_{\pm}$$ is one of the two hinge functions \begin{aligned} (x-t)_+&=\begin{cases}x-t & \text{if } x>t \\ 0 & \text{else}\end{cases} \\ (x-t)_-&=\begin{cases}t-x & \text{if } x<t \\ 0 & \text{else.}\end{cases} \end{aligned} For instance, $$(x-1)_{\pm}$$, $$(x-1)_{+}(x-5)_+$$, and $$(x-1)_+(x-8)_-$$ are shown below:

A multivariate adaptive regression spline (MARS) model is expressed as $y_i=\sum_{k=1}^K\beta_kh_k(x_i)+\varepsilon_i,\quad i=1,\ldots,N,$ where $$h_k$$ is either a constant function, a hinge function, or a product of hinge functions. MARS iteratively adds terms to its model; once a stopping criterion is met, unwanted terms are removed. The model’s growth parallels that of tree-based models, which we will discuss in Module 13, and it has the same advantage that the knots are selected automatically.

Artificial Dataset Example

Let us take a look at a synthetic dataset, generated from $y=f(x)=\frac{\sin(\pi x)}{10}-\sqrt{x}+\exp(x/10)+\varepsilon,$ where $$\varepsilon\sim \mathcal{N}(0,0.04^2)$$.

set.seed(1234)
fx = function(x){ sin(pi*x)/10 - sqrt(x) + exp(x/10) }
x = sort(runif(50, 0, 5))
noise = rnorm(50, 0, 0.04)
y = fx(x) + noise

plot(x, y, col=4)
x.vec = seq(0, 6, length.out=100)
lines(x.vec, fx(x.vec), col="grey", lty=2)

We can fit the data using the mda package’s mars() function in R (but this is a licensed implementation of MARS). There is another implementation in the earth package: enhanced adaptive regression through hinges, or EARTH. Let us use only functions of degree 1 (linear functions and linear hinges, but no interaction terms) for the time being:

MARS.1 = mda::mars(x, y, degree=1)

The output is rather lengthy and is suppressed for readability:

• $call provides the model call;

• $fitted.values contains the estimated values $$\hat{y}_i$$;

• $residuals contains the residuals $$y_i-\hat{y}_i$$, and

• $x gives the hinge functions used in the final model.

Let’s see how good a job MARS did:

plot(x, y, col=4, main="MARS with no interaction terms")
x.vec=seq(0,6, length.out=100)
lines(x.vec, fx(x.vec), col="grey", lty=2)
points(x, MARS.1$fitted.values, col=2, pch=16)
abline(v=MARS.1$cuts[MARS.1$selected.terms[-1]], col="light grey")

The EARTH output is identical, and would be obtained thus:

EARTH.1=earth::earth(x, y, degree=1)
summary(EARTH.1)
Call: earth(x=x, y=y, degree=1)

coefficients
(Intercept)   -0.16254461
h(1.3341-x)    0.70785002
h(x-1.3341)   -0.05502561
h(x-2.53653)  -0.27853205
h(x-3.38547)   0.37209809

Selected 5 of 6 terms, and 1 of 1 predictors
Termination condition: RSq changed by less than 0.001 at 6 terms
Importance: x
Number of terms at each degree of interaction: 1 4 (additive model)
GCV 0.002341396    RSS 0.07871774    GRSq 0.9759754    RSq 0.9831798
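Since the selected EARTH model is linear in its hinge features, it can be reproduced (up to the rounding of the printed knot locations) by an ordinary lm() fit on those features; a quick sanity check (sketch, reusing the x and y from the artificial dataset above):

h <- function(w) pmax(w, 0)   # hinge function (x-t)_+ = max(x-t, 0)
check <- lm(y ~ h(1.3341-x) + h(x-1.3341) + h(x-2.53653) + h(x-3.38547))
coef(check)                   # should essentially match the EARTH.1 coefficients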

What about interaction terms? In order for MARS or EARTH to consider such terms, we must first provide a second predictor.

xnew = x*x
data = data.frame(x,xnew,y)
EARTH.2=earth::earth(y ~ x + xnew , data=data, degree=2)
summary(EARTH.2)
Call: earth(formula=y~x+xnew, data=data, degree=2)

coefficients
(Intercept)     -0.21273261
h(1.43112-x)     0.68806424
h(x-2.62849)    -0.43057541
h(xnew-10.446)   0.05531043

Selected 4 of 6 terms, and 2 of 2 predictors
Termination condition: RSq changed by less than 0.001 at 6 terms
Importance: x, xnew
Number of terms at each degree of interaction: 1 3 (additive model)
GCV 0.00273995    RSS 0.09437758    GRSq 0.971886    RSq 0.9798336

What does the plot look like?

plot(x, y, col=4, main="MARS with interaction terms")
x.vec=seq(0,6, length.out=100)
points(x, EARTH.2$fitted.values, col=2, pch=16)
abline(v=EARTH.2$cuts[EARTH.2$selected.terms[-1]], col="light grey")

EARTH.2$cuts
                      x     xnew
(Intercept)    0.000000  0.00000
h(x-1.43112)   1.431116  0.00000
h(1.43112-x)   1.431116  0.00000
h(xnew-10.446) 0.000000 10.44602
h(10.446-xnew) 0.000000 10.44602
h(x-2.62849)   2.628488  0.00000

Can you spot the non-linear components?

Housing Dataset Example

In this section, we will analyze a housing dataset related to house selling prices in Ames, Iowa (modified from [236]).

We start by reading in the data:

dat.Housing=read.csv("Data/VE_Housing.csv", header=TRUE, stringsAsFactors = TRUE)
dim(dat.Housing)
[1] 1460   81

Next, we count the number of missing values for each variable, excluding those variables with complete rows.

missing=attributes(which(apply(is.na(dat.Housing), 2, sum)>0))$names
apply(is.na(dat.Housing[,missing]), 2, sum)

 LotFrontage        Alley   MasVnrType   MasVnrArea     BsmtQual     BsmtCond
         259         1369            8            8           37           37
BsmtExposure BsmtFinType1 BsmtFinType2   Electrical  FireplaceQu   GarageType
          38           37           38            1          690           81
 GarageYrBlt GarageFinish   GarageQual   GarageCond       PoolQC        Fence
          81           81           81           81         1453         1179
 MiscFeature
        1406

The housing dataset thus consists of $$n=1460$$ observations with $$p=79$$ predictors. There are two other variables: Id and SalePrice, representing the index variable and the response variable, respectively.217 Furthermore, the variables

• LotFrontage

• Alley

• FireplaceQu

• PoolQC

• Fence, and

• MiscFeature

all have anywhere from 259 to 1406 missing observations. The proportions of missing values in these variables are probably too high for imputation (see Module 8 for details), so we elect to remove them from further analyses. Note also that the remaining variables with substantial missingness are all related to Garage and Basement, with the corresponding observations missing for the same houses. Given that there are other variables associated with Garage and Basement, we suspect these variables will not play a crucial role in model building, and we decide to remove them from further analyses as well.

For the remaining three variables with missing values (MasVnrType, MasVnrArea, and Electrical), the number of missing observations is so small that we could easily

• impute these values, or

• perform list-wise deletion.

For the purposes of this example, we opt for deletion and simply remove all columns with missing values.

dat.Housing.new = dat.Housing[,!colnames(dat.Housing)%in%missing]
dim(dat.Housing.new)

[1] 1460   62

We also remove the index variable Id:

dat.Housing.new = subset(dat.Housing.new, select = -c(Id))
dim(dat.Housing.new)

[1] 1460   61

In order to evaluate the effectiveness of the eventual model (i.e., to have good predictive power without overfitting the data), we split the Housing dataset into training and testing sets. The model is then developed using the training set (i.e., optimized using a subset of data), and then later tested for its prediction power using the testing set.
We select roughly 80% of the observations (1160) to be part of the training set:

set.seed(1234) # for replicability
n.train=1160
ind.train=sample(1:nrow(dat.Housing.new), n.train)

The training and testing sets are thus:

dat.train=dat.Housing.new[ind.train,]
dat.test=dat.Housing.new[-ind.train,]

We train EARTH (with interactions) on the training data:

EARTH.3 <- earth::earth(SalePrice~., data=dat.train, degree=2)
summary(EARTH.3)

Call: earth(formula=SalePrice~., data=dat.train, degree=2)

coefficients
(Intercept)                              317.95604
Exterior1stBrkFace                        18.17930
FoundationPConc                           34.63490
h(14442-LotArea)                          -0.00198
h(LotArea-14442)                           0.00048
h(7-OverallQual)                         -19.21820
h(OverallQual-7)                          35.79700
h(7-OverallCond)                          -5.03664
h(2004-YearBuilt)                         -0.69748
h(YearBuilt-2004)                          6.36449
h(1056-BsmtFinSF1)                        -0.02232
h(BsmtFinSF1-1056)                         0.02788
h(2121-TotalBsmtSF)                       -0.01953
h(TotalBsmtSF-2121)                       -0.05503
h(2898-GrLivArea)                         -0.04853
h(GrLivArea-2898)                         -0.31464
h(1-Fireplaces)                           -4.60511
h(Fireplaces-1)                            5.78404
h(642-GarageArea)                         -0.02169
NeighborhoodCrawfor * h(2004-YearBuilt)    0.36780
Condition1Norm * h(GrLivArea-2898)         0.15117
h(YearBuilt-2004) * SaleTypeNew            8.47065
ExterQualGd * h(GarageArea-642)            0.09463
FoundationPConc * h(2898-GrLivArea)       -0.01734
h(GrLivArea-2898) * SaleTypeWD             0.26916
h(OverallQual-7) * h(TotalBsmtSF-1777)     0.03112
h(OverallQual-7) * h(1777-TotalBsmtSF)    -0.03789
h(7-OverallQual) * h(2898-GrLivArea)       0.00738
h(OverallQual-7) * h(2898-GrLivArea)      -0.01228
h(OverallQual-7) * h(TotRmsAbvGrd-9)      27.01137
h(OverallQual-7) * h(9-TotRmsAbvGrd)       7.13483
h(7-OverallCond) * h(X1stFlrSF-1269)      -0.00974
h(7-OverallCond) * h(WoodDeckSF-316)      -0.03691
h(7-OverallCond) * h(316-WoodDeckSF)      -0.01602
h(2005-YearBuilt) * h(1056-BsmtFinSF1)     0.00030
h(YearBuilt-2005) * h(1056-BsmtFinSF1)    -0.00928

Selected 36 of 39 terms, and 19 of 188 predictors
Termination condition: RSq changed by less than 0.001 at 39 terms
Importance: OverallQual, GrLivArea, YearBuilt, SaleTypeWD, BsmtFinSF1, ...
Number of terms at each degree of interaction: 1 18 17
GCV 396.2527    RSS 392191.8    GRSq 0.9377484    RSq 0.9467931

and predict SalePrice on the testing data:

yhat.EARTH.3=predict(EARTH.3, dat.test)

We can evaluate the quality of the predictions on the testing set either by computing $$\text{MSE}_{\text{Te}}$$ directly:

mean((yhat.EARTH.3- dat.test$SalePrice)^2)
[1] 628.3848
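The square root of this quantity (the RMSE) is often easier to interpret since it is expressed in the same units as SalePrice; a quick sketch (here $$\sqrt{628.4}\approx 25.1$$, i.e., predictions are off by roughly 25 thousand dollars, loosely speaking):

sqrt(mean((yhat.EARTH.3 - dat.test$SalePrice)^2))    # test-set RMSE, in $1,000s
cor(as.numeric(yhat.EARTH.3), dat.test$SalePrice)^2  # squared correlation between predictions and observations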

Still, a single numerical summary is of limited use on its own. We get a better sense for the quality of the predictions on the testing set by comparing the actual SalePrice values to the EARTH-predicted SalePrice values:

xlimit=ylimit=c(0,600)
plot(NA, col=2, xlim=xlimit, ylim=ylimit,
ylab="Predicted Price ($1,000)", xlab="Actual Price ($1,000)",
main="MARS/EARTH SalePrice predictions (w column-wise deletion)")
abline(h=seq(0,600, length.out=13), v=seq(0,600, length.out=13), lty=2, col="grey")
abline(a=0, b=1)
points(dat.test$SalePrice, yhat.EARTH.3, col=2)

Smoothing Splines

The charts below show smoothing splines for life expectancy against fertility in the 2011 Gapminder data, for various values of the smoothing parameter spar (again assuming that the training set $$\text{Tr}$$ is the entire dataset).

x=gapminder.2011$fertility
y=gapminder.2011$life_expectancy

ss00 = stats::smooth.spline(x, y, spar=0)
ss05 = stats::smooth.spline(x, y, spar=0.5)
ss10 = stats::smooth.spline(x, y, spar=1)
ss15 = stats::smooth.spline(x, y, spar=1.5)

In order to be able to display the smoothing splines over the datapoints, we use the broom::augment() function, which provides the value of the spline at the various fertility values in the dataset.

plot1 <- ggplot2::ggplot(broom::augment(ss00, gapminder.2011), ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(colour="red") +
ggplot2::geom_line(ggplot2::aes(y = .fitted), colour="blue", size=1.1) +
ggplot2::ggtitle("Smoothing Spline - spar=0") +
ggplot2::theme_bw()

plot2 <- ggplot2::ggplot(broom::augment(ss05, gapminder.2011), ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(colour="red") +
ggplot2::geom_line(ggplot2::aes(y = .fitted), colour="blue", size=1.1) +
ggplot2::ggtitle("Smoothing Spline - spar=0.5") +
ggplot2::theme_bw()

plot3 <- ggplot2::ggplot(broom::augment(ss10, gapminder.2011), ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(colour="red") +
ggplot2::geom_line(ggplot2::aes(y = .fitted), colour="blue", size=1.1) +
ggplot2::ggtitle("Smoothing Spline - spar=1") +
ggplot2::theme_bw()

plot4 <- ggplot2::ggplot(broom::augment(ss15, gapminder.2011), ggplot2::aes(x=fertility, y=life_expectancy)) +
ggplot2::geom_point(colour="red") +
ggplot2::geom_line(ggplot2::aes(y = .fitted), colour="blue", size=1.1) +
ggplot2::ggtitle("Smoothing Spline - spar=1.5") +
ggplot2::theme_bw()

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, ncol=2)

Note the evolution from a flexible but non-interpretable model (the wiggly curve associated with spar=0) to a rigid but highly interpretable model (the line associated with spar=1.5) as the spar values increase.

12.5.3 Generalized Additive Models

While polynomial regression and splines can be applied to sets of predictors, they are best-suited to predicting a response $$Y$$ on the basis of a single predictor $$X$$ (the model complexity increases quickly if more than one predictor is present). Generalized additive models (GAM) allow for flexible non-linearities in several variables while retaining the additive structure of linear models: $y_i=\beta_0+f_1(x_{i,1})+\cdots+f_p(x_{i,p})+\varepsilon_i,\quad i=1,\ldots,N,$ where each of the $$f_j$$ can be derived using any of the methods previously discussed; if \begin{aligned} f_1(x_i)&=\beta_{1,1}b_{1,1}(x_{i,1})+\cdots + \beta_{1,L_1}b_{1,L_1}(x_{i,1}) \\ &\ \vdots \\ f_p(x_i)&=\beta_{p,1}b_{p,1}(x_{i,p})+\cdots + \beta_{p,L_p}b_{p,L_p}(x_{i,p}),\end{aligned} say, we would fit the data using OLS (but this cannot be done if one of the components is a smoothing spline, for instance, or if it is non-linear in some other way). In practice, using natural cubic splines for the quantitative components seems to work as well as smoothing splines when it comes to making predictions.

GAM are implemented in R in the mgcv and gam packages; a typical call (in the gam package’s syntax) might look like:

library(gam)
gam(y~s(x1,df=5) + lo(x2,spar=0.5) + bs(x3,df=4) + ns(x4,df=5):ns(x5,df=5) + x6, data=dat)

which would indicate that the contribution of:

• $$X_1$$ is given by a smoothing spline with 5 degrees of freedom;

• $$X_2$$ is given by a local regression with spar=0.5;

• $$X_3$$ is given by a cubic spline with 4 degrees of freedom;

• the fourth component is an interaction term based on natural splines for $$X_4$$ and $$X_5$$ (each with 5 degrees of freedom), and

• $$X_6$$ is directly added to the model.
GAM provide a useful compromise between linear models and fully non-parametric models.

Advantages:

• GAM can fit a non-linear $$f_j$$ to each predictor $$X_j$$, so that they can capture trends that linear regression would miss;

• GAM can reduce the number of data transformations to try out manually on each predictor $$X_j$$;

• non-linear fits may improve the accuracy of predictions for the response $$Y$$;

• GAM are useful for inference due to their additivity – the effect of $$X_j$$ on $$Y$$ (while keeping the other predictors fixed) can be analyzed separately;

• the overall smoothness of the model can be summarized via effective degrees of freedom/parameters.

Disadvantages:

• GAM still suffer from the curse of dimensionality;

• GAM are restricted to additive models – interaction terms can be added manually by introducing new predictors $$X_j\times X_k$$, as can interaction functions $$f_{j,k}(X_j,X_k)$$ (using local regression or MARS, say), but they quickly get out of hand (due to CoD issues).

Example

The charts below show the individual contributions of fertility, infant mortality, GDP, and continental membership to life expectancy in the 2011 Gapminder data (using the entire set as training data).

library(mgcv)
b <- gam(gapminder.2011$life_expectancy ~ s(gapminder.2011$fertility) + s(gapminder.2011$infant_mortality) +
s(gapminder.2011$gdp) + gapminder.2011$continent)
summary(b)

Family: gaussian

Formula:
gapminder.2011$life_expectancy ~ s(gapminder.2011$fertility) +
s(gapminder.2011$infant_mortality) + s(gapminder.2011$gdp) +
    gapminder.2011$continent

Parametric coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                       68.1186     0.7470  91.190  < 2e-16 ***
gapminder.2011$continentAmericas   4.4787     1.1161   4.013 9.30e-05 ***
gapminder.2011$continentAsia       4.7110     0.9993   4.714 5.35e-06 ***
gapminder.2011$continentEurope     3.4179     1.3209   2.588   0.0106 *
gapminder.2011$continentOceania   -0.2891     1.3798  -0.210   0.8343
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                     edf Ref.df      F p-value
s(gapminder.2011$fertility)        1.000  1.000  4.474   0.036 *
s(gapminder.2011$infant_mortality) 3.027  3.800 40.541  <2e-16 ***
s(gapminder.2011$gdp)              1.478  1.779  0.367   0.575
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.828   Deviance explained = 83.8%
GCV = 13.199  Scale est. = 12.363    n = 166

We see in the output that the intercept is $$\beta_0=68.1186$$ and that $\beta_{\text{continent}}=\begin{cases} 0 & \text{Africa} \\ 4.4787 & \text{Americas} \\ 4.7110 & \text{Asia} \\ 3.4179 & \text{Europe} \\ -0.2891 & \text{Oceania} \\ \end{cases}$ so that predictions take the form \begin{aligned} \text{life expectancy}&\approx \beta_0 + f_1(\text{fertility})+f_2(\text{infant mortality}) \\ & \ \ \ \ + f_3(\text{gdp}) + \beta_{\text{continent}}.\end{aligned}

plot.gam(b)

For instance, the life expectancy for an American country with fertility$$=3$$, infant mortality$$=1$$, GDP$$=6\times 10^{12}$$ would be approximately $68.1+0+10+2+4.5=84.6.$

Take the time to read the mgcv and the gam documentation to better understand how these work in practice (in particular, how to make predictions on test/new observations).
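For instance, the model b above was specified with gapminder.2011$ references inside the formula, which makes predicting on new observations awkward; a cleaner (and otherwise equivalent) specification uses data= and plain variable names. The sketch below assumes that continent is stored as a factor and that new.obs is a hypothetical new observation:

library(mgcv)
b2 <- gam(life_expectancy ~ s(fertility) + s(infant_mortality) + s(gdp) + continent,
          data = gapminder.2011)
new.obs <- data.frame(fertility = 3, infant_mortality = 1, gdp = 6e12,
                      continent = factor("Americas", levels = levels(gapminder.2011$continent)))
predict(b2, newdata = new.obs)   # predicted life expectancy for the new observation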

References

[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[236]
B. Boehmke and B. Greenwell, Hands-On Machine Learning with R. CRC Press.