12.7 Exercises

1. Let $$(X,Y)$$ be a bivariate normal random variable with parameters $\mu_X=12,\mu_Y=-7,\sigma^2_X=1,\sigma^2_Y=2,\sigma_{XY}=4.$ Consider the parameter $\alpha=\frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}.$ Using a bootstrap procedure with $$N=100$$ samples and $$M=200$$ replicates, provide a confidence interval for the true value of $$\alpha$$.

2. Explicitly obtain the polynomial regression models in the Gapminder Example, for $$d=2, 3, 4$$.

3. Play around with a variety of knots in the step function regression models for the Gapminder Example, and build the corresponding confidence intervals (including those of the example). How would you determine the number and location of the knots?

4. Determine the optimal number of knots $$K$$ for cubic splines and natural cubic splines for the Gapminder Example, using cross-validation.

5. Build piecewise cubic splines and continuous piecewise cubic splines for the Gapminder Example. Use cross-validation to determine the optimal number of knots.

6. Predict life expectancy of countries in 2011 using the various spline models (in the text and in the exercises) on the Gapminder dataset, with training/testing pairs. Evaluate your models. Which ones perform best?

7. Predict life expectancy of countries in 2011 using various GAM models on the Gapminder dataset, with training/testing pairs. Evaluate your models. Which ones perform best?

8. Consider the dataset algae_blooms.csv, as in Section 12.6. Run the analysis with a scaled dataset. Run the analysis with a PCA-reduced dataset. Do the results change significantly?

9. Consider the datasets GlobalCitiesPBI.csv, 2016collisionsfinal.csv, polls_us_election_2016.csv, HR_2016_Census_simple.csv, and UniversalBank.csv (as described in the Exercises sections of Modules 8, 9, and 11). For each of these datasets, identify a response variable (or more than one, if the fancy strikes you) and predictors, and build models to predict the response(s) using the various methods discussed in this module. Evaluate and rank the resulting models. You may need to clean, transform, and visualize the data along the way.

10. Complete the following definition of the Python function kfoldCV(k, data, yname, formulas), where k is the number of folds, data is the dataset, yname is the column name of the dependent variable, and formulas is a list of formulas. The function should return the tuple fit, f, where f is the formula in formulas with the smallest $$k$$-fold CV estimate and fit is the corresponding OLS model. Use your function on the mpg dataset with $$k=10$$ to obtain a good model for predicting mpg.

import seaborn as sns

def kfoldCV(k, data, yname, formulas):
    # ... complete the body here ...
    return fit, f
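
As a possible starting point for Exercise 10, the fold-splitting step of $$k$$-fold cross-validation can be sketched on its own, before any model fitting. The helper name kfold_indices, its seed argument, and the interleaved way of forming folds are assumptions made for illustration; they are not part of the exercise statement, and other splitting schemes would work equally well.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split range(n) into k shuffled folds of near-equal size.

    Hypothetical helper (not part of the exercise statement).
    Returns a list of (train_idx, test_idx) pairs, one per fold.
    """
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffle once, reproducibly
    # Fold j takes every k-th shuffled index starting at position j,
    # so fold sizes differ by at most one.
    folds = [idx[j::k] for j in range(k)]
    splits = []
    for j in range(k):
        test = folds[j]
        # Training indices are all indices outside the held-out fold.
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        splits.append((train, test))
    return splits

# Example: 23 observations split into 5 folds.
splits = kfold_indices(n=23, k=5)
for train, test in splits:
    print(len(train), len(test))
```

Inside kfoldCV, each pair could be used to select rows with data.iloc[train] and data.iloc[test] (assuming data is a pandas DataFrame, which is what seaborn's load_dataset returns). The CV estimate for a given formula would then be the average of the test-fold mean squared errors over the $$k$$ folds.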