15.3 Feature Selection

As seen in the previous section, dimension reduction methods can be used to learn low-dimensional manifolds for high-dimensional data, with the hope that the resulting loss in information content can be kept small. Unfortunately, this is not always feasible.

There is another non-technical, yet perhaps more problematic, issue with manifold learning techniques: the reduction often fails to provide an easily interpretable set of coordinates in the context of the original dataset.

For instance, in a dataset with the 4 features \[X_1=\text{Age},\ X_2=\text{Height},\ X_3=\text{Weight},\text{ and }X_4=\text{Gender }\in \{0,1\},\] say, it is straightforward to justify a data-driven decision based on the rule \(X_1= \text{Age}>25\), for example, but perhaps substantially harder to do so for a reduced rule such as \[Y_2=3(\text{Age}-\overline{\text{Age}})-(\text{Height}-\overline{\text{Height}})+4(\text{Weight}-\overline{\text{Weight}})+\text{Gender}>7,\] even if there is nothing wrong with the rule from a technical perspective.

Furthermore, datasets often contain irrelevant and/or redundant features; identifying and removing these variables is a common data processing task. The motivations for doing so are varied, but usually fall into one of two categories:

  • the modeling tools do not handle redundant variables well, due to variance inflation or similar issues, and

  • as an attempt by analysts to overcome the curse of dimensionality or to avoid situations where the number of variables is larger than the number of observations.

In light of the comment above, the goal of feature selection is to remove (and not to transform or project) any attribute that adds noise and reduces the performance of a model, that is to say, to retain a subset of the most relevant features300, which can help create simpler models, decrease a statistical learner’s training time, and reduce overfitting.

There are various feature selection methods, typically falling in one of three families – filter methods, wrapper methods, and embedded methods (most of the next two sections is inspired by [6]):

  • filter methods focus on the relevance of the features, applying a specific ranking metric to each feature, individually. The variables that do not meet a preset benchmark301 are then removed from consideration, yielding a subset of the most relevant features according to the selected ranking metric; different metrics, and different thresholds, might retain different relevant features;

  • wrapper methods focus on the usefulness of each feature to the task of interest (usually classification or regression), but do not consider features individually; rather, they evaluate and compare the performance of different combinations of features in order to select the best-performing subset of features, and

  • embedded methods are a combination of both, using implicit metrics to evaluate the performance of various subsets.

Feature selection methods can also be categorized as unsupervised or supervised:

  • unsupervised methods determine the importance of features using only their values (with potential feature interactions), while

  • supervised methods evaluate each feature’s importance in relationship with the target feature.

Wrapper methods are typically supervised. Unsupervised filter methods search for noisy features and include the removal of constant variables, of ID-like variables (i.e. different on all observations), or features with low variability.

15.3.1 Filter Methods

Filter methods evaluate features without resorting to the use of a classification/regression algorithm. These methods can either be

  • univariate, where each feature is ranked independently, or

  • multivariate, where features are considered jointly.

A filter criterion is chosen based on which metric suits the data or problem best.302 The selected criterion is used to assign a score to, and rank, the features which are then retained or removed in order to yield a relevant subset of features.

Features whose score lies above (or below, as the case may be) some pre-selected threshold \(\tau\) are retained (or removed); alternatively, features whose rank lies below (or above as the case may be) some pre-selected threshold \(\nu\) are retained (or removed). Such methods are advantageous in that they are computationally efficient. They also tend to be robust against overfitting as they do not incorporate the effects of the feature subset selection on classification/regression performance.

There are a number of commonly-used filter criteria, including the Pearson correlation coefficient, information gain (or mutual information), and relief [6]. Throughout, let \(Y\) be the target variable (assuming that there is one), and \(X_1, \ldots, X_p\) be the predictors.

Pearson Correlation Coefficient

The Pearson correlation coefficient [302] quantifies the linear relationship between two continuous variables. For a predictor \(X_i\), the Pearson correlation coefficient between \(X_i\) and \(Y\) is \[\rho_{i}=\frac{\text{Cov}(X_i,Y)}{\sigma_{X_i}\sigma_{Y}}.\] Features for which \(|\rho_i|\) is large (near 1) are linearly correlated with \(Y\), those for which \(|\rho_i|\approx 0\) are not linearly (or anti-linearly) correlated with \(Y\) (which could mean that they are uncorrelated with \(Y\), or that the correlation is not linear or anti-linear). Only those features with (relatively) strong linear (or anti-linear) correlation are retained.

This correlation \(\rho_i\) is only defined if both the predictor \(X_i\) and the outcome \(Y\) are numerical; there are alternatives for categorical \(X_i\) and \(Y\), or mixed categorical-numerical \(X_i\) and \(Y\) [303], [304], [305].

In order to get a better handle on what filter feature selection looks like in practice, consider the Global Cities Index dataset [306], which ranks prominent cities around the globe on a general scale of “Alpha”, “Beta”, “Gamma”, and “Sufficiency” (1, 2, 3, 4, respectively).

This dataset contains geographical, population, and economics data for 68 ranked cities.

globalcities <- data.frame(read.csv("Data/globalcities.csv", stringsAsFactors = TRUE))
colnames(globalcities)[2:7] <- c("City.Area","Metro.Area","City.Pop","Metro.Pop", "Ann.Pop.Growth", "GDPpc")
colnames(globalcities)[12:17] <- c("Higher.Ed.Insts","Life.Exp.M","Life.Exp.F","Hospitals", "Museums", "Air.Quality")
'data.frame':   68 obs. of  18 variables:
 $ Rating           : int  1 3 2 1 1 1 2 1 2 1 ...
 $ City.Area        : num  165 30.7 38.9 1569 102.6 ...
 $ Metro.Area       : num  807 25437 381 7762 3236 ...
 $ City.Pop         : num  0.76 3.54 0.66 5.72 1.62 ...
 $ Metro.Pop        : num  1.4 4.77 4.01 6.5 3.23 ...
 $ Ann.Pop.Growth   : num  0.01 0.26 0 0.03 0.01 0.04 0 0.01 0.01 0.01 ...
 $ GDPpc            : num  46 21.2 30.5 23.4 36.3 20.3 33.3 15.9 69.3 45.6 ...
 $ Unemployment.Rate: num  0.05 0.12 0.16 0.02 0.15 0.01 0.16 0.1 0.07 0.16 ...
 $ Poverty.Rate     : num  0.18 0.2 0.2 0 0.2 0.01 0.22 0.22 0.17 0.26 ...
 $ Major.Airports   : int  1 1 1 2 1 1 2 1 1 2 ...
 $ Major.Ports      : int  1 0 1 1 1 0 2 0 1 1 ...
 $ Higher.Ed.Insts  : int  23 10 21 37 8 89 30 19 35 25 ...
 $ Life.Exp.M       : num  76.3 75.3 78 69 79 79 82 74.6 74.8 77 ...
 $ Life.Exp.F       : num  80.8 80.8 83.7 74 85.2 83 88 79.7 81.1 82.6 ...
 $ Hospitals        : int  7 7 23 173 45 551 79 22 12 43 ...
 $ Museums          : int  68 36 47 27 69 156 170 76 30 25 ...
 $ Air.Quality      : int  24 46 41 54 35 121 26 77 17 28 ...
 $ Life.Expectancy  : num  78.5 78 80.8 71.5 82.1 ...
Rating City.Area Metro.Area City.Pop Metro.Pop Ann.Pop.Growth
Min. :1.000 Min. : 15.9 Min. : 118 Min. : 0.1000 Min. : 0.460 Min. :-0.02000
1st Qu.:1.000 1st Qu.: 123.7 1st Qu.: 1616 1st Qu.: 0.9675 1st Qu.: 3.127 1st Qu.: 0.01000
Median :1.000 Median : 410.0 Median : 3598 Median : 2.9100 Median : 6.485 Median : 0.01000
Mean :1.441 Mean : 944.4 Mean : 6717 Mean : 4.8276 Mean : 8.657 Mean : 0.02265
3rd Qu.:2.000 3rd Qu.:1334.7 3rd Qu.: 7808 3rd Qu.: 7.7425 3rd Qu.:11.660 3rd Qu.: 0.02000
Max. :4.000 Max. :7304.0 Max. :31815 Max. :23.0200 Max. :34.990 Max. : 0.26000
GDPpc Unemployment.Rate Poverty.Rate Major.Airports Major.Ports Higher.Ed.Insts
Min. : 0.50 Min. :0.00000 Min. : 0.0000 Min. :0.000 Min. : 0.00 Min. : 2.00
1st Qu.:21.05 1st Qu.:0.04000 1st Qu.: 0.0675 1st Qu.:1.000 1st Qu.: 1.00 1st Qu.: 14.00
Median :39.05 Median :0.07000 Median : 0.1500 Median :1.000 Median : 1.00 Median : 25.50
Mean :35.01 Mean :0.08603 Mean : 0.7519 Mean :1.618 Mean : 1.25 Mean : 42.24
3rd Qu.:47.85 3rd Qu.:0.11000 3rd Qu.: 0.2025 3rd Qu.:2.000 3rd Qu.: 1.00 3rd Qu.: 52.75
Max. :72.93 Max. :0.37000 Max. :17.3500 Max. :5.000 Max. :17.00 Max. :264.00
Life.Exp.M Life.Exp.F Hospitals Museums Air.Quality Life.Expectancy
Min. :53.00 Min. :55.00 Min. : 2.00 Min. : 1.00 Min. : 12.00 Min. :55.00
1st Qu.:72.38 1st Qu.:76.08 1st Qu.: 18.75 1st Qu.: 18.00 1st Qu.: 23.00 1st Qu.:73.93
Median :76.20 Median :81.05 Median : 33.50 Median : 50.00 Median : 35.00 Median :78.78
Mean :74.12 Mean :78.78 Mean : 86.57 Mean : 64.51 Mean : 51.62 Mean :76.45
3rd Qu.:79.00 3rd Qu.:83.62 3rd Qu.: 84.50 3rd Qu.: 79.50 3rd Qu.: 66.25 3rd Qu.:81.04
Max. :84.90 Max. :88.00 Max. :658.00 Max. :350.00 Max. :198.00 Max. :85.00

The R package FSelector contains feature selection tools, including various filter methods (such as chi-squared score, consistency, various entropy-based filters, etc.). Using its filtering functions, we extract the most relevant features to the ranking of a global city (we treat the Rating variable as a numerical response: is this justifiable?).

For instance, if we retain the 5 top predictors for linear correlation (Pearson’s correlation coefficient) with the response Rating, we obtain:

(lincor <- FSelector::linear.correlation(formula = `Rating` ~ ., 
                                        data = globalcities))
(subset_lincor <- FSelector::cutoff.k(lincor, k = 5))
City.Area            0.0007479646
Metro.Area           0.0055023564
City.Pop             0.1196632421
Metro.Pop            0.2030923952
Ann.Pop.Growth       0.1935336738
GDPpc                0.2090866065
Unemployment.Rate    0.2173999333
Poverty.Rate         0.0536585758
Major.Airports       0.2263265771
Major.Ports          0.0563507487
Higher.Ed.Insts      0.0547453393
Life.Exp.M           0.1302972404
Life.Exp.F           0.1412093767
Hospitals            0.1195832079
Museums              0.1553072283
Air.Quality          0.1382099362
Life.Expectancy      0.1380603739
[1] "Major.Airports"    "Unemployment.Rate" "GDPpc"            
[4] "Metro.Pop"         "Ann.Pop.Growth"   

According to the top-5 linear correlation feature selection method, the best features that relate to a city’s global ranking are the number of major airports it has, its unemployment rate, its GDP per capita, its metropolitan population, and its annual population growth.303

Mutual Information

Information gain is a popular entropy-based method of feature selection that measures the amount of dependence between features by quantifying the amount of mutual information between them. In general, this quantifies the amount of information obtained about a predictor \(X_i\) by observing the target feature \(Y\).

Mutual information can be expressed as \[\text{IG}(X_i;Y) = H(X_i) - H(X_i\mid Y),\] where \(H(X_i)\) is the marginal entropy of \(X_i\) and \(H(X_i\mid Y)\) is the conditional entropy of \(X_i\) given \(Y\) [307], an \[H(X_i) = \text{E}_{X_i}[-\log p(X_i)], \quad H(X_i\mid Y) = \text{E}_{(X_i,Y)}[-\log p(X_i\mid Y)],\] where \(p(X_i)\) and \(p(X_i\mid Y)\) are the probability density functions of the random variables \(X_i\) and \(X_i\mid Y\), respectively.

How is \(\text{IG}\) interpreted? Consider the following example: let \(Y\) represent the salary of an individual (continuous), \(X_1\) their hair colour (categorical), \(X_2\) their age (continuous), \(X_3\) their height (continuous), and \(X_4\) their self-reported gender (categorical). A sample of \(n=2144\) individuals is found in demo1.csv.

salary <- data.frame(read.csv("Data/demo1.csv", stringsAsFactors = TRUE))
colnames(salary)[1] <- c("Hair")

Some summary statistics are shown below:

Summary statistics for the salary dataset; two-way tables use decile data.Summary statistics for the salary dataset; two-way tables use decile data.

Figure 15.15: Summary statistics for the salary dataset; two-way tables use decile data.

In a general population, one would expect that the distribution of salaries, among others, is likely to be fairly haphazard, and it might be hard to explain why it has the shape that it does, specifically.

plot1 <- ggplot2::ggplot(salary, ggplot2::aes(x=Hair)) + 
  ggplot2::geom_bar(color='red', fill='white') +

plot2 <- ggplot2::ggplot(salary, ggplot2::aes(x=Gender)) + 
  ggplot2::geom_bar(color='red', fill='white') +

plot3 <- ggplot2::ggplot(salary, ggplot2::aes(x=Age)) + 
  color='red', fill='white', bins=10) +
  ggplot2::geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) + 

plot4 <- ggplot2::ggplot(salary, ggplot2::aes(x=Height)) + 
  color='red', fill='white', bins=10) +
  ggplot2::geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) + 

plot5 <- ggplot2::ggplot(salary, ggplot2::aes(x=Salary)) + 
  color='red', fill='white', bins=10) +
  ggplot2::geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) + 

gridExtra::grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=2)
Distributions of the predictors and of the response variable.

Figure 13.5: Distributions of the predictors and of the response variable.

Perhaps it could be explained by knowing the relationship between the salary and the other variables? It is this idea that forms the basis of mutual information feature selection.

Applying the definition, one sees that \[\begin{aligned} H(X_1) &=-\sum_{\text{colour}}p(\text{colour})\log p(\text{colour}) \\ H(X_2) &=-\int p(\text{age}) \log p(\text{age}) \, d\text{age} \\ H(X_3) &=-\int p(\text{height}) \log p(\text{height}) \, d\text{height} \\ H(X_4) &=-\sum_{\text{gender}}p(\text{gender})\log p(\text{gender})\\ H(X_1\mid Y) &=-\int p(Y)\left\{\sum_{\text{colour}}p(\text{colour}\mid Y)\log p(\text{colour}\mid Y)\right\}dY \\ H(X_2\mid Y) &=-\iint p(Y) p(\text{age}\mid Y) \log p(\text{age}\mid Y) \, d\text{age}\, dY \\ H(X_3\mid Y) &=-\iint p(Y) p(\text{ht}\mid Y) \log p(\text{ht}\mid Y) \, d\text{ht}\, dY \\ H(X_4\mid Y) &=-\int p(Y)\left\{\sum_{\text{gender}}p(\text{gender}\mid Y)\log p(\text{gender}\mid Y)\right\}\, dY\end{aligned}\]

If the theoretical distributions are known, the entropy integrals can be computed directly (or approximated using standard numerical integration methods).

Gender and hair colour can be fairly easily be modeled using multinomial distributions, but there is more uncertainty related to the numerical variables. A potential approach is to recode the continuous variables age, height, and salary as decile variables \(a_d\), \(h_d\), and \(Y_d\) taking values \(\{1, \ldots, 10\}\) according to which decile of the original variable the observation falls (see Figure 15.15 for the decile breakdown).

The integrals can then be replaced by sums: \[\begin{aligned} H(X_1) &=-\sum_{\text{colour}}p(\text{colour})\log p(\text{colour}) \\ H(X_2) &\approx -\sum_{k=1}^{10} p(\text{a}_d=k) \log p(\text{a}_d=k) \\ H(X_3) &\approx -\sum_{k=1}^{10} p(\text{ht}_d=k) \log p(\text{ht}_d=k) \\ H(X_4) &=-\sum_{\text{gender}} p(\text{gender})\log p(\text{gender})\\ H(X_1\mid Y) &\approx -\sum_{j=1}^{10} p(Y_d=j)\!\!\!\sum_{c\in\text{colour}}\!\!\! p(c\mid Y_d=j)\log p(c\mid Y_d=j) \\ H(X_2\mid Y) & \approx -\sum_{j=1}^{10} p(Y_d=j)\sum_{k=1}^{10}p(a_d=k\mid Y_d=j) \log p(a_d=k\mid Y_d=j) \\ H(X_3\mid Y) &\approx -\sum_{j=1}^{10} p(Y_d=j)\sum_{k=1}^{10}p(h_d=k\mid Y_d=j) \log p(h_d=k\mid Y_d=j) \\ H(X_4\mid Y) &\approx -\sum_{j=1}^{10} p(Y_d=k)\!\!\!\sum_{g\in\text{gender}}\!\!\! p(g\mid Y_d=j)\log p(g\mid Y_d=j)\end{aligned}\]

The results are shown below (using base 10 logarithms, and rounded out to the nearest hundredth):

X H(X) H(X|Y) IG(X;Y) Ratio
Hair 0.24 0.24 0.00 0.00
Age 1.00 0.74 0.26 0.26
Height 1.00 0.96 0.03 0.03
Gender 0.30 0.22 0.08 0.26

The percentage decrease in entropy after having observed \(Y\) is shown in the column “Ratio.” Raw IG numbers would seem to suggest that Gender has a small link to Salary; the Ratio numbers suggest that this could be due to the way the Age and Height levels have been categorized (as deciles).


Relief scores (numerical) features based on the identification of feature value differences between nearest-neighbour instance pairs. If there is a feature value difference in a neighbouring instance pair of the same class, the score of the feature decreases; on the other hand, if there exists a feature value difference in a neighbouring instance pair with different class values, the feature score increases.

More specifically, let \[D = {\{(\mathbf{x}_i, y_i)\}}_{i=1}^n \subset \mathbb{R}^p \times \{\pm1\}\] be a dataset where \(\mathbf{x}_i\) is the \(i\)-th data sample and \(y_n\) is its corresponding class label. For each feature \(j\) and observation \(i\), two values are selected: the \(\textbf{near hit}\) \(H(x_{i,j})\) is the value of \(X_j\) which is nearest to \(x_{i,j}\) among all instances in the same class as \(\mathbf{x}_i\), while the \(\textbf{near miss}\) \(M(x_{i,j})\) is the value of \(X_j\) which is nearest to \(x_{i,j}\) among all instances in the opposite class of \(\mathbf{x}_i\).

The Relief score of the \(j^{\text{\footnotesize th}}\) feature is \[S^d_j = \frac{1}{n}\sum_{i=1}^n\left\{ d(x_{i,j},M(x_{i,j}))-d(x_{i,j},H(x_{i,j}))\right\},\] for some pre-selected distance \(d:\mathbb{R}\times\mathbb{R}\to \mathbb{R}^+_0\).

A feature for which near-hits tend to be nearer to their instances than near-misses are (i.e., for which \[d(x_{i,j},M(x_{i,j})) > d(x_{i,j},H(x_{i,j})),\] on average) will yield larger Relief scores than those for which the opposite is true. Features are deemed relevant when their relief score is greater than some threshold \(\tau\).

There are a variety of Relief-type measurements to accommodate potential feature interactions or multi-class problems304 (ReliefF), but in general Relief is noise-tolerant and robust to interactions of attributes; its effectiveness decreases for small training sets, however [308].

The Relief algorithm is also implemented in the R package CORElearn, as are numerous other methods:

CORElearn::infoCore(what="attrEval") # Classification
CORElearn::infoCore(what="attrEvalReg") # Regression
 [1] "ReliefFequalK"      "ReliefFexpRank"     "ReliefFbestK"      
 [4] "Relief"             "InfGain"            "GainRatio"         
 [7] "MDL"                "Gini"               "MyopicReliefF"     
[10] "Accuracy"           "ReliefFmerit"       "ReliefFdistance"   
[13] "ReliefFsqrDistance" "DKM"                "ReliefFexpC"       
[16] "ReliefFavgC"        "ReliefFpe"          "ReliefFpa"         
[19] "ReliefFsmp"         "GainRatioCost"      "DKMcost"           
[22] "ReliefKukar"        "MDLsmp"             "ImpurityEuclid"    
[25] "ImpurityHellinger"  "UniformDKM"         "UniformGini"       
[28] "UniformInf"         "UniformAccuracy"    "EqualDKM"          
[31] "EqualGini"          "EqualInf"           "EqualHellinger"    
[34] "DistHellinger"      "DistAUC"            "DistAngle"         
[37] "DistEuclid"        
[1] "RReliefFequalK"      "RReliefFexpRank"     "RReliefFbestK"      
[4] "RReliefFwithMSE"     "MSEofMean"           "MSEofModel"         
[7] "MAEofModel"          "RReliefFdistance"    "RReliefFsqrDistance"

We start by declaring the target variable Rating as a categorical variable. We also create a duplicate variable Rating2, to determine the various scores directionality.

globalcities.cat <- globalcities
globalcities.cat$Rating <- as.factor(globalcities.cat$Rating)

Now, let’s evaluate the predictors relevance, using InfGain and ReliefFpe, say:

InfGain.wts <- CORElearn::attrEval(Rating ~ ., globalcities.cat, 
ReliefF.wts <- CORElearn::attrEval(Rating ~ ., globalcities.cat, 
                  InfGain.wts ReliefF.wts
City.Area          0.05898196 0.035227591
Metro.Area         0.10501737 0.020691027
City.Pop           0.10422894 0.050779784
Metro.Pop          0.13022307 0.006038070
Ann.Pop.Growth     0.04624721 0.005785050
GDPpc              0.12909419 0.038870952
Unemployment.Rate  0.08824268 0.069113116
Poverty.Rate       0.06966089 0.004854548
Major.Airports     0.03565266 0.015932195
Major.Ports        0.04175416 0.014967595
Higher.Ed.Insts    0.02868171 0.008031587
Life.Exp.M         0.09504723 0.119356004
Life.Exp.F         0.04937724 0.073946982
Hospitals          0.18080212 0.011268651
Museums            0.10732739 0.017808737
Air.Quality        0.05838298 0.057001715
Life.Expectancy    0.07730300 0.101021497

For classification tasks (categorical targets), more relevant features are those for which the scores are higher (this is not necessarily the case for regression tasks, however).

There are many filter methods [4], [6], including:

  • other correlation metrics (Kendall, Spearman, point-biserial correlation, etc.);

  • other entropy-based metrics (gain ratio, symmetric uncertainty, etc.);

  • other relief-type algorithms (ReliefF, Relieved-F, etc.);

  • \(\chi^2-\)test;

  • ANOVA;

  • Fisher Score;

  • Gini Index;

  • etc.

The list is by no means exhaustive, but it provides a fair idea of the various types of filter-based feature selection metrics. The following example illustrates how some of these metrics are used to extract relevant attributes.

15.3.2 Wrapper Methods

Wrapper methods offer a powerful way to address problem of variable selection. Wrapper methods evaluate the quality of subsets of features for predicting the target output under the selected predictive algorithm and select the optimal combination (for the given training set and algorithm).

In contrast to filter methods, wrapping methods are integrated directly into the classification or clustering process (see Figure 15.16 for an illustration of this process).

Feature selection process for wrapper methods in classification problems [@DSML_CS10].

Figure 15.16: Feature selection process for wrapper methods in classification problems [6].

Wrapper methods treats feature selection as a search problem in which different subsets of features are explored. This process can be computationally expensive as the size of the search space increases exponentially with the number of predictors; even for modest problems an exhaustive search can quickly become impractical.

In general, wrapper methods iterate over the following steps, until an “optimal” set of features is identified:

  • select a feature subset, and

  • evaluate the performance of the selected feature subset.

The search ends once some desired quality is reached (such as adjusted \(R^2\), accuracy, etc.). Various search methods are used to traverse the feature phase space and provide an approximate solution to the optimal feature subset problem, including: hill-climbing, best-first, and genetic algorithms.

Wrapper methods are not as efficient as filter methods and are not as robust against over-fitting. However, they are very effective at improving the model’s performance due to their attempt to minimize the error rate (which unfortunately can also lead to the introduction of implicit bias in the problem [6]).

15.3.3 Subset Selection Methods

Stepwise selection is a form of Occam’s Razor: at each step, a new feature is considered for inclusion or removal from the current features set based on some criterion (\(F-\)test, \(t-\)test, etc.). Greedy search methods such as backward elimination and forward selection have proven to be both quite robust against over-fitting and among the least computationally expensive wrapper methods.

Backward elimination begins with the full set of features and sequentially eliminates the least relevant ones until further removals increase the error rate of the predictive model above some utility threshold.

Forward selection works in reverse, beginning the search with an empty set of features and progressively adding relevant features to the ever growing set of predictors. In an ideal setting, model performance should be tested using cross-validation.

Stepwise selection methods are extremely common, but they have severe limitations (which are not usually addressed) [309], [310]:

  • the tests are biased, since they are all based on the same data;

  • the adjusted \(R^2\) only takes into account the number of features in the final fit, and not the degrees of freedom that have been used in the entire model;

  • if cross-validation is used, stepwise selection has to be repeated for each sub-model but that is not usually done, and

  • it represents a classic example of \(p\)-hacking.

Consequently, the use of stepwise methods is contraindicated in the machine learning context.

15.3.4 Regularization (Embedded) Methods

An interesting hybrid is provided by the least absolute shrinkage and selection operator (LASSO) and its variants, which are discussed in Section 12.2.4.

15.3.5 Supervised and Unsupervised Feature Selection

While feature selection methods are usually categorised as filter, wrapper, or embedded, they can also be categorised as supervised or unsupervised methods. Whether a feature selection method is supervised or not boils down to whether the labels of objects/instances are incorporated into the feature reduction process (or not).

The methods that have been described in this section were all supervised.

In unsupervised methods, feature selection is carried out based on the characteristics of the attributes, without any reference to labels or a target variable. In particular, for clustering problems (see Section 14), only unsupervised feature selection methods can be used [5].

Unsupervised feature selection methods include:

  • identifying ID-like predictors;

  • identifying constant (or nearly constant) predictors;

  • identifying predictors that are in a multicolinear relationship with other variables;

  • identifying clusters of predictors, etc.


C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
C. C. Aggarwal, Data Mining: The Textbook. Cham: Springer, 2015.
C. C. Aggarwal, Ed., Data Classification: Algorithms and Applications. CRC Press, 2015.
Wikipedia, Mutual information.”
Y. Sun and D. Wu, “A RELIEF based feature extraction algorithm,” Apr. 2008, pp. 188–195.
F. E. Harrell, Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer International Publishing, 2015.