## 15.3 Feature Selection

As seen in the previous section, dimension reduction methods can be used to learn low-dimensional manifolds for high-dimensional data, with the hope that the resulting loss in information content can be kept small. Unfortunately, this is not always feasible.

There is another non-technical, yet perhaps more problematic, issue with manifold learning techniques: the reduction often fails to provide an easily **interpretable** set of coordinates in the context of the original dataset.

For instance, in a dataset with the 4 features \[X_1=\text{Age},\ X_2=\text{Height},\ X_3=\text{Weight},\text{ and }X_4=\text{Gender }\in \{0,1\},\] say, it is straightforward to **justify** a data-driven decision based on the rule \(X_1= \text{Age}>25\), for example, but perhaps substantially harder to do so for a reduced rule such as \[Y_2=3(\text{Age}-\overline{\text{Age}})-(\text{Height}-\overline{\text{Height}})+4(\text{Weight}-\overline{\text{Weight}})+\text{Gender}>7,\] even if there is nothing wrong with the rule from a technical perspective.

Furthermore, datasets often contain **irrelevant** and/or **redundant** features; identifying and removing these variables is a common data processing task. The motivations for doing so are varied, but usually fall into one of two categories:

the modeling tools do not handle redundant variables well, due to

**variance inflation**or similar issues, andas an attempt by analysts to overcome the

**curse of dimensionality**or to avoid situations where the number of variables is larger than the number of observations.

In light of the comment above, the goal of **feature selection** is to remove (and not to transform or project) any attribute that adds noise and reduces the performance of a model, that is to say, to retain a subset of the most **relevant features**^{300}, which can help create simpler models, decrease a statistical learner’s training time, and reduce overfitting.

There are various feature selection methods, typically falling in one of three families – **filter** methods, **wrapper** methods, and **embedded** methods (most of the next two sections is inspired by [6]):

**filter methods**focus on the relevance of the features, applying a specific**ranking metric**to each feature, individually. The variables that do not meet a preset benchmark^{301}are then removed from consideration, yielding a subset of the most relevant features according to the selected ranking metric; different metrics, and different thresholds, might retain different relevant features;**wrapper methods**focus on the usefulness of each feature to the task of interest (usually classification or regression), but do not consider features individually; rather, they evaluate and compare the performance of different combinations of features in order to select the best-performing subset of features, and**embedded methods**are a combination of both, using implicit metrics to evaluate the performance of various subsets.

Feature selection methods can also be categorized as **unsupervised** or **supervised**:

**unsupervised methods**determine the importance of features using only their values (with potential feature interactions), while**supervised methods**evaluate each feature’s importance in relationship with the**target feature**.

Wrapper methods are typically supervised. Unsupervised filter methods search for noisy features and include the removal of constant variables, of ID-like variables (i.e. different on all observations), or features with low variability.

### 15.3.1 Filter Methods

Filter methods evaluate features without resorting to the use of a classification/regression algorithm. These methods can either be

**univariate**, where each feature is ranked independently, or**multivariate**, where features are considered jointly.

A filter criterion is chosen based on which metric suits the data or problem best.^{302} The selected criterion is used to assign a score to, and rank, the features which are then retained or removed in order to yield a **relevant** subset of features.

Features whose score lies above (or below, as the case may be) some pre-selected threshold \(\tau\) are retained (or removed); alternatively, features whose rank lies below (or above as the case may be) some pre-selected threshold \(\nu\) are retained (or removed). Such methods are advantageous in that they are computationally efficient. They also tend to be robust against overfitting as they do not incorporate the effects of the feature subset selection on classification/regression performance.

There are a number of commonly-used filter criteria, including the **Pearson correlation coefficient**, **information gain** (or mutual information), and **relief** [6]. Throughout, let \(Y\) be the target variable (assuming that there is one), and \(X_1, \ldots, X_p\) be the predictors.

#### Pearson Correlation Coefficient

The **Pearson correlation coefficient** [302] quantifies the linear relationship between two continuous variables. For a predictor \(X_i\), the Pearson correlation coefficient between \(X_i\) and \(Y\) is \[\rho_{i}=\frac{\text{Cov}(X_i,Y)}{\sigma_{X_i}\sigma_{Y}}.\]
Features for which \(|\rho_i|\) is large (near 1) are linearly correlated with \(Y\), those for which \(|\rho_i|\approx 0\) are not linearly (or anti-linearly) correlated with \(Y\) (which could mean that they are uncorrelated with \(Y\), or that the correlation is not linear or anti-linear). Only those features with (relatively) strong linear (or anti-linear) correlation are retained.

This correlation \(\rho_i\) is only defined if both the predictor \(X_i\) and the outcome \(Y\) are numerical; there are alternatives for categorical \(X_i\) and \(Y\), or mixed categorical-numerical \(X_i\) and \(Y\) [303], [304], [305].

In order to get a better handle on what filter feature selection looks like in practice, consider the *Global Cities Index* dataset [306], which ranks prominent cities around the globe on a general scale of “Alpha”, “Beta”, “Gamma”, and “Sufficiency” (1, 2, 3, 4, respectively).

This dataset contains geographical, population, and economics data for 68 ranked cities.

```
globalcities <- data.frame(read.csv("Data/globalcities.csv", stringsAsFactors = TRUE))
colnames(globalcities)[2:7] <- c("City.Area","Metro.Area","City.Pop","Metro.Pop", "Ann.Pop.Growth", "GDPpc")
colnames(globalcities)[12:17] <- c("Higher.Ed.Insts","Life.Exp.M","Life.Exp.F","Hospitals", "Museums", "Air.Quality")
str(globalcities)
```

```
'data.frame': 68 obs. of 18 variables:
$ Rating : int 1 3 2 1 1 1 2 1 2 1 ...
$ City.Area : num 165 30.7 38.9 1569 102.6 ...
$ Metro.Area : num 807 25437 381 7762 3236 ...
$ City.Pop : num 0.76 3.54 0.66 5.72 1.62 ...
$ Metro.Pop : num 1.4 4.77 4.01 6.5 3.23 ...
$ Ann.Pop.Growth : num 0.01 0.26 0 0.03 0.01 0.04 0 0.01 0.01 0.01 ...
$ GDPpc : num 46 21.2 30.5 23.4 36.3 20.3 33.3 15.9 69.3 45.6 ...
$ Unemployment.Rate: num 0.05 0.12 0.16 0.02 0.15 0.01 0.16 0.1 0.07 0.16 ...
$ Poverty.Rate : num 0.18 0.2 0.2 0 0.2 0.01 0.22 0.22 0.17 0.26 ...
$ Major.Airports : int 1 1 1 2 1 1 2 1 1 2 ...
$ Major.Ports : int 1 0 1 1 1 0 2 0 1 1 ...
$ Higher.Ed.Insts : int 23 10 21 37 8 89 30 19 35 25 ...
$ Life.Exp.M : num 76.3 75.3 78 69 79 79 82 74.6 74.8 77 ...
$ Life.Exp.F : num 80.8 80.8 83.7 74 85.2 83 88 79.7 81.1 82.6 ...
$ Hospitals : int 7 7 23 173 45 551 79 22 12 43 ...
$ Museums : int 68 36 47 27 69 156 170 76 30 25 ...
$ Air.Quality : int 24 46 41 54 35 121 26 77 17 28 ...
$ Life.Expectancy : num 78.5 78 80.8 71.5 82.1 ...
```

Rating | City.Area | Metro.Area | City.Pop | Metro.Pop | Ann.Pop.Growth | |
---|---|---|---|---|---|---|

Min. :1.000 | Min. : 15.9 | Min. : 118 | Min. : 0.1000 | Min. : 0.460 | Min. :-0.02000 | |

1st Qu.:1.000 | 1st Qu.: 123.7 | 1st Qu.: 1616 | 1st Qu.: 0.9675 | 1st Qu.: 3.127 | 1st Qu.: 0.01000 | |

Median :1.000 | Median : 410.0 | Median : 3598 | Median : 2.9100 | Median : 6.485 | Median : 0.01000 | |

Mean :1.441 | Mean : 944.4 | Mean : 6717 | Mean : 4.8276 | Mean : 8.657 | Mean : 0.02265 | |

3rd Qu.:2.000 | 3rd Qu.:1334.7 | 3rd Qu.: 7808 | 3rd Qu.: 7.7425 | 3rd Qu.:11.660 | 3rd Qu.: 0.02000 | |

Max. :4.000 | Max. :7304.0 | Max. :31815 | Max. :23.0200 | Max. :34.990 | Max. : 0.26000 |

GDPpc | Unemployment.Rate | Poverty.Rate | Major.Airports | Major.Ports | Higher.Ed.Insts | |
---|---|---|---|---|---|---|

Min. : 0.50 | Min. :0.00000 | Min. : 0.0000 | Min. :0.000 | Min. : 0.00 | Min. : 2.00 | |

1st Qu.:21.05 | 1st Qu.:0.04000 | 1st Qu.: 0.0675 | 1st Qu.:1.000 | 1st Qu.: 1.00 | 1st Qu.: 14.00 | |

Median :39.05 | Median :0.07000 | Median : 0.1500 | Median :1.000 | Median : 1.00 | Median : 25.50 | |

Mean :35.01 | Mean :0.08603 | Mean : 0.7519 | Mean :1.618 | Mean : 1.25 | Mean : 42.24 | |

3rd Qu.:47.85 | 3rd Qu.:0.11000 | 3rd Qu.: 0.2025 | 3rd Qu.:2.000 | 3rd Qu.: 1.00 | 3rd Qu.: 52.75 | |

Max. :72.93 | Max. :0.37000 | Max. :17.3500 | Max. :5.000 | Max. :17.00 | Max. :264.00 |

Life.Exp.M | Life.Exp.F | Hospitals | Museums | Air.Quality | Life.Expectancy | |
---|---|---|---|---|---|---|

Min. :53.00 | Min. :55.00 | Min. : 2.00 | Min. : 1.00 | Min. : 12.00 | Min. :55.00 | |

1st Qu.:72.38 | 1st Qu.:76.08 | 1st Qu.: 18.75 | 1st Qu.: 18.00 | 1st Qu.: 23.00 | 1st Qu.:73.93 | |

Median :76.20 | Median :81.05 | Median : 33.50 | Median : 50.00 | Median : 35.00 | Median :78.78 | |

Mean :74.12 | Mean :78.78 | Mean : 86.57 | Mean : 64.51 | Mean : 51.62 | Mean :76.45 | |

3rd Qu.:79.00 | 3rd Qu.:83.62 | 3rd Qu.: 84.50 | 3rd Qu.: 79.50 | 3rd Qu.: 66.25 | 3rd Qu.:81.04 | |

Max. :84.90 | Max. :88.00 | Max. :658.00 | Max. :350.00 | Max. :198.00 | Max. :85.00 |

The R package `FSelector`

contains feature selection tools, including various filter methods (such as chi-squared score, consistency, various entropy-based filters, etc.). Using its filtering functions, we extract the most relevant features to the ranking of a global city (we treat the `Rating`

variable as a numerical response: is this justifiable?).

For instance, if we retain the 5 top predictors for **linear correlation** (Pearson’s correlation coefficient) with the response `Rating`

, we obtain:

```
(lincor <- FSelector::linear.correlation(formula = `Rating` ~ .,
data = globalcities))
(subset_lincor <- FSelector::cutoff.k(lincor, k = 5))
```

```
attr_importance
City.Area 0.0007479646
Metro.Area 0.0055023564
City.Pop 0.1196632421
Metro.Pop 0.2030923952
Ann.Pop.Growth 0.1935336738
GDPpc 0.2090866065
Unemployment.Rate 0.2173999333
Poverty.Rate 0.0536585758
Major.Airports 0.2263265771
Major.Ports 0.0563507487
Higher.Ed.Insts 0.0547453393
Life.Exp.M 0.1302972404
Life.Exp.F 0.1412093767
Hospitals 0.1195832079
Museums 0.1553072283
Air.Quality 0.1382099362
Life.Expectancy 0.1380603739
[1] "Major.Airports" "Unemployment.Rate" "GDPpc"
[4] "Metro.Pop" "Ann.Pop.Growth"
```

According to the top-5 linear correlation feature selection method, the best features that relate to a city’s global ranking are the number of major airports it has, its unemployment rate, its GDP per capita, its metropolitan population, and its annual population growth.^{303}

#### Mutual Information

**Information gain** is a popular **entropy-based** method of feature selection that measures the amount of dependence between features by quantifying the amount of mutual information between them. In general, this quantifies the **amount of information** obtained about a predictor \(X_i\) by observing the target feature \(Y\).

Mutual information can be expressed as \[\text{IG}(X_i;Y) = H(X_i) - H(X_i\mid Y),\] where \(H(X_i)\) is the **marginal entropy** of \(X_i\) and \(H(X_i\mid Y)\) is the **conditional entropy** of \(X_i\) given \(Y\) [307], an \[H(X_i) = \text{E}_{X_i}[-\log p(X_i)], \quad H(X_i\mid Y) = \text{E}_{(X_i,Y)}[-\log p(X_i\mid Y)],\] where \(p(X_i)\) and \(p(X_i\mid Y)\) are the probability density functions of the random variables \(X_i\) and \(X_i\mid Y\), respectively.

How is \(\text{IG}\) interpreted? Consider the following example: let \(Y\) represent the salary of an individual (continuous), \(X_1\) their hair colour (categorical), \(X_2\) their age (continuous), \(X_3\) their height (continuous), and \(X_4\) their self-reported gender (categorical). A sample of \(n=2144\) individuals is found in `demo1.csv`

.

```
salary <- data.frame(read.csv("Data/demo1.csv", stringsAsFactors = TRUE))
colnames(salary)[1] <- c("Hair")
```

Some summary statistics are shown below:

In a general population, one would expect that the distribution of salaries, among others, is likely to be fairly haphazard, and it might be hard to explain why it has the shape that it does, specifically.

```
par(mfrow=c(3,2))
plot1 <- ggplot2::ggplot(salary, ggplot2::aes(x=Hair)) +
ggplot2::geom_bar(color='red', fill='white') +
ggplot2::theme_bw()
plot2 <- ggplot2::ggplot(salary, ggplot2::aes(x=Gender)) +
ggplot2::geom_bar(color='red', fill='white') +
ggplot2::theme_bw()
plot3 <- ggplot2::ggplot(salary, ggplot2::aes(x=Age)) +
ggplot2::geom_histogram(ggplot2::aes(y=..density..),
color='red', fill='white', bins=10) +
ggplot2::geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
ggplot2::theme_bw()
plot4 <- ggplot2::ggplot(salary, ggplot2::aes(x=Height)) +
ggplot2::geom_histogram(ggplot2::aes(y=..density..),
color='red', fill='white', bins=10) +
ggplot2::geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
ggplot2::theme_bw()
plot5 <- ggplot2::ggplot(salary, ggplot2::aes(x=Salary)) +
ggplot2::geom_histogram(ggplot2::aes(y=..density..),
color='red', fill='white', bins=10) +
ggplot2::geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
ggplot2::theme_bw()
gridExtra::grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=2)
```

Perhaps it could be explained by knowing the relationship between the salary and the other variables? It is this idea that forms the basis of mutual information feature selection.

Applying the definition, one sees that \[\begin{aligned} H(X_1) &=-\sum_{\text{colour}}p(\text{colour})\log p(\text{colour}) \\ H(X_2) &=-\int p(\text{age}) \log p(\text{age}) \, d\text{age} \\ H(X_3) &=-\int p(\text{height}) \log p(\text{height}) \, d\text{height} \\ H(X_4) &=-\sum_{\text{gender}}p(\text{gender})\log p(\text{gender})\\ H(X_1\mid Y) &=-\int p(Y)\left\{\sum_{\text{colour}}p(\text{colour}\mid Y)\log p(\text{colour}\mid Y)\right\}dY \\ H(X_2\mid Y) &=-\iint p(Y) p(\text{age}\mid Y) \log p(\text{age}\mid Y) \, d\text{age}\, dY \\ H(X_3\mid Y) &=-\iint p(Y) p(\text{ht}\mid Y) \log p(\text{ht}\mid Y) \, d\text{ht}\, dY \\ H(X_4\mid Y) &=-\int p(Y)\left\{\sum_{\text{gender}}p(\text{gender}\mid Y)\log p(\text{gender}\mid Y)\right\}\, dY\end{aligned}\]

If the theoretical distributions are known, the entropy integrals can be computed directly (or approximated using standard numerical integration methods).

Gender and hair colour can be fairly easily be modeled using multinomial distributions, but there is more uncertainty related to the numerical variables. A potential approach is to recode the continuous variables age, height, and salary as **decile variables** \(a_d\), \(h_d\), and \(Y_d\) taking values \(\{1, \ldots, 10\}\) according to which decile of the original variable the observation falls (see
Figure 15.15 for the decile breakdown).

The integrals can then be replaced by sums: \[\begin{aligned} H(X_1) &=-\sum_{\text{colour}}p(\text{colour})\log p(\text{colour}) \\ H(X_2) &\approx -\sum_{k=1}^{10} p(\text{a}_d=k) \log p(\text{a}_d=k) \\ H(X_3) &\approx -\sum_{k=1}^{10} p(\text{ht}_d=k) \log p(\text{ht}_d=k) \\ H(X_4) &=-\sum_{\text{gender}} p(\text{gender})\log p(\text{gender})\\ H(X_1\mid Y) &\approx -\sum_{j=1}^{10} p(Y_d=j)\!\!\!\sum_{c\in\text{colour}}\!\!\! p(c\mid Y_d=j)\log p(c\mid Y_d=j) \\ H(X_2\mid Y) & \approx -\sum_{j=1}^{10} p(Y_d=j)\sum_{k=1}^{10}p(a_d=k\mid Y_d=j) \log p(a_d=k\mid Y_d=j) \\ H(X_3\mid Y) &\approx -\sum_{j=1}^{10} p(Y_d=j)\sum_{k=1}^{10}p(h_d=k\mid Y_d=j) \log p(h_d=k\mid Y_d=j) \\ H(X_4\mid Y) &\approx -\sum_{j=1}^{10} p(Y_d=k)\!\!\!\sum_{g\in\text{gender}}\!\!\! p(g\mid Y_d=j)\log p(g\mid Y_d=j)\end{aligned}\]

The results are shown below (using base 10 logarithms, and rounded out to the nearest hundredth):

X | H(X) | H(X|Y) | IG(X;Y) | Ratio |
---|---|---|---|---|

Hair |
0.24 | 0.24 | 0.00 | 0.00 |

Age |
1.00 | 0.74 | 0.26 | 0.26 |

Height |
1.00 | 0.96 | 0.03 | 0.03 |

Gender |
0.30 | 0.22 | 0.08 | 0.26 |

The percentage decrease in entropy after having observed \(Y\) is shown in the column “Ratio.” Raw IG numbers would seem to suggest that Gender has a small link to Salary; the Ratio numbers suggest that this could be due to the way the Age and Height levels have been categorized (as deciles).

#### Relief

**Relief** scores (numerical) features based on the identification of feature value differences between nearest-neighbour instance pairs. If there is a feature value difference in a neighbouring instance pair **of the same class**, the score of the feature decreases; on the other hand, if there exists a feature value difference in a neighbouring instance pair with **different class values**, the feature score increases.

More specifically, let \[D = {\{(\mathbf{x}_i, y_i)\}}_{i=1}^n \subset \mathbb{R}^p \times \{\pm1\}\] be a dataset where \(\mathbf{x}_i\) is the \(i\)-th data sample and \(y_n\) is its corresponding class label. For each feature \(j\) and observation \(i\), two values are selected: the \(\textbf{near hit}\) \(H(x_{i,j})\) is the value of \(X_j\) which is nearest to \(x_{i,j}\) among all instances in the same class as \(\mathbf{x}_i\), while the \(\textbf{near miss}\) \(M(x_{i,j})\) is the value of \(X_j\) which is nearest to \(x_{i,j}\) among all instances in the opposite class of \(\mathbf{x}_i\).

The **Relief score** of the \(j^{\text{\footnotesize th}}\) feature is \[S^d_j = \frac{1}{n}\sum_{i=1}^n\left\{ d(x_{i,j},M(x_{i,j}))-d(x_{i,j},H(x_{i,j}))\right\},\] for some pre-selected distance \(d:\mathbb{R}\times\mathbb{R}\to \mathbb{R}^+_0\).

A feature for which near-hits tend to be nearer to their instances than near-misses are (i.e., for which \[d(x_{i,j},M(x_{i,j})) > d(x_{i,j},H(x_{i,j})),\] on average) will yield larger Relief scores than those for which the opposite is true. Features are deemed **relevant** when their relief score is greater than some threshold \(\tau\).

There are a variety of Relief-type measurements to accommodate potential feature interactions or multi-class problems^{304} (ReliefF), but in general Relief is noise-tolerant and robust to interactions of attributes; its effectiveness decreases for small training sets, however [308].

The Relief algorithm is also implemented in the `R`

package `CORElearn`

, as are numerous other methods:

```
CORElearn::infoCore(what="attrEval") # Classification
CORElearn::infoCore(what="attrEvalReg") # Regression
```

```
[1] "ReliefFequalK" "ReliefFexpRank" "ReliefFbestK"
[4] "Relief" "InfGain" "GainRatio"
[7] "MDL" "Gini" "MyopicReliefF"
[10] "Accuracy" "ReliefFmerit" "ReliefFdistance"
[13] "ReliefFsqrDistance" "DKM" "ReliefFexpC"
[16] "ReliefFavgC" "ReliefFpe" "ReliefFpa"
[19] "ReliefFsmp" "GainRatioCost" "DKMcost"
[22] "ReliefKukar" "MDLsmp" "ImpurityEuclid"
[25] "ImpurityHellinger" "UniformDKM" "UniformGini"
[28] "UniformInf" "UniformAccuracy" "EqualDKM"
[31] "EqualGini" "EqualInf" "EqualHellinger"
[34] "DistHellinger" "DistAUC" "DistAngle"
[37] "DistEuclid"
[1] "RReliefFequalK" "RReliefFexpRank" "RReliefFbestK"
[4] "RReliefFwithMSE" "MSEofMean" "MSEofModel"
[7] "MAEofModel" "RReliefFdistance" "RReliefFsqrDistance"
```

We start by declaring the target variable `Rating`

as a categorical variable. We also create a duplicate variable `Rating2`

, to determine the various scores directionality.

Now, let’s evaluate the predictors relevance, using `InfGain`

and `ReliefFpe`

, say:

```
InfGain.wts <- CORElearn::attrEval(Rating ~ ., globalcities.cat,
estimator="InfGain")
ReliefF.wts <- CORElearn::attrEval(Rating ~ ., globalcities.cat,
estimator="ReliefFpe")
data.frame(InfGain.wts,ReliefF.wts)
```

```
InfGain.wts ReliefF.wts
City.Area 0.05898196 0.035227591
Metro.Area 0.10501737 0.020691027
City.Pop 0.10422894 0.050779784
Metro.Pop 0.13022307 0.006038070
Ann.Pop.Growth 0.04624721 0.005785050
GDPpc 0.12909419 0.038870952
Unemployment.Rate 0.08824268 0.069113116
Poverty.Rate 0.06966089 0.004854548
Major.Airports 0.03565266 0.015932195
Major.Ports 0.04175416 0.014967595
Higher.Ed.Insts 0.02868171 0.008031587
Life.Exp.M 0.09504723 0.119356004
Life.Exp.F 0.04937724 0.073946982
Hospitals 0.18080212 0.011268651
Museums 0.10732739 0.017808737
Air.Quality 0.05838298 0.057001715
Life.Expectancy 0.07730300 0.101021497
```

For classification tasks (categorical targets), more relevant features are those for which the scores are higher (this is not necessarily the case for regression tasks, however).

There are many filter methods [4], [6], including:

other correlation metrics (Kendall, Spearman, point-biserial correlation, etc.);

other entropy-based metrics (gain ratio, symmetric uncertainty, etc.);

other relief-type algorithms (ReliefF, Relieved-F, etc.);

\(\chi^2-\)test;

ANOVA;

Fisher Score;

Gini Index;

etc.

The list is by no means exhaustive, but it provides a fair idea of the various types of filter-based feature selection metrics. The following example illustrates how some of these metrics are used to extract relevant attributes.

### 15.3.2 Wrapper Methods

**Wrapper methods** offer a powerful way to address problem of variable selection. Wrapper methods evaluate the quality of subsets of features for predicting the target output under the selected predictive algorithm and select the optimal combination (for the given training set and algorithm).

In contrast to filter methods, wrapping methods are integrated directly into the classification or clustering process (see Figure 15.16 for an illustration of this process).

Wrapper methods treats feature selection as a **search problem** in which different subsets of features are explored. This process can be computationally expensive as the size of the search space increases exponentially with the number of predictors; even for modest problems an exhaustive search can quickly become impractical.

In general, wrapper methods iterate over the following steps, until an “optimal” set of features is identified:

select a feature subset, and

evaluate the performance of the selected feature subset.

The search ends once some desired quality is reached (such as adjusted \(R^2\), accuracy, etc.). Various search methods are used to traverse the feature phase space and provide an approximate solution to the **optimal feature subset problem**, including: hill-climbing, best-first, and genetic algorithms.

Wrapper methods are not as efficient as filter methods and are not as robust against over-fitting. However, they are very effective at improving the model’s performance due to their attempt to minimize the error rate (which unfortunately can also lead to the introduction of implicit bias in the problem [6]).

### 15.3.3 Subset Selection Methods

**Stepwise selection** is a form of *Occam’s Razor*: at each step, a new feature is considered for inclusion or removal from the current features set based on some criterion (\(F-\)test, \(t-\)test, etc.). Greedy search methods such as backward elimination and forward selection have proven to be both quite robust against over-fitting and among the least computationally expensive wrapper methods.

**Backward elimination** begins with the full set of features and sequentially eliminates the least relevant ones until further removals increase the error rate of the predictive model above some utility threshold.

**Forward selection** works in reverse, beginning the search with an empty set of features and progressively adding relevant features to the ever growing set of predictors. In an ideal setting, model performance should be tested using **cross-validation**.

Stepwise selection methods are extremely common, but they have severe limitations (which are not usually addressed) [309], [310]:

the tests are biased, since they are all based on the same data;

the adjusted \(R^2\) only takes into account the number of features in the final fit, and not the degrees of freedom that have been used in the entire model;

if cross-validation is used, stepwise selection has to be repeated for each sub-model but that is not usually done, and

it represents a classic example of \(p\)-hacking.

Consequently, the use of stepwise methods is contraindicated in the machine learning context.

### 15.3.4 Regularization (Embedded) Methods

An interesting hybrid is provided by the **least absolute shrinkage and selection operator** (LASSO) and its variants, which are discussed in Section 12.2.4.

### 15.3.5 Supervised and Unsupervised Feature Selection

While feature selection methods are usually categorised as filter, wrapper, or embedded, they can also be categorised as **supervised** or **unsupervised** methods. Whether a feature selection method is supervised or not boils down to whether the labels of objects/instances are incorporated into the feature reduction process (or not).

The methods that have been described in this section were all supervised.

In unsupervised methods, feature selection is carried out based on the characteristics of the attributes, without any reference to labels or a target variable. In particular, for **clustering problems** (see Section 14), only unsupervised feature selection methods can be used [5].

Unsupervised feature selection methods include:

identifying ID-like predictors;

identifying constant (or nearly constant) predictors;

identifying predictors that are in a multicolinear relationship with other variables;

identifying clusters of predictors, etc.

### References

*Data Clustering: Algorithms and Applications*. CRC Press, 2014.

*Data Mining: The Textbook*. Cham: Springer, 2015.

*Data Classification: Algorithms and Applications*. CRC Press, 2015.

*Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis*. Springer International Publishing, 2015.

*towardsdatascience.com*, 2018.