## 8.6 Data Transformations

History is the transformation of tumultuous conquerors into silent footnotes. [P. Eldridge]

This **crucial** step is often neglected or omitted altogether. Various transformation methods are available, depending on the analysts’ needs and data types, including:

- **standardization** and **unit conversion**, which put the dataset’s variables on an equal footing – a requirement for basic comparison tasks and for more complicated problems of clustering and similarity matching;
- **normalization**, which attempts to force a variable into a normal distribution – an assumption which must be met in order to use a number of traditional analysis methods, such as ANOVA or regression analysis, and
- **smoothing methods**, which help remove unwanted noise from the data, but at a price – perhaps removing natural variance in the data.
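As a small illustration of the trade-off involved in smoothing, the following sketch computes a centred moving average in base R on simulated data (the window size `k` is an arbitrary choice):

```r
set.seed(0)
x <- seq(0, 4*pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)          # noisy signal

# centred moving average with an odd window size k;
# the first and last (k-1)/2 values are left as NA
k <- 11
y.smooth <- stats::filter(y, rep(1/k, k), sides = 2)

# smoothing reduces the noise -- but also removes some natural variance
var(as.numeric(y.smooth), na.rm = TRUE) < var(y)
```

Larger windows remove more noise but flatten more of the signal; the appropriate value depends on the data and the task at hand.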

Another type of data transformation is concerned with **dimensionality reduction**. There are many advantages to working with low-dimensional data [157]:

- **visualization methods** of all kinds are available to extract and present insights from such data;
- high-dimensional datasets are subject to the so-called **curse of dimensionality**, which asserts (among other things) that multi-dimensional spaces are vast: when the number of features in a model increases, the number of observations required to maintain predictive power also increases, but at a **substantially higher rate** (see Figure 8.10), and
- another consequence of the curse is that in high-dimensional sets, all observations are roughly **dissimilar** to one another – observations tend to be nearer the dataset’s boundaries than they are to one another.

Dimension reduction techniques such as **principal component analysis**, **independent component analysis**, and **factor analysis** (for numerical data), or **multiple correspondence analysis** (for categorical data) project multi-dimensional datasets onto low-dimensional but high information spaces (the so-called **Manifold Hypothesis**); feature selection techniques (including the popular family of **regularization methods**) pick an optimal subset of variables with which to accomplish tasks (according to some criteria).

Some information is necessarily lost in the process, but in many instances the drain can be kept under control and the gains made by working with smaller datasets can offset the losses of completeness [157]. We will have more to say on the topic at a later stage.
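As a preview of dimensionality reduction in practice, principal component analysis is available in base R via `prcomp()`; the snippet below is a minimal sketch using the built-in `iris` dataset as a stand-in:

```r
# dimensionality reduction via principal component analysis (base R)
data(iris)
X <- iris[, 1:4]                            # 4 numerical features

pca <- prcomp(X, center = TRUE, scale. = TRUE)   # standardize before projecting
summary(pca)                                # proportion of variance per component

Z <- pca$x[, 1:2]                           # project onto the first 2 components
```

For this dataset, the first two components retain most of the total variance, so the projected data `Z` can be plotted or modeled with little loss of information.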

### 8.6.1 Common Transformations

Models often require that certain data assumptions be met. For instance, ordinary least square regression assumes:

- that the response variable is a **linear combination** of the predictors;
- **constant** error variance, and
- **uncorrelated residuals**, which may or may not be statistically independent, etc.

In reality, it is rare that raw data meets all these requirements, but that does not necessarily mean that we need to abandon the model – an **invertible** sequence of data transformations may produce a derived data set which *does* meet the requirements, allowing the consultant to draw conclusions about the original data.

In the regression context, invertibility is guaranteed by **monotonic** transformations: identity, logarithmic, square root, inverse (all members of the power transformations), exponential, etc.
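The invertibility of these monotonic transformations is easy to verify numerically; the sketch below checks that they preserve the ordering of the observations and can be undone:

```r
y <- c(2, 7, 1, 9, 4)                       # positive observations

# monotonic transformations preserve the ordering of the observations
stopifnot(identical(order(y), order(log(y))),
          identical(order(y), order(sqrt(y))),
          identical(order(y), order(-1/y)))  # -1/y is increasing for y > 0

# ...and each can be inverted to recover the original data
all.equal(exp(log(y)), y)
```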

These transformations are illustrated below on a subset of the BUPA liver disease dataset [159].

There are rules of thumb and best practices to transform data, but analysts should not discount the importance of exploring the data visually before making a choice.

Transformations on the **predictors** \(X\) may be used to achieve the **linearity assumption**, but they usually come at a price – correlations are not preserved by such transformations, for instance (although that can occur with other transformations as well).

Transformations on the target \(Y\) can help with **non-normality** of residuals and **non-constant variance** of error terms.

Note that transformations can be applied to the target variable, to the predictors, or to both: as an example, if the linear relationship between two variables \(X\) and \(Y\) is expressed as \(Y=a+bX\), then a unit increase in \(X\) is associated with an average increase of \(b\) units in \(Y\).

But a better fit might be provided by one of \[\log Y = a+bX,\quad Y=a+b\log X,\quad \mbox{or}\quad \log Y = a+b\log X,\] for which:

- a unit increase in \(X\) is associated with an average \(100b\%\) increase in \(Y\);
- a \(1\%\) increase in \(X\) is associated with an average \(0.01b\) unit increase in \(Y\), and
- a \(1\%\) increase in \(X\) is associated with a \(b\%\) increase in \(Y\), respectively.
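These interpretations can be checked on simulated data; the sketch below (with arbitrary parameter choices) fits the first of the three models and recovers the slope:

```r
set.seed(1)
x <- runif(100, 0, 10)
y <- exp(1 + 0.05*x + rnorm(100, sd = 0.1))   # true log-linear relation, b = 0.05

# fitting log(Y) = a + b*X recovers b:
# a unit increase in x is associated with roughly a 100*0.05 = 5% increase in y
fit <- lm(log(y) ~ x)
coef(fit)["x"]
```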

### 8.6.2 Box-Cox Transformations

The choice of transformation is often as much of an art as it is a science.

There is a common framework, however, that provides the optimal transformation, in a sense. Consider the task of predicting the target \(Y\) with the help of the predictors \(X_j\), \(j=1,\ldots, p\). The usual model takes the form \[y_i=\sum_{j=1}^p\beta_jX_{j,i}+\varepsilon_i,\quad i=1,\ldots, n.\]

If the residuals are skewed, or their variance is not constant, or the trend itself does not appear to be linear, a power transformation might be preferable – but if so, which one? The **Box-Cox transformation** \(y_i\mapsto y'_i(\lambda)\), \(y_i>0\), is defined by \[y'_i(\lambda)=\begin{cases}(y_1 \cdots y_n)^{1/n}\ln y_i, & \text{if }\lambda=0 \\ \frac{y_i^{\lambda}-1}{\lambda}(y_1 \cdots y_n)^{\frac{1-\lambda}{n}}, & \text{if }\lambda\neq 0; \end{cases}\] variants allow for the inclusion of a shift parameter \(\alpha>0\), which extends the transformation to \(y_i>-\alpha.\)

The **suggested** choice of \(\lambda\) is the value that maximizes the log-likelihood \[\mathcal{L}=-\frac{n}{2}\log\left(\frac{2\pi\hat{\sigma}^2}{(y_1 \ldots y_n)^{2(\lambda-1)/n}}+1\right).\]

For instance, the following code shows the effect of the Box-Cox transformation on the linear fit of GAMMAGT against SGPT in the BUPA dataset.

```
library(kerndwd)   # provides the BUPA dataset
data(BUPA)

# linear regression for the untransformed model
model <- lm(BUPA$X[,5] ~ BUPA$X[,3])
plot(BUPA$X[,3], BUPA$X[,5],
     main="Scatterplot of a subset of the BUPA dataset",
     xlab="alamine aminotransferase (SGPT)",
     ylab="gamma-glutamyl transpeptidase (GAMMAGT)")
abline(a=model[[1]][1], b=model[[1]][2], col="red")

# q-q plot for the untransformed model
qqnorm(model$residuals)
qqline(model$residuals, col="blue")

# linear regression for the Box-Cox transformed model
library(MASS)
box.cox <- boxcox(BUPA$X[,5] ~ BUPA$X[,3])
(lambda <- box.cox$x[which.max(box.cox$y)])   # lambda maximizing the log-likelihood
expr <- bquote((GAMMAGT^.(lambda)-1)/.(lambda))
box.cox.Y <- (BUPA$X[,5]^lambda-1)/lambda
bc.model <- lm(box.cox.Y ~ BUPA$X[,3])
plot(BUPA$X[,3], box.cox.Y,
     main="Scatterplot of a subset of the BUPA dataset",
     xlab="alamine aminotransferase (SGPT)",
     ylab=expr)
abline(a=bc.model[[1]][1], b=bc.model[[1]][2], col="red")

# q-q plot for the Box-Cox transformed model
qqnorm(bc.model$residuals)
qqline(bc.model$residuals, col="blue")
```

`[1] -0.1818182`

There might be theoretical rationales which favour a particular choice of \(\lambda\) – these are not to be ignored. It is also important to produce a residual analysis, as the best Box-Cox choice does not necessarily meet all the least squares assumptions.

Finally, it is important to remember that the resulting parameters have the least squares property **only with respect to the transformed data points** (in other words, the inverse transformation has to be applied to the results before we can make interpretations about the original data).

In the BUPA example, the corresponding curve in the untransformed space is:

```
plot(BUPA$X[,3], BUPA$X[,5],
     main="Scatterplot of a subset of the BUPA dataset",
     xlab="alamine aminotransferase (SGPT)",
     ylab="gamma-glutamyl transpeptidase (GAMMAGT)")
# back-transform the Box-Cox fit: y = (lambda*y' + 1)^(1/lambda)
x.sorted <- sort(BUPA$X[,3])
df <- data.frame(x.sorted,
                 (lambda * (bc.model[[1]][1] + bc.model[[1]][2] * x.sorted) + 1)^(1/lambda))
abline(a=model[[1]][1], b=model[[1]][2], col="red")
points(df, col="blue", pch=1)
legend("bottomright", legend=c("Regular LS", "Box-Cox LS"),
       col=c("red", "blue"), lty=1:2, cex=0.8)
```

### 8.6.3 Scaling

Numeric variables may have different scales (weights and heights, for instance). Since the variance of a large-range variable is typically greater than that of a small-range variable, leaving the data **unscaled** may introduce biases, especially when using unsupervised methods (see Machine Learning 101).

It could also be the case that it is the relative positions (or rankings) which are of importance, in which case it may be important to look at relative distances between levels:

- **standardization** creates a variable with mean 0 and standard deviation 1: \[Y_i=\frac{X_i-\overline{X}}{s_X};\]
- **normalization** creates a variable in the range \([0,1]\): \[Y_i=\frac{X_i-\min\{X_k\}}{\max \{X_k\}- \min \{X_k\}}.\]

These are not the only options. Different schemes can lead to different outputs.
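The two schemes above can be implemented in a few lines of base R; the sketch below uses an arbitrary numeric vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# standardization: mean 0, standard deviation 1
z <- (x - mean(x)) / sd(x)          # equivalently: as.numeric(scale(x))

# normalization: range [0, 1]
u <- (x - min(x)) / (max(x) - min(x))

c(mean(z), sd(z), min(u), max(u))
```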

### 8.6.4 Discretizing

In order to reduce computational complexity, a numeric variable may need to be replaced with an **ordinal** variable (*height* values could be replaced by the qualitative “*short*”, “*average*”, and “*tall*”, for instance).

Of course, what these terms represent depends on the context; Canadian “short” and Bolivian “tall” may be fairly commensurate, to revisit the example at the start of the preceding section.

It is far from obvious how to determine the bins’ limits – **domain expertise** can help, but it could introduce unconscious bias into the analyses. In the absence of such expertise, limits can be set so that the bins either:

- each contain (roughly) the same **number of observations**;
- each have the same **width**, or
- maximize the performance of some modeling tool.

Again, various choices may lead to different outputs.
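The first two binning schemes can be implemented with base R’s `cut()`; the sketch below uses simulated height data and an arbitrary choice of three bins:

```r
set.seed(2)
height <- rnorm(300, mean = 170, sd = 10)

# equal-width bins
width.bins <- cut(height, breaks = 3,
                  labels = c("short", "average", "tall"))

# equal-frequency bins (limits set at the empirical quantiles)
freq.bins <- cut(height,
                 breaks = quantile(height, probs = c(0, 1/3, 2/3, 1)),
                 labels = c("short", "average", "tall"),
                 include.lowest = TRUE)

table(width.bins)   # counts differ across bins
table(freq.bins)    # counts are (roughly) equal
```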

### 8.6.5 Creating Variables

Finally, it is possible that new variables may need to be introduced (in contrast with dimensionality reduction). These new variables may arise:

- as **functional relationships** of some subset of available features (introducing powers of a feature, or principal components, say);
- because the modeling tool may require **independence of observations** or **independence of features** (in order to remove multicollinearity, for instance), or
- to simplify the analysis by looking at **aggregated summaries** (often used in text analysis).

There is no limit to the number of new variables that can be added to a dataset – but consultants should strive for **relevant additions**.
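Two of these mechanisms (functional relationships and aggregated summaries) can be sketched in base R, here using the built-in `mtcars` dataset as a stand-in:

```r
data(mtcars)
df <- mtcars

# functional relationship: a power of an existing feature
df$disp2 <- df$disp^2

# aggregated summary: mean mpg per number of cylinders, merged back in
agg <- aggregate(mpg ~ cyl, data = df, FUN = mean)
names(agg)[2] <- "mpg.by.cyl"
df <- merge(df, agg, by = "cyl")

head(df[, c("cyl", "mpg", "mpg.by.cyl", "disp2")])
```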

### References

*Data Science Report Series*, 2020.