8.6 Data Transformations

History is the transformation of tumultuous conquerors into silent footnotes. [P. Eldridge]

This crucial step is often neglected or omitted altogether. Various transformation methods are available, depending on the analysts’ needs and data types, including:

• standardization and unit conversion, which put the dataset’s variables on an equal footing – a requirement for basic comparison tasks and more complicated problems of clustering and similarity matching;

• normalization, which attempts to force a variable into a normal distribution – an assumption that must be met in order to use a number of traditional analysis methods, such as ANOVA or regression analysis, and

• smoothing methods, which help remove unwanted noise from the data, but at a price – perhaps removing natural variance in the data.
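The smoothing trade-off can be illustrated with a short sketch (in Python with numpy, for convenience; the signal, noise level, and window width are all made-up choices): a moving average removes much of the noise, but it also attenuates genuine peaks and troughs.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2*np.pi, 200)
signal = np.sin(t)                                  # the "true" trend
noisy = signal + rng.normal(scale=0.5, size=t.size)

# centered moving average: averaging over a window removes much of the
# noise, but also flattens genuine peaks and troughs (and the zero
# padding of mode="same" distorts the endpoints)
w = 11
smoothed = np.convolve(noisy, np.ones(w) / w, mode="same")

print(noisy.var() > smoothed.var())   # True: variance is removed, good and bad
```

Larger windows remove more noise, but at the cost of more of the data's natural variance.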

Another type of data transformation is concerned with dimensionality reduction. There are many advantages to working with low-dimensional data [157]:

• visualization methods of all kinds are available to extract and present insights out of such data;

• high-dimensional datasets are subject to the so-called curse of dimensionality, which asserts (among other things) that multi-dimensional spaces are vast, and when the number of features in a model increases, the number of observations required to maintain predictive power also increases, but at a substantially higher rate (see Figure 8.10), and

• another consequence of the curse is that in high-dimensional sets, all observations are roughly equally dissimilar to one another – observations tend to be nearer the dataset’s boundaries than they are to one another.

Dimension reduction techniques such as principal component analysis, independent component analysis, and factor analysis (for numerical data), or multiple correspondence analysis (for categorical data) project multi-dimensional datasets onto low-dimensional but high information spaces (the so-called Manifold Hypothesis); feature selection techniques (including the popular family of regularization methods) pick an optimal subset of variables with which to accomplish tasks (according to some criteria).

Some information is necessarily lost in the process, but in many instances the drain can be kept under control and the gains made by working with smaller datasets can offset the losses of completeness [157]. We will have more to say on the topic at a later stage.
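As a rough sketch of how such a projection works, principal component analysis can be computed directly from the singular value decomposition of the centered data (shown here in Python with numpy, for illustration; the toy dataset and its dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
# toy data: 3 features, but the third is (nearly) a linear combination of
# the first two, so the point cloud is essentially 2-dimensional
X = rng.normal(size=(100, 2))
X = np.column_stack([X, X @ [1.0, -0.5] + rng.normal(scale=0.01, size=100)])

Xc = X - X.mean(axis=0)                 # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_explained = S**2 / np.sum(S**2)

scores = Xc @ Vt[:2].T                  # projection onto the first two PCs
print(var_explained)                    # first two PCs carry nearly all variance
```

Dropping the third component loses almost no information here, which is the dimension-reduction gamble in miniature: the projected dataset is smaller, but nearly as informative.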

8.6.1 Common Transformations

Models often require that certain data assumptions be met. For instance, ordinary least squares regression assumes:

• that the response variable is a linear combination of the predictors;

• constant error variance;

• uncorrelated residuals, which may or may not be statistically independent,

• etc.

In reality, it is rare that raw data meets all these requirements, but that does not necessarily mean that we need to abandon the model – an invertible sequence of data transformations may produce a derived data set which does meet the requirements, allowing the consultant to draw conclusions about the original data.

In the regression context, invertibility is guaranteed by monotonic transformations: identity, logarithmic, square root, inverse (all members of the power transformations), exponential, etc.

These transformations are illustrated below on a subset of the BUPA liver disease dataset [159].

There are rules of thumb and best practices to transform data, but analysts should not discount the importance of exploring the data visually before making a choice.

Transformations on the predictors $$X$$ may be used to achieve the linearity assumption, but they usually come at a price – correlations, for instance, are not preserved by such transformations (although other transformations can also have that effect).

Transformations on the target $$Y$$ can help with non-normality of residuals and non-constant variance of error terms.

Note that transformations can be applied both to the target variable and to the predictors: as an example, if the linear relationship between two variables $$X$$ and $$Y$$ is expressed as $$Y=a+bX$$, then a unit increase in $$X$$ is associated with an average increase of $$b$$ units in $$Y$$.

But a better fit might be provided by either of $\log Y = a+bX,\quad Y=a+b\log X,\quad \mbox{or}\quad \log Y = a+b\log X,$ for which:

• a unit increase in $$X$$ is associated with an average increase of roughly $$100b\%$$ in $$Y$$ (for small $$|b|$$);

• a $$1\%$$ increase in $$X$$ is associated with an average $$0.01b$$ unit increase in $$Y$$, and

• a $$1\%$$ increase in $$X$$ is associated with a $$b\%$$ increase in $$Y$$, respectively.
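These rules of thumb can be verified numerically; the snippet below (in Python; the coefficients $$a$$ and $$b$$ are arbitrary choices) compares the exact change in $$Y$$ with the stated approximation for each of the three models:

```python
import numpy as np

a, b = 1.0, 0.3   # arbitrary coefficients

# log Y = a + bX: a unit increase in X multiplies Y by e^b,
# roughly a 100*b % increase when b is small
r1 = np.exp(a + b*3) / np.exp(a + b*2)
print(r1, np.exp(b))             # both equal e^0.3, about a 35% increase

# Y = a + b log X: a 1% increase in X adds b*log(1.01) ~ 0.01*b units to Y
delta = (a + b*np.log(101.0)) - (a + b*np.log(100.0))
print(delta, 0.01*b)             # about 0.003 in both cases

# log Y = a + b log X: a 1% increase in X multiplies Y by 1.01^b,
# roughly a b% increase
r2 = np.exp(a + b*np.log(101.0)) / np.exp(a + b*np.log(100.0))
print(r2, 1.01**b)               # both about 1.003
```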

8.6.2 Box-Cox Transformations

The choice of transformation is often as much of an art as it is a science.

There is a common framework, however, that provides the optimal transformation, in a sense. Consider the task of predicting the target $$Y$$ with the help of the predictors $$X_j$$, $$j=1,\ldots, p$$. The usual model takes the form $y_i=\sum_{j=1}^p\beta_jX_{j,i}+\varepsilon_i,\quad i=1,\ldots, n.$

If the residuals are skewed, or their variance is not constant, or the trend itself does not appear to be linear, a power transformation might be preferable, but if so, which one? The Box-Cox transformation $$y_i\mapsto y'_i(\lambda)$$, $$y_i>0$$ is defined by $y'_i(\lambda)=\begin{cases}(y_1 \ldots y_n)^{1/n}\ln y_i, &\text{if }\lambda=0 \\ \frac{y_i^{\lambda}-1}{\lambda}(y_1 \ldots y_n)^{\frac{1-\lambda}{n}}, &\text{if }\lambda\neq 0 \end{cases};$ variants allow for the inclusion of a shift parameter $$\alpha>0$$, which extends the transformation to $$y_i>-\alpha.$$

The suggested choice of $$\lambda$$ is the value that maximizes the log-likelihood $\mathcal{L}=-\frac{n}{2}\left[\log\left(\frac{2\pi\hat{\sigma}^2}{(y_1 \ldots y_n)^{2(\lambda-1)/n}}\right)+1\right],$ where $$\hat{\sigma}^2$$ is the maximum likelihood estimate of the residual variance.
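A minimal sketch of this maximization, assuming no predictors for simplicity (in the regression setting, $$\hat{\sigma}^2$$ would be the residual variance instead); the simulated lognormal sample and the grid search are illustrative choices, written in Python with numpy:

```python
import numpy as np

def boxcox_loglik(y, lam):
    """Profile log-likelihood of lambda (additive constants included)."""
    n = len(y)
    log_gm = np.log(y).mean()   # log of the geometric mean (y_1...y_n)^(1/n)
    yp = np.log(y) if lam == 0 else (y**lam - 1) / lam
    sigma2 = yp.var()           # maximum likelihood variance estimate
    return -n/2 * (np.log(2*np.pi*sigma2) - 2*(lam - 1)*log_gm + 1)

rng = np.random.default_rng(1)
y = np.exp(rng.normal(size=500))   # lognormal sample: lambda = 0 is optimal

grid = np.linspace(-2, 2, 401)
lam_hat = grid[np.argmax([boxcox_loglik(y, lam) for lam in grid])]
print(lam_hat)                     # close to 0 for lognormal data
```

For a lognormal sample, the log transformation ($$\lambda=0$$) should (nearly) maximize the likelihood, and the grid search recovers a value close to it.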

For instance, the following code shows the effect of the Box-Cox transformation on the linear fit of GAMMAGT against SGPT in the BUPA dataset.

library(kerndwd)
data(BUPA)

# linear regression for untransformed model
model <- lm(BUPA$X[,5] ~ BUPA$X[,3])

plot(BUPA$X[,3],BUPA$X[,5], main="Scatterplot of a subset of the BUPA dataset",
xlab="alamine aminotransferase (SGPT)",
ylab="gamma-glutamyl transpeptidase (GAMMAGT)")
abline(a=model[[1]][1], b=model[[1]][2], col="red")

# q-q plot for untransformed model
qqnorm(model$residuals)
qqline(model$residuals, col="blue")

# linear regression for Box-Cox transformed model
library(MASS)
box.cox <- boxcox(BUPA$X[,5] ~ BUPA$X[,3])
(lambda <- box.cox$x[which.max(box.cox$y)])

[1] -0.1818182

expr <- vector("expression", 1)
expr <- bquote((GAMMAGT^.(lambda)-1)/.(lambda))

box.cox.Y <- (BUPA$X[,5]^lambda-1)/lambda
bc.model <- lm(box.cox.Y ~ BUPA$X[,3])

plot(BUPA$X[,3], box.cox.Y, main="Scatterplot of a subset of the BUPA dataset",
xlab="alamine aminotransferase (SGPT)",
ylab=expr)
abline(a=bc.model[[1]][1], b=bc.model[[1]][2], col="red")

# q-q plot for Box-Cox transformed model
qqnorm(bc.model$residuals)
qqline(bc.model$residuals, col="blue")

There might be theoretical rationales which favour a particular choice of $$\lambda$$ – these are not to be ignored. It is also important to produce a residual analysis, as the best Box-Cox choice does not necessarily meet all the least squares assumptions. Finally, it is important to remember that the resulting parameters have the least squares property only with respect to the transformed data points (in other words, the inverse transformation has to be applied to the results before we can make interpretations about the original data). In the BUPA example, the corresponding curve in the untransformed space is:

plot(BUPA$X[,3], BUPA$X[,5], main="Scatterplot of a subset of the BUPA dataset",
xlab="alamine aminotransferase (SGPT)",
ylab="gamma-glutamyl transpeptidase (GAMMAGT)")
df <- data.frame(sort(BUPA$X[,3]),
(lambda * (bc.model[[1]][1] + bc.model[[1]][2] * sort(BUPA$X[,3])) + 1)^(1/lambda))
abline(a=model[[1]][1], b=model[[1]][2], col="red")
points(df, col='blue', pch=1)
legend("bottomright", legend=c("Regular LS", "Box-Cox LS"),
col=c("red", "blue"), lty=1:2, cex=0.8)

8.6.3 Scaling

Numeric variables may have different scales (weights and heights, for instance). Since the variance of a large-range variable is typically greater than that of a small-range variable, leaving the data unscaled may introduce biases, especially when using unsupervised methods (see Machine Learning 101).

It could also be the case that it is the relative positions (or rankings) that are of importance, in which case it becomes important to look at relative distances between levels:

• standardization creates a variable with mean 0 and standard deviation 1: $Y_i=\frac{X_i-\overline{X}}{s_X},$

• normalization creates a variable in the range $$[0,1]$$: $Y_i=\frac{X_i-\min\{X_k\}}{\max \{X_k\}- \min \{X_k\}}.$

These are not the only options. Different schemes can lead to different outputs.
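The two schemes above can be sketched in a few lines (shown in Python with numpy for convenience; the small height sample is made up):

```python
import numpy as np

x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])   # made-up heights, in cm

z = (x - x.mean()) / x.std(ddof=1)        # standardization: mean 0, sd 1
u = (x - x.min()) / (x.max() - x.min())   # normalization: range [0, 1]

print(z.mean(), z.std(ddof=1))   # mean is ~0 (up to floating point), sd is 1
print(u.min(), u.max())          # 0.0 and 1.0
```

Note the use of the sample standard deviation $$s_X$$ (hence `ddof=1`), matching the formula above.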

8.6.4 Discretizing

In order to reduce computational complexity, a numeric variable may need to be replaced with an ordinal variable (height values could be replaced by the qualitative “short”, “average”, and “tall”, for instance).

Of course, what these terms represent depends on the context; Canadian “short” and Bolivian “tall” may be fairly commensurate, to revisit the example at the start of the preceding section.

It is far from obvious how to determine the bins’ limits – domain expertise can help, but it could introduce unconscious bias into the analyses. In the absence of such expertise, limits can be set so that the bins either:

• contain (roughly) the same number of observations;

• have the same width, or

• maximize the performance of some modeling tool.

Again, various choices may lead to different outputs.
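The first two strategies can be contrasted directly (a sketch in Python with numpy; the height data and the choice of three bins are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
heights = rng.normal(loc=170, scale=10, size=300)   # made-up heights, in cm

# equal-width bins: all bins span the same range, counts may differ widely
width_edges = np.linspace(heights.min(), heights.max(), 4)  # 3 bins
width_bins = np.digitize(heights, width_edges[1:-1])

# equal-frequency bins: edges at the tertiles, (roughly) equal counts
freq_edges = np.quantile(heights, [1/3, 2/3])
freq_bins = np.digitize(heights, freq_edges)

print(np.bincount(width_bins))   # unequal counts across bins
print(np.bincount(freq_bins))    # about 100 observations per bin
```

For bell-shaped data, equal-width bins concentrate most observations in the middle bin, while equal-frequency bins place their edges asymmetrically around the mode.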

8.6.5 Creating Variables

Finally, it is possible that new variables may need to be introduced (in contrast with dimensionality reduction). These new variables may arise:

• as functional relationships of some subset of available features (introducing powers of a feature, or principal components, say);

• because the modeling tool may require independence of observations or independence of features (in order to remove multicollinearity, for instance), or

• to simplify the analysis by looking at aggregated summaries (often used in text analysis).

There is no limit to the number of new variables that can be added to a dataset – but consultants should strive for relevant additions.
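A minimal sketch (in Python, with toy data) of the first and last bullet points – adding a power of an existing feature, and an aggregated word-count summary:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=50)
y = 1 + 2*x - 3*x**2 + rng.normal(scale=0.1, size=50)

# functional relationship: add x^2 as a new feature so that a *linear*
# model can capture the quadratic trend
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                          # close to (1, 2, -3)

# aggregated summary: per-document word counts (common in text analysis)
docs = ["the cat sat", "the cat and the dog"]
counts = [len(d.split()) for d in docs]
print(counts)                        # [3, 5]
```

The squared term is a relevant addition because the data-generating process is quadratic; adding powers indiscriminately would only inflate the dataset.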

References

[157]
O. Leduc, A. Macfie, A. Maheshwari, M. Pelletier, and P. Boily, Data Science Report Series, 2020.
[159]
D. Dua and C. Graff, “Liver disorders dataset at the UCI machine learning repository.” University of California, Irvine, School of Information; Computer Sciences, 2017.