8.7 Example: Algae Blooms
This example is based on a case study by L. Torgo. It provides a concrete illustration of the data preparation process on a realistic dataset:
algae_blooms.csv, which is also available at the UCI Machine Learning Repository.
The ultimate problem is to predict the occurrence of harmful algae in water samples. Torgo also uses it to highlight various aspects of data exploration, data cleaning, and predictive modeling.
Readers who would prefer to try this example on their own are invited to skip this section and head to the first exercise of Section 8.8.
8.7.1 Problem Description
The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.
The dataset which is used to train the learning model consists of:
chemical properties of various water samples of European rivers
the quantity of seven algae in each of the samples, and
the characteristics of the collection process for each sample.
What is the data science motivation for such a model? After all, we can analyze water samples to determine if various harmful algae are present or absent.
The answer is simple: chemical monitoring is cheap and easy to automate, whereas biological analysis of samples is expensive and slow.
Another answer is that analyzing the samples for harmful content does not provide a better understanding of algae drivers: it just tells us which samples contain algae.
8.7.2 Loading the Data
Before we can take a look at the data and begin the process in earnest, we need to load it into the R workspace. The dataset was downloaded from the UCI ML repository and stored locally in the CSV file algae_blooms.csv.
It is also available in Torgo's DMwR package. As we will use some of its functions in this example, we will have to install and load that package itself. Unfortunately, it cannot be installed directly from CRAN with the usual install.packages() call, as the package has been archived.
Instead, we suggest doing the following:
Download the package source DMwR_0.4.1.tar.gz from the DMwR CRAN archive page (additional information about the package is also available there) and save it locally to some folder.
Install the following dependencies directly from CRAN:
The dependencies list might be different, based on the packages already installed locally; any error message in the next step will inform you of the exact dependencies to install.
Install DMwR from the package source:
- Load the package:
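Concretely, the whole installation sequence might look as follows (the archive path is a placeholder for wherever the file was saved, and the dependency list may differ on your machine):

```r
# install the likely dependencies from CRAN (the exact list may vary;
# any error message in the next step will name the missing packages)
install.packages(c("xts", "quantmod", "zoo", "ROCR", "class", "rpart", "abind"))

# install DMwR from the downloaded source archive
# (replace the path below with the folder where the file was saved)
install.packages("DMwR_0.4.1.tar.gz", repos = NULL, type = "source")

# load the package
library(DMwR)
```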
We can get a sense for the data frame's structure by calling the str() function:
```
'data.frame': 340 obs. of 18 variables:
 $ season: chr "winter" "spring" "autumn" "spring" ...
 $ size  : chr "small" "small" "small" "small" ...
 $ speed : chr "medium" "medium" "medium" "medium" ...
 $ mxPH  : num 8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
 $ mnO2  : num 9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
 $ Cl    : num 60.8 57.8 40 77.4 55.4 ...
 $ NO3   : num 6.24 1.29 5.33 2.3 10.42 ...
 $ NH4   : num 578 370 346.7 98.2 233.7 ...
 $ oPO4  : num 105 428.8 125.7 61.2 58.2 ...
 $ PO4   : num 170 558.8 187.1 138.7 97.6 ...
 $ Chla  : num 50 1.3 15.6 1.4 10.5 ...
 $ a1    : num 0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
 $ a2    : num 0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
 $ a3    : num 0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
 $ a4    : num 0 1.9 0 0 0 0 3.9 0 0 2.9 ...
 $ a5    : num 34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
 $ a6    : num 8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
 $ a7    : num 0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...
```
algae_blooms is a data frame with 340 observations of 18 variables each.
3 of the fields are categorical (season, size, speed, which refer to the data collection process)
of the numerical fields, 8 have names that sound vaguely “chemical”
presumably, the remaining fields refer to the algae blooms
We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head() function, say:
8.7.3 Summary and Visualization
As it happens, we are not given an awful lot of information about the dataset’s domain (I, for one, remain woefully ill-prepared to deal with matters of a chemical nature, to my eternal shame).
Data exploration, in the form of summaries and visualization, can provide a better handle on the problem at hand.
A call to the summary() function provides frequency counts for categorical variables, and 6-point summaries for numerical variables. As a bonus, the number of missing values in each field is also tabulated.
IMPORTANT NOTE: we may have given you the impression (just now) that exploration is only really necessary when domain expertise escapes us. Domain expertise can help analysts frame the problem and the analysis results in the appropriate manner, but it often also gives them a false sense of security. Errors can creep in anywhere – data exploration at an early stage may save you a lot of embarrassing back-tracking at a later stage.
The three categorical fields (season, size, speed) are stored as character vectors (Length: 340, Class: character, Mode: character each). The numerical fields are summarized below (transposed for readability):

| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| mxPH | 5.600 | 7.750 | 8.045 | 7.997 | 8.393 | 9.700 | 2 |
| mnO2 | 1.500 | 7.925 | 9.700 | 9.157 | 10.800 | 13.400 | 2 |
| Cl | 0.222 | 10.994 | 32.470 | 42.517 | 57.750 | 391.500 | 16 |
| NO3 | 0.000 | 1.147 | 2.356 | 3.121 | 4.147 | 45.650 | 2 |
| NH4 | 5.00 | 37.86 | 107.36 | 471.73 | 244.90 | 24064.00 | 2 |
| oPO4 | 1.00 | 13.00 | 37.24 | 73.09 | 88.11 | 1435.00 | 2 |
| PO4 | 1.0 | 40.0 | 101.5 | 136.7 | 200.2 | 1690.0 | 7 |
| Chla | 0.200 | 2.133 | 5.111 | 12.796 | 17.200 | 110.456 | 23 |
| a1 | 0.00 | 1.50 | 7.10 | 16.70 | 25.18 | 89.80 | 0 |
| a2 | 0.000 | 0.000 | 2.800 | 7.201 | 10.150 | 72.600 | 0 |
| a3 | 0.000 | 0.000 | 1.400 | 3.904 | 4.625 | 42.800 | 0 |
| a4 | 0.00 | 0.00 | 0.00 | 1.81 | 2.30 | 44.60 | 0 |
| a5 | 0.000 | 0.000 | 2.200 | 5.516 | 8.000 | 61.100 | 0 |
| a6 | 0.000 | 0.000 | 0.000 | 6.411 | 7.025 | 77.600 | 0 |
| a7 | 0.000 | 0.000 | 0.000 | 2.206 | 2.200 | 31.600 | 0 |
The chemical variables all have missing values: most have only 2, but PO4 has 7, Cl has 16, and Chla has 23.
The observations seem fairly uniformly distributed in terms of the seasons, but large rivers and low speed rivers are not represented as often as their counterparts.
All numerical values are non-negative, which makes sense in the context of concentrations
We do not know what the range of the chemical values should take in a real-world context, but some of the maximum values seem … unrealistic (NH4!!, oPO4, a7, etc.)
Does anything else jump out at you?
Of course, these summaries each apply to a single variable (1-way tables). Can we spot anything else using \(n\)-way tables (on the categorical variables)?
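One plausible way to produce such tables is with base R's table() function (the calls below are a sketch matching the output shown):

```r
table(algae_blooms$speed, algae_blooms$size)    # speed by size
table(algae_blooms$speed, algae_blooms$season)  # speed by season
table(algae_blooms$season, algae_blooms$size)   # season by size
table(algae_blooms$season, algae_blooms$size,
      algae_blooms$speed)                       # season by size, for each speed
```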
```
        large medium small
high       13     56    73
low        32     24     2
medium     38     56    46

        autumn spring summer winter
high        32     34     38     38
low         16     13     12     17
medium      32     37     36     35

        large medium small
autumn     19     33    28
spring     21     34    29
summer     19     36    31
winter     24     33    33

, , = high
        large medium small
autumn      3     13    16
spring      3     14    17
summer      3     16    19
winter      4     13    21

, , = low
        large medium small
autumn      9      6     1
spring      7      6     0
summer      6      6     0
winter     10      6     1

, , = medium
        large medium small
autumn      7     14    11
spring     11     14    12
summer     10     14    12
winter     10     14    11
```
The 6-pt summary provides some information about the underlying distribution, but not much on the parametric front. A more traditional summary can be displayed using the describe() function from the psych package:
the categorical variables are marked by an asterisk *; their levels are coded as integers and treated as numerical variables for the purpose of the analysis, so the results for these fields are meaningless
the trimmed column refers to the trimmed mean, the mean obtained when a certain percentage of the observations are removed from both ends of the spectrum (what percentage, exactly?)
the mad column refers to the median absolute deviation (from the median)
We personally find such a table hard to read and really grasp once there are more than a few variables in the dataset. Visualization comes in handy in such cases.
Basic histograms can be constructed with ggplot2's geom_histogram(), for instance.
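For instance, a histogram of mnO2 on the density scale, with a kernel density estimate superimposed, could be obtained as follows (a sketch; the number of bins is arbitrary):

```r
library(ggplot2)
ggplot(algae_blooms, aes(x = mnO2)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +  # histogram on the density scale
  geom_density(colour = "blue") +                            # kernel density estimate
  xlab("Minimum value of O2 (mnO2)")
```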
Based on this histogram, we can conclude that the underlying distribution of
mnO2 has a negative skew, say, which is confirmed by the table above.
While mnO2 clearly does not follow a normal distribution (no negative values, high skew), we see that such an approximation wouldn't be so bad, compared to a1, say.
\(qq\)-plots, another traditional statistical plot, can be produced with the qqPlot() function from the car package, for instance (which also flags the most extreme observations).
(the two most extreme observations in the mnO2 qq-plot: 69 and 70)
Again, we can see that the normal distribution is not a good fit for
mnO2, but that it is even worse for
a1 (see below).
(the two most extreme observations in the a1 qq-plot: 49 and 118)
Let us take a look at some of the odd values for NH4.
```r
library(ggplot2)
ggplot(algae_blooms, aes(x=factor(0), y=NH4)) +  # plotting NH4 from the algae_blooms dataset ...
  geom_boxplot() +                               # ... as a boxplot ...
  geom_rug() +                                   # ... with a rug on which the true values are shown ...
  geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)),
             linetype=2, colour="pink") +        # ... and a horizontal line showing where the mean NH4 value falls ...
  ylab("Ammonium (NH4+)") + xlab("") +
  scale_x_discrete(breaks=NULL)
```
We see that there is a string of values falling way above the boxplot. If the underlying distribution were normal, say, these would definitely be considered outliers.
Let us investigate further.
```r
plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)")  # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1)      # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) +
         sd(algae_blooms$NH4, na.rm=TRUE), lty=2)        # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3)    # median value of NH4, tight dashed line
```
We can also look at the data and see which observations have values of NH4 below 3000 (roughly, all values below the long-dashed line above).
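A hedged sketch with dplyr (note that filter() also drops the rows where NH4 is missing):

```r
library(dplyr)
nrow(filter(algae_blooms, NH4 < 3000))  # observations with NH4 below 3000
nrow(algae_blooms)                      # total number of observations, for comparison
```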
What does the boxplot above look like without the suspected outliers?
It is a bit better, to be sure (the box structure has expanded). There still seems to be some very high values, however. Perhaps that’s to be expected? How would we find out?
Now, let us take a look at some of the algae levels.
What does that tell you? Is it hard to get a good handle on the situation because the seasons are out of sequential order?
We can re-arrange the factors, but it requires a bit of fancy footwork using the fct_relevel() function from the forcats package, and mutate() from dplyr:
```r
algae_blooms = mutate(algae_blooms,
    size=fct_relevel(size, c("small","medium","large")),                # factors should appear in the desired order
    speed=fct_relevel(speed, c("low","medium","high")),                 # ditto
    season=fct_relevel(season, c("spring","summer","autumn","winter"))  # same here
)
ggplot(algae_blooms, aes(x=season, y=a3)) +
  geom_boxplot() +
  xlab("Season") + ylab("Algae Level a3")
```
We only have 1 year’s worth of data, so it might be too early to tell, but it certainly seems as though the
a3 levels decrease from spring to winter.
Violin plots are cousins to the boxplots. Can we get a bit more insight into the seasonal a3 levels with them?
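One way to produce such a plot (a sketch, with jittered points overlaid to show the individual observations):

```r
library(ggplot2)
ggplot(algae_blooms, aes(x = season, y = a3)) +
  geom_violin() +                          # distribution shape for each season
  geom_jitter(width = 0.1, alpha = 0.3) +  # individual observations
  xlab("Season") + ylab("Algae Level a3")
```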
This plot certainly seems to suggest that
a3 levels are cyclic, with a peak in the spring and low levels in the fall.
Let us return to NH4 for a second to see if we can spot a link with the season (as we did for a3). We only keep the observations for which the NH4 value is smaller than 3000, and we bin them with respect to the quartiles.
First, we filter the algae_blooms dataset to remove the 2 observations with missing NH4 values (is this always a good idea?).
Next we remove the 11 observations for which
NH4 > 3000 (again, based on the mean + sd “argument”)
Finally, we add a new variable to tell us in which quartile the NH4 value falls.
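The three steps can be sketched as follows; note that, by default, cut() builds intervals that exclude their left endpoint, which will matter in a moment:

```r
library(dplyr)
f.NH4.data <- filter(algae_blooms, !is.na(NH4)) |>  # remove the observations with missing NH4
  filter(NH4 < 3000) |>                             # remove the observations with NH4 >= 3000
  mutate(q.NH4 = cut(NH4,
                     quantile(NH4, c(0, 0.25, 0.5, 0.75, 1))))  # bin NH4 by quartile
```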
We can now use the new q.NH4 variable to make multi-variate comparisons, say between the a1 levels and the seasons.
That seems decidedly odd… why are we seeing missing values here? Have we not just removed the NAs? Let us delve in a bit deeper.
It seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the cut() call, namely include.lowest=TRUE:
```r
f.NH4.data <- filter(algae_blooms, !is.na(NH4)) |>
  filter(NH4 < 3000) |>
  mutate(q.NH4 = cut(NH4, quantile(NH4, c(0, 0.25, 0.5, 0.75, 1)),
                     include.lowest=TRUE))
ggplot(f.NH4.data, aes(x=a1, y=season, color=season)) +
  geom_point() +
  facet_wrap(~q.NH4) +
  ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
```
The NAs have disappeared from the graph. In any case, it seems as though the
a1 levels are inversely correlated with the
NH4 levels but that the season does not have much of an effect (although we would need more work to conclude this with any degree of certainty).
We can create a similar graph for a3 instead of a1.
8.7.4 Data Cleaning
We found some potential anomalies in the data when we explored the dataset (although we are electing to keep them in the dataset for the time being as we lack the domain expertise to make a reasonable decision on that front); now let us take some time to clean the data to some extent.
Anomalies come in various flavours; we have already explored some potential outlying behaviour, and in this section we turn to missing values.
The function complete.cases() identifies the observations for which every field is present (note that it says nothing about the validity of the recorded values).
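A quick sketch of its use:

```r
table(complete.cases(algae_blooms))                          # TRUE = observation has no missing fields
incomplete <- algae_blooms[!complete.cases(algae_blooms), ]  # the incomplete observations
str(incomplete)
```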
The vast majority of observations do not have missing cases, but a few still do. Is there anything special about them? Are the values missing completely at random?
```
'data.frame': 34 obs. of 18 variables:
 $ season: Factor w/ 4 levels "spring","summer",..: 3 1 4 4 1 3 1 2 3 1 ...
 $ size  : Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speed : Factor w/ 3 levels "low","medium",..: 3 3 1 3 2 2 3 3 2 2 ...
 $ mxPH  : num 6.8 8 NA 6.6 5.6 5.7 6.6 6.6 6.6 6.5 ...
 $ mnO2  : num 11.1 NA 12.6 10.8 11.8 10.8 9.5 10.8 11.3 10.4 ...
 $ Cl    : num 9 1.45 9 NA NA NA NA NA NA NA ...
 $ NO3   : num 0.63 0.81 0.23 3.25 2.22 ...
 $ NH4   : num 20 10 10 10 5 10 20 10 10 10 ...
 $ oPO4  : num 4 2.5 5 1 1 1 1 2 1 2 ...
 $ PO4   : num NA 3 6 6.5 1 4 6 11 6 14 ...
 $ Chla  : num 2.7 0.3 1.1 NA NA NA NA NA NA NA ...
 $ a1    : num 30.3 75.8 35.5 24.3 82.7 16.8 46.8 46.9 47.1 66.9 ...
 $ a2    : num 1.9 0 0 0 0 4.6 0 0 0 0 ...
 $ a3    : num 0 0 0 0 0 3.9 0 0 0 0 ...
 $ a4    : num 0 0 0 0 0 11.5 28.8 13.4 0 0 ...
 $ a5    : num 2.1 0 0 0 0 0 0 0 0 0 ...
 $ a6    : num 1.4 0 0 0 0 0 0 0 1.2 0 ...
 $ a7    : num 2.1 0 0 0 0 0 0 0 0 0 ...
```
For these 34 incomplete observations, the categorical counts are: season (spring: 7, summer: 6, autumn: 8, winter: 13), size (small: 26, medium: 1, large: 7), and speed (low: 4, medium: 13, high: 17). The numerical fields are summarized below:

| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| mxPH | 5.600 | 6.600 | 7.225 | 7.344 | 8.000 | 9.700 | 2 |
| mnO2 | 5.700 | 9.275 | 10.800 | 10.192 | 11.300 | 12.600 | 2 |
| Cl | 0.222 | 4.562 | 9.027 | 19.312 | 25.238 | 71.000 | 16 |
| NO3 | 0.2300 | 0.8025 | 1.4440 | 2.1684 | 2.5725 | 11.0200 | 2 |
| NH4 | 5.00 | 10.00 | 11.84 | 62.03 | 46.38 | 500.00 | 2 |
| oPO4 | 1.000 | 1.000 | 3.667 | 25.676 | 20.250 | 295.667 | 2 |
| PO4 | 1.00 | 6.00 | 10.83 | 34.58 | 19.23 | 380.00 | 7 |
| Chla | 0.30 | 1.78 | 4.00 | 13.97 | 12.22 | 68.05 | 23 |
| a1 | 0.00 | 16.80 | 30.30 | 36.02 | 54.38 | 83.00 | 0 |
| a2 | 0.000 | 0.000 | 0.000 | 4.503 | 3.400 | 36.500 | 0 |
| a3 | 0.000 | 0.000 | 0.000 | 1.418 | 1.125 | 14.600 | 0 |
| a4 | 0.000 | 0.000 | 0.000 | 2.335 | 1.975 | 28.800 | 0 |
| a5 | 0.000 | 0.000 | 0.000 | 1.544 | 0.900 | 21.100 | 0 |
| a6 | 0.000 | 0.000 | 0.000 | 1.138 | 0.000 | 14.500 | 0 |
| a7 | 0.000 | 0.000 | 0.000 | 1.676 | 1.575 | 28.000 | 0 |
Right off the bat, missing cases seem to be over-represented in
small rivers and under-represented in
low-speed rivers. But upon further investigation (that is, comparing with the original dataset), we suspect that the under-representation of
low-speed rivers is not problematic as it is in-line with the numbers in the larger dataset (by which we mean that
low-speed rivers don’t seem to have a systematic missing value problem).
Let us assume for now (in the interest of efficiency) that the fact that small rivers have a lot of missing cases (mostly in Chla) is also not a problem.
The bulk of the missing values come from the Cl, Chla, and PO4 fields. There are also 2 missing values in most of the remaining chemical fields, but we cannot use the summary output to determine whether they arise from the same two observations.
Which observations have missing NH4 values, say?
While these observations also have missing values in other fields, they do have some non-missing fields as well. But they are both missing 6 of the predictor variables. How useful could they be in training a predictive model? (that depends on the model, of course….)
We can write a function that computes how many missing entries there are for each observation.
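A minimal sketch (count_missing is a hypothetical helper name):

```r
# number of missing entries in each row of a data frame
count_missing <- function(df) apply(df, 1, function(row) sum(is.na(row)))

table(count_missing(algae_blooms))       # distribution of per-observation missing counts
which(count_missing(algae_blooms) >= 6)  # the "wild card" observations
```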
 62 199
Most observations have no missing cases, which is great news. There are a few with 1 or 2, but observations 62 and 199 are wild cards, with 6 missing predictors (out of 11). Based on the small number of such wild cards, we elect to remove them from the analysis.
If you decide to remove observations for any reason whatsoever, you need to document the process that led you to eliminate them, and make that documentation available.
Why do we remove the ones with 6 missing cases, but not the ones with 2 missing cases? If there’d been observations with 4 missing cases, what would you have done? What factors influence your decision?
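The removal itself is a one-liner (DMwR also provides manyNAs() to identify rows with too many missing values, if you prefer not to hard-code the indices):

```r
algae_blooms <- algae_blooms[-c(62, 199), ]  # drop the two observations with 6 missing predictors
```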
Our new dataset still contains observations with missing cases, however.
What can we do with the other observations for which values are missing?
One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with another field. That way, if there is a missing value in the first field, the second could be used to impute it.
- this approach only works for variables that are linearly correlated to a single other variable. Non-linear correlations and multi-variate associations will not be uncovered.
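One plausible way to compute and scan the correlation matrix (assuming the numerical fields occupy columns 4 through 18):

```r
cor_mat <- cor(algae_blooms[, 4:18], use = "complete.obs")  # correlations over complete observations
symnum(cor_mat)  # replace the values by symbols for easier visual scanning
```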
The correlation between
PO4 (which has missing cases) and
oPO4 (which does not, anymore) is clear. What is the nature of the relation? We use the set of complete cases to find it.
```
Call:
lm(formula = PO4 ~ oPO4, data = algae_blooms.nona)

Coefficients:
(Intercept)         oPO4
     51.811        1.203
```
The regression function is PO4 ≈ 51.811 + 1.203 oPO4 (we are not particularly interested in the fit statistics at this point).
What are the observations for which
PO4 is missing?
 28 221 291 326 331 335
We can use the regression function to impute the missing PO4 values.
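A sketch of the imputation, plugging the coefficients obtained above into the regression function:

```r
library(dplyr)
algae_blooms <- algae_blooms |>
  mutate(PO4 = ifelse(is.na(PO4), 51.811 + 1.203 * oPO4, PO4))  # impute missing PO4 from oPO4
```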
We can clearly see that no values of
PO4 are missing anymore.
That takes care of the missing values with strong linear correlation to another field. Where do we stand now?
The categorical counts are now: season (spring: 84, summer: 85, autumn: 80, winter: 89), size (small: 120, medium: 136, large: 82), and speed (low: 58, medium: 138, high: 142). The numerical fields:

| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| mxPH | 5.600 | 7.750 | 8.055 | 8.002 | 8.400 | 9.700 | 2 |
| mnO2 | 1.500 | 8.000 | 9.700 | 9.161 | 10.800 | 13.400 | 1 |
| Cl | 0.222 | 10.994 | 32.470 | 42.517 | 57.750 | 391.500 | 14 |
| NO3 | 0.000 | 1.147 | 2.356 | 3.121 | 4.147 | 45.650 | 0 |
| NH4 | 5.00 | 37.86 | 107.36 | 471.73 | 244.90 | 24064.00 | 0 |
| oPO4 | 1.00 | 13.00 | 37.24 | 73.09 | 88.11 | 1435.00 | 0 |
| PO4 | 1.00 | 40.75 | 104.86 | 166.18 | 206.12 | 1777.93 | 0 |
| Chla | 0.200 | 2.133 | 5.111 | 12.796 | 17.200 | 110.456 | 21 |
| a1 | 0.00 | 1.50 | 7.10 | 16.74 | 25.32 | 89.80 | 0 |
| a2 | 0.000 | 0.000 | 2.800 | 7.207 | 10.025 | 72.600 | 0 |
| a3 | 0.000 | 0.000 | 1.400 | 3.917 | 4.675 | 42.800 | 0 |
| a4 | 0.000 | 0.000 | 0.000 | 1.812 | 2.300 | 44.600 | 0 |
| a5 | 0.000 | 0.000 | 2.200 | 5.548 | 8.000 | 61.100 | 0 |
| a6 | 0.000 | 0.000 | 0.000 | 6.438 | 7.075 | 77.600 | 0 |
| a7 | 0.0 | 0.0 | 0.0 | 2.2 | 2.2 | 31.6 | 0 |
There aren’t as many missing values as before, but we still have some. And we have exhausted the correlation trick. What can we do?
There are many ways to tackle the problem, but we will use \(k\)NN imputation. The principle is simple:
Using some similarity/distance metric (typically based on the Euclidean distance between points), identify the \(k\) nearest (complete) neighbours of each observation with a missing case.
Compute the mean value of the field with the missing entry over the \(k\)-group of complete neighbours, and use that value as the imputed value.
- As we have seen previously, we often suggest scaling the data when dealing with distance metrics. We elected not to scale the data explicitly here. How much of an effect can that have?
- We are going to be using DMwR's implementation of knnImputation() (below). How would you go about determining whether the routine scales the data internally?
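The imputation itself is a single call (k = 10 is knnImputation()'s default; we spell it out for definiteness):

```r
library(DMwR)
algae_blooms <- knnImputation(algae_blooms, k = 10)  # kNN imputation of the remaining missing values
sum(!complete.cases(algae_blooms))                   # number of remaining incomplete observations
```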
Registered S3 method overwritten by 'quantmod': method from as.zoo.data.frame zoo
Sure enough, there are no further observations with incomplete cases.
8.7.5 Principal Components
Principal components analysis (PCA) is typically used on the (numeric) predictor variables. There are methods that can be used to combine numeric and categorical variables, but for the purposes of this example, we will simply ignore the categorical fields.
We can scale the data frame using the scale() function in base R.
Notice the different values in the scaled dataset. The principal components are obtained via the princomp() function:
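Assuming the 8 chemical predictors occupy columns 4 through 11, the scaled PCA could be run as follows:

```r
algae_blooms.scaled <- scale(algae_blooms[, 4:11])  # centre and scale the chemical predictors
pca <- princomp(algae_blooms.scaled)                # principal components of the scaled data
summary(pca)
```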
```
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.5334279 1.2022703 1.0900775 0.9329798 0.87500293
Proportion of Variance 0.2947973 0.1812179 0.1489744 0.1091293 0.09598775
Cumulative Proportion  0.2947973 0.4760152 0.6249896 0.7341189 0.83010663
                          Comp.6     Comp.7     Comp.8
Standard deviation     0.8081581 0.65499534 0.52248196
Proportion of Variance 0.0818822 0.05378649 0.03422468
Cumulative Proportion  0.9119888 0.96577532 1.00000000
```
If we can live with roughly 75% of the variability in the numerical component of the predictors being explained by the principal components, then we can reduce the dataset dimension by 4 (from 8 chemical variables to 4 components).
```
  season  size  speed    Comp.1     Comp.2     Comp.3      Comp.4   a1   a2
1 winter small medium 1.0634865  0.2877982  1.6268674  0.26280655  0.0  0.0
2 spring small medium 2.1808329  0.5908937 -2.1258588  1.07795406  1.4  7.6
3 autumn small medium 0.2097857 -0.4229638  0.6216107  0.52245855  3.3 53.6
4 spring small medium 0.5213826  0.7948266 -1.0031723 -1.46518459  3.1 41.0
5 autumn small medium 0.6477956 -0.8568203  1.1386691 -0.78026048  9.2  2.9
6 winter small high   0.1709107 -0.7049677  2.4338593  0.02250525 15.1 14.6
    a3  a4   a5   a6
1  0.0 0.0 34.2  8.3
2  4.8 1.9  6.7  0.0
3  1.9 0.0  0.0  0.0
4 18.9 0.0  1.4  0.0
5  7.5 0.0  7.5  4.1
6  1.4 0.0 22.5 12.6
```
Whether this reduction proves useful or not will ultimately depend on what we would like to accomplish with the data; we will study this dataset again in the Regression and Value Estimation module.