9.1 Data and Charts

As data scientist Damian Mingle once put it, modern data analysis is a different beast:

“Discovery is no longer limited by the collection and processing of data, but rather management, analysis, and visualization.” [160]

What can be done with the data, once it has been collected/processed?

Two suggestions come to mind:

  • analysis is the process by which we extract actionable insights from the data (this process is discussed in later subsections), while

  • visualization is the process of presenting data and analysis outputs in a visual format; visualization of data prior to analysis can help simplify the analytical process; post-analysis, it allows for the results to be communicated to various stakeholders.

In this section, we focus on important visualization concepts and methods; we shall provide examples of data displays to illustrate the various possibilities that might be produced by the data presentation component of a data analysis system.

9.1.1 Pre-Analysis Uses

Even before the analytical stage is reached, data visualization can be used to set the stage for analysis by:

  • detecting invalid entries and outliers;

  • shaping the data transformations (binning, standardization, dimension reduction, etc.);

  • getting a sense for the data (data analysis as an art form, exploratory analysis), and

  • identifying hidden data structures (clustering, associations, patterns which may inform the next stage of analysis, etc.).
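The first two uses can be sketched numerically, before any chart is drawn. The minimal Python sketch below flags outliers with the 1.5 × IQR rule (the rule behind the whiskers of a standard box plot), then standardizes the remaining values. The sample values and the function names are illustrative assumptions; they do not come from a dataset discussed in this section.

```python
from statistics import mean, quantiles, stdev


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Made-up weekly work hours, with one suspicious entry (400).
hours = [35, 38, 40, 40, 41, 42, 45, 400]
suspicious = iqr_outliers(hours)
clean = [h for h in hours if h not in suspicious]
z_scores = standardize(clean)
```

Whether a flagged value is invalid or simply unusual remains a judgment call; the rule only tells us where to look.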

9.1.2 Presenting Results

The crucial element of data presentations is that they need to help convey the insight (or the message); they should be clear, engaging, and (more importantly) readable. Our ability to think of questions (and to answer them) is in some sense limited by what we can visualize.

There is always a risk that, if certain visualization techniques dominate evidence presentations, the questions best suited to those techniques will come to dominate the landscape, which will in turn affect data collection techniques, data availability, future interest, and so forth.

Generating Ideas and Insights

In Beautiful Evidence [161], E. Tufte explains that evidence is presented to assist our thinking processes. He further suggests that there is a symmetry to visual displays of evidence – that visualization consumers should be seeking exactly (and explicitly) what the visualization producers should be providing, namely:

  • meaningful comparisons;

  • causal networks and underlying structure;

  • multivariate links;

  • integrated and relevant data, and

  • a primary focus on content.

More details can be found in Fundamental Principles of Analytical Design.

Selecting a Chart Type

The choice of visualization methods is strongly dependent on the analysis objective, that is, on the questions that need to be answered. Presentation methods should not be selected randomly (or simply from a list of easily-produced templates) [143].

In Figure 9.1 below, F. Ruys suggests various types of visual displays that can be used, depending on the objective:

  • who is involved?

  • where is the situation taking place?

  • when is it happening?

  • what is it about?

  • how/why does it work?

  • how much?

Figure 9.1: Data visualization suggestions, by type of question (F. Ruys, Vizualism.nl).

A general dashboard should at least be able to produce the following types of display:

  • charts – comparison and relation (scatterplots, bubble charts, parallel coordinate charts, decision trees, cluster plots, trend plots)

  • choropleth maps (heat maps, classification maps)

  • network diagrams and connection maps (association rule networks, phrase nets)

  • univariate diagrams (word clouds, box plots, histograms)

9.1.3 Multivariate Elements in Charts

At most two fields can be represented by position in the plane. How can we then represent other crucial elements on a flat computer screen?

Potential solutions include:

  • third dimension

  • marker size

  • marker colour

  • colour intensity and value

  • marker texture

  • line orientation

  • marker shape

  • motion/movie

These elements do not always mix well – efficient design is as much art as it is science.

The following examples, along with concise descriptions of key components and lists of questions that they could help answer, highlight charts’ strengths (and limitations). Some additional diagrams showcasing the four presentation types discussed above are also provided.

Bubble Chart

Example: Health and Wealth of Nations (see Figure 9.2)

Figure 9.2: Health and Wealth of Nations, in 2012 (Gapminder Foundation).

  • Data:

    • 2012 life expectancy in years

    • 2012 inflation adjusted GDP/capita in USD

    • 2012 population for 193 UN members and 5 other countries

  • Some Questions and Comparisons:

    • Can we predict the life expectancy of a nation given its GDP/capita?
      (The trend is roughly linear in the logarithm of GDP/capita: \(\mbox{Expectancy}\approx 6.8 \times \ln (\mbox{GDP/capita}) + 10.6\))

    • Are there outlier countries? Botswana, South Africa, and Vietnam, at a glance.

    • Are countries with a smaller population healthier? Bubble size seems uncorrelated with the axes’ variates.

    • Is continental membership an indicator of health and wealth levels? There is a clear divide between Western Nations (and Japan), most of Asia, and Africa.

    • How do countries compare against world values for life expectancy and GDP per capita? The vast majority of countries fall in three of the quadrants. There are very few wealthy countries with low life expectancy. China sits near the world values, which is expected for life expectancy, but more surprising when it comes to GDP/capita – compare with India.

  • Multivariate Elements:

    • positions for health and wealth

    • bubble size for population

    • colour for continental membership

    • labels to identify the nations

  • Comments:

    • Are life expectancy and GDP/capita appropriate proxies for health and wealth?

    • A fifth element could also be added to a screen display: the passage of time. In this case, how do we deal with countries coming into existence (and ceasing to exist as political entities)?
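The trend quoted in the first question can be explored with a few lines of Python. The sketch below evaluates the fitted relationship as stated in the text, and also shows one common way to map population to bubble size, namely making the bubble area (rather than its radius) proportional to the population. The function names and the reference values used for scaling are assumptions made for illustration; we do not know how the bubbles in Figure 9.2 were actually scaled.

```python
import math


def expected_life_expectancy(gdp_per_capita):
    """Trend quoted in the text: 6.8 * ln(GDP/capita) + 10.6, in years."""
    return 6.8 * math.log(gdp_per_capita) + 10.6


def bubble_radius(population, reference_population=1e9, reference_radius=30.0):
    """Radius for which the bubble *area* is proportional to population.

    The reference values are arbitrary display choices (assumptions).
    """
    return reference_radius * math.sqrt(population / reference_population)


# Under this trend, doubling GDP/capita adds the same number of years
# (6.8 * ln 2) at any income level.
gain_from_doubling = expected_life_expectancy(2000) - expected_life_expectancy(1000)
```

Under this trend, each doubling of GDP/capita is associated with roughly 4.7 additional years of life expectancy, which is why the relationship looks linear only when the wealth axis is logarithmic.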

Choropleth Map

Example: Mean Elevation by U.S. State (see Figure 9.3)

Figure 9.3: Mean elevation by U.S. state, in feet (source unknown); contrast with high resolution elevation map (by Twitter user @cstats).

  • Data: 50 observations (one per state), with mean elevations binned into classes ranging from near sea level (0-250) to the highest class (6000+)

  • Some Questions and Comparisons:

    • Can the mean elevation of the U.S. states tell us something about the global topography of the U.S.? The West has a higher mean elevation, owing to the presence of the Rockies; Eastern coastal states are more likely to suffer from rising water levels, for instance.

    • Are there any states that do not “belong” in their local neighbourhood, elevation-wise? West Virginia and Oklahoma seem to have the “wrong” shade – is that an artifact of the colour gradient and scale in use?

  • Multivariate Elements: geographical distribution and purple-blue colour gradient (as the marker for mean elevation)

  • Comments:

    • Is the ‘mean’ the right measurement to use for this map? It depends on the author’s purpose.

    • Would there be ways to include other variables in this chart? Population density with texture, for instance.
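The question about “wrong” shades is easier to reason about once the binning step is made explicit. The Python sketch below assigns a mean elevation to a colour class using a list of class breaks. Only the lowest (0-250) and highest (6000+) classes are stated in the text; the intermediate breaks, the function names, and the test elevations are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Hypothetical class breaks, in feet. Only the first (0-250) and last (6000+)
# classes appear in the text; the intermediate breaks are assumptions.
BREAKS = [250, 500, 1000, 2000, 4000, 6000]


def elevation_class(mean_elevation_ft, breaks=BREAKS):
    """Return the colour class index (0 = lowest, len(breaks) = highest)."""
    return bisect_right(breaks, mean_elevation_ft)


def class_label(index, breaks=BREAKS):
    """Human-readable legend entry for a class index."""
    if index == 0:
        return f"0-{breaks[0]}"
    if index == len(breaks):
        return f"{breaks[-1]}+"
    return f"{breaks[index - 1]}-{breaks[index]}"


# Two nearly identical (made-up) mean elevations land in different classes
# because they sit on either side of a break.
just_below = class_label(elevation_class(990))
just_above = class_label(elevation_class(1010))
```

With discrete classes, two states whose mean elevations differ by a few feet can receive different shades, while two states that differ by hundreds of feet can share one; a neighbour that looks out of place may simply sit near a break.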

Network Diagram

Example: Lexical Distances (see Figure 9.4).

Figure 9.4: Lexical distance of European languages (T. Elms, [162]).

  • Data:

    • speakers and language groups for 43 European languages

    • lexical distances between languages

  • Some Questions and Comparisons:

    • Are there languages that are lexically closer to languages in other lexical groups than to languages in their own groups? French is lexically closer to English than it is to Romanian, say.

    • Which language has the most links to other languages? English has 10 links.

    • Are there languages that are lexically close to multiple languages in other groups? Greek is lexically close to 5 groups.

    • Is there a correlation between the number of speakers and the number of languages in a language group? Language groups with more speakers tend to have more languages.

    • Does the bubble size refer only to European speakers? The Portuguese bubble appears to be as large as the French one.

  • Multivariate Elements:

    • colour and cluster for language group

    • line style for lexical distance

    • bubble size for number of speakers

  • Comments:

    • How is lexical distance computed?

    • Some language pairs are not joined by links – does this mean that their lexical distance is large enough not to be rendered?

    • Are the actual geometrical distances meaningful? For instance, Estonian is closer to French in the chart than it is to Portuguese – is it also lexically closer?
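Questions such as “which language has the most links?” depend on which links are rendered in the first place. The Python sketch below counts links per node from an edge list, after discarding links above a distance threshold. The node names, the distances, and the threshold are made-up placeholders, and the thresholding itself is an assumption: the text only raises the possibility that distant pairs are not drawn.

```python
from collections import Counter

# Made-up edge list (node_a, node_b, lexical_distance); placeholders only.
EDGES = [
    ("A", "B", 20),
    ("A", "C", 35),
    ("A", "D", 60),
    ("B", "C", 45),
    ("C", "D", 80),
]


def drawn_edges(edges, max_distance):
    """Keep only the links short enough to be rendered."""
    return [(a, b, d) for a, b, d in edges if d <= max_distance]


def link_counts(edges):
    """Number of rendered links attached to each node (its degree)."""
    counts = Counter()
    for a, b, _ in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


visible = drawn_edges(EDGES, max_distance=60)
most_linked, n_links = link_counts(visible).most_common(1)[0]
```

In this toy example, node A has the most links once the longest link is hidden, but it ties with node C when every link is drawn; the answer to the question is partly a product of the rendering choice.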

9.1.4 Visualization Catalogue

Here are some examples of other types of visualizations; more comprehensive catalogues can be found in [143], [163], [166], among others.

Figure 9.5: Decision Tree: classification scheme for the kyphosis dataset (personal file).

Figure 9.6: Histogram of reported weekly work hours (personal file).

Figure 9.7: Decision tree bubble chart: estimated average project effort (in red) overlaid on product complexity, programmer capability, and product count in NASA’s COCOMO dataset (personal file).

Figure 9.8: Association rules network: diagnosis network around COPD in the Danish Medical Dataset [126].

Figure 9.9: Classification scatterplot: artificial dataset (personal file).

Figure 9.10: Classification bubble chart: Hertzsprung-Russell diagram of stellar evolution (European Southern Observatory).

Figure 9.11: Time series: trend, seasonality, shifts of a supply chain metric (personal file).

9.1.5 A Word About Accessibility

While visual displays can help provide analysts with insight, some work remains to be done in regard to visual impairment – short of describing the features/emerging structures in a visualization, graphs can at best succeed in conveying relevant information to a subset of the population.

The onus remains on the analyst to not only produce clear and meaningful visualizations (through a clever use of contrast, say), but also to describe them and their features in a fashion that allows all to “see” the insights. One drawback is that in order for this description to be done properly, the analyst needs to have seen all the insights, which is not always possible. Examples of “data physicalizations” can be found in [167].

References

[126]
A. B. Jensen et al., “Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients,” Nature Communications, vol. 5, 2014, doi: 10.1038/ncomms5022.
[143]
P. Boily, S. Davies, and J. Schellinck, Practical Data Visualization. Data Action Lab/Quadrangle, 2022.
[160]
@DamianMingle, Twitter.
[161]
E. Tufte, Beautiful Evidence. Graphics Press, 2008.
[162]
T. Elms, Lexical Distance of European Languages. Etymologikon, 2008.
[163]
A. Cairo, The Functional Art. New Riders, 2013.
[166]
I. Meirelles, Design for Information. Rockport, 2013.
[167]