8.8 Exercises

  1. The ability to monitor river algae blooms, and to forecast them early, is crucial to controlling the ecological harm they can cause. The algae_bloom.csv dataset consists of the chemical properties of water samples drawn from various European rivers, the quantities of seven (biological) algae in each sample, and other characteristics of the collection process for each sample.

    1. Identify questions that could be tackled with this dataset.

    2. Determine the structure of the dataset, and provide a summary of its features.

    3. What can you say about the dataset in terms of missing values and the ranges of its variables?

    4. Do 2-way and 3-way tables (for the categorical variables) provide you with additional insights about the dataset?

    5. Provide some simple (univariate and multivariate) visualizations of season, mnO2, NH4, a1, a3, and at least one other variable of your choice.

    6. Does your analysis above suggest that there are anomalies in the dataset? Take action as needed.

    7. Identify the observations (cases) with exactly 1 missing value, exactly 2 missing values, and so on. Are there strategies that would allow you to impute some of these missing values (hint: what is the relationship between PO4 and oPO4, for instance)? Are there observations that should be removed from the dataset altogether? (A code sketch at the end of this exercise illustrates one possible approach.)

    8. Produce a clean dataset to be used in subsequent analyses, justifying each of your decisions.
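
     A minimal pandas sketch of parts 2, 3, 7, and 8 is given below. The file name and the column names (season, size, PO4, oPO4, etc.) are assumed from the exercise statement and may need to be adjusted to the actual file; the row-removal cutoff and the imputation model are illustrative choices rather than prescriptions.

     ```python
     import numpy as np
     import pandas as pd

     # File and column names are assumed from the exercise statement.
     df = pd.read_csv("algae_bloom.csv")

     # Structure and summary of the features.
     print(df.shape, df.dtypes, sep="\n")
     print(df.describe(include="all"))

     # Missing values per column, and ranges of the numeric variables.
     print(df.isna().sum())
     print(df.select_dtypes("number").agg(["min", "max"]))

     # A 2-way table for two of the (assumed) categorical variables.
     print(pd.crosstab(df["season"], df["size"]))

     # Number of observations with exactly 0, 1, 2, ... missing values.
     missing_per_row = df.isna().sum(axis=1)
     print(missing_per_row.value_counts().sort_index())

     # Observations missing many fields are candidates for removal;
     # the cutoff of 5 missing values is an arbitrary threshold.
     df = df[missing_per_row <= 5].copy()

     # PO4 and oPO4 are strongly linearly related, so missing PO4 values
     # can be imputed from oPO4 via a regression fit on the complete cases.
     complete = df.dropna(subset=["PO4", "oPO4"])
     slope, intercept = np.polyfit(complete["oPO4"], complete["PO4"], deg=1)
     to_fill = df["PO4"].isna() & df["oPO4"].notna()
     df.loc[to_fill, "PO4"] = intercept + slope * df.loc[to_fill, "oPO4"]
     ```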

  2. Consider the datasets GlobalCitiesPBI.csv, 2016collisionsfinal.csv, polls_us_election_2016.csv, and HR_2016_Census_simple.csv. For each dataset:

    1. Create a “data dictionary” to explain the different fields and variables. Can you find a source for these datasets online?

    2. Develop a list of questions you would like answered about the datasets.

    3. Investigate the individual variables (through simple charts, univariate statistics, etc.); one way to automate parts of this process is sketched at the end of this exercise.

    4. Repeat the process with bivariate investigations (through simple charts, joint distributions, variable interactions, etc.).

    5. Do you trust the dataset, or not? Support your answer. If you do not trust the dataset, flag potential invalid entries, anomalous observations, missing values, or outliers. How should these entries be treated?

    6. Does any of your analysis suggest that some of the variables should be transformed? Do any of the questions you developed in step 2 support such transformations? If so, transform the data appropriately.
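
     The sketch below shows one generic exploration loop, in pandas, that applies to any of the four files (polls_us_election_2016.csv is used as the example); the outlier and skewness thresholds are arbitrary heuristics, to be tuned to each dataset.

     ```python
     import numpy as np
     import pandas as pd
     import matplotlib.pyplot as plt

     # Any of the four files can be explored with the same recipe.
     df = pd.read_csv("polls_us_election_2016.csv")
     num = df.select_dtypes("number")

     # Univariate investigation: summary statistics and histograms.
     print(df.describe(include="all"))
     for col in num.columns:
         num[col].hist(bins=30)
         plt.title(col)
         plt.show()

     # Bivariate investigation: pairwise correlations of numeric fields.
     print(num.corr())

     # Flag extreme outliers: values more than 3 standard deviations
     # from the mean (a common but arbitrary heuristic).
     z = (num - num.mean()) / num.std()
     print((z.abs() > 3).sum())

     # Heavily skewed, non-negative variables often benefit from a log
     # transformation (np.log1p maps 0 to 0); the skewness cutoff of 2
     # is likewise a heuristic.
     for col in num.columns:
         if num[col].skew() > 2 and (num[col].dropna() >= 0).all():
             df[col + "_log"] = np.log1p(num[col])
     ```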

  3. Repeat the previous exercise with any dataset of your choosing.

  4. The remaining exercises use the Gapminder Tools (an offline version is also available).

    1. Explore the dataset with the Gapminder Tools in its default configuration. Do you think that there could be problems with the reported values? For instance, select Sweden and the United States from the checkbox menu on the right and follow their paths from 1799 to 2018/2020. From what point onwards are the values sensible? What do you think is happening at the start of the time series?

    2. Follow Eritrea for the same duration. Look up the country’s independence date from Ethiopia. What do you think the measurements prior to that date represent?

    3. Follow Austria for the same duration. Look up the historical timeline of the country’s boundaries (Austria-Hungary, Anschluss, modern borders, etc.). What does that imply for the measurements?

    4. Follow Finland for the same duration. What happens in 1809? Does that tell you anything about the way data is coded in the dataset?

    5. De-select all countries and let the simulation run from 1799 to 2018/2020. Can you identify instances where a large subset of the observations behaves in unexpected ways? If so, do you think that this is due to data cleaning/data processing issues?

    6. Continue exploring the dataset. You may change which variables are displayed or work with some of the other visualization methods. Overall, do you think that the dataset is sound? Would you use it to run analyses? What are some of its strengths and weaknesses?