11.8 Exercises

  1. What are some examples of supervised and unsupervised learning tasks in the business world? In a public policy/government setting? In a scientific setting?

  2. Assuming that data mining techniques are used in the following cases, identify whether the required task falls under supervised or unsupervised learning.

    1. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers).

    2. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying pattern in prior transactions.

    3. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets with a known threat status.

    4. Identifying segments of similar customers.

    5. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms.

    6. Estimating the repair time required for an aircraft based on a trouble ticket.

    7. Automated sorting of mail by zip code scanning.

    8. It is more difficult and expensive to win new customers than it is to retain existing customers. Scoring each customer on their likelihood to quit can help an organization design effective interventions, such as discounts or free services, to retain profitable customers in a cost-effective manner.

    9. Some medical practitioners conduct unnecessary tests and/or over-bill their government or insurance companies. Using audit data, it may be possible to identify such providers and take appropriate action.

    10. A market basket analysis can help develop predictive models to determine which products often sell together. This knowledge of affinities between products can help retailers create promotional bundles to push non-selling items alongside products that sell well.

    11. Diagnosing the cause of a medical condition is the crucial first step in medical engagement. In addition to the current condition, other factors can be considered, including the patient’s health history, medication history, family history, and other environmental factors. A predictive model can absorb all of the information available to date (for this patient and others) and make probabilistic diagnoses, in the form of a decision tree, taking away most of the guesswork involved.

    12. Schools can develop models to identify students who are at risk of not returning to school. Such students can then be flagged for corrective interventions.

    13. In addition to customer data, telecom companies also store call detail records (CDR), which precisely describe the calling behaviour of each customer. These unique data can be used to profile customers, who can then be marketed to based on the similarity of their CDRs to those of other customers.

    14. Statistically, all equipment is likely to break down at some point in time. Predicting which machine is likely to shut down is a complex process. Decision models to forecast machinery failure could be constructed using past data, which can lead to savings provided by preventative maintenance.

    15. Identifying which tweets contain disinformation and which tweets are legitimate.

  3. Would the results of the Danish medical study be applicable to the Canadian context? To the Chinese context? What do you think some of the ethical/technical challenges were?

  4. Evaluate the following candidate association rules for the British Musical Dataset introduced in Section 11.3.1:

    • If an individual owns a classical music album (\(W\)), then they also own a hip-hop album (\(Z\)), given that \(\text{Freq}(W)=2010\), \(\text{Freq}(Z)=6855\), and \(\text{Freq}(W\cap Z)=132\).

    • If an individual owns both the Beatles’ Sgt. Pepper’s Lonely Hearts Club Band (\(Y\)) and a classical music album (\(W\)), then they were born before 1976 (\(X\)), given that \(\text{Freq}(Y\cap W)=1852\) and \(\text{Freq}(Y\cap W\cap X)=1778\).
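The evaluation in question 4 can be sketched in code. Confidence follows directly from the given frequencies; support and lift additionally require the total number of individuals \(n\) in the dataset, which is not restated here, so the value below is only a placeholder:

```python
def rule_metrics(freq_a, freq_b, freq_ab, n):
    """Support, confidence, and lift for the rule A -> B,
    given the item frequencies and the total population size n."""
    support = freq_ab / n
    confidence = freq_ab / freq_a
    lift = confidence / (freq_b / n)
    return support, confidence, lift

# Rule W -> Z, with the frequencies given in the exercise;
# n = 15000 is a placeholder -- the dataset's true size is needed
# for the support and lift values to be meaningful.
n = 15000
s, c, l = rule_metrics(2010, 6855, 132, n)
print(f"support={s:.4f}, confidence={c:.4f}, lift={l:.2f}")
```

Note that the confidence, \(132/2010 \approx 0.066\), does not depend on \(n\).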

  5. Out of the three rules established so far (\(X\to Y\) from Section 11.3.1, and \(W\to Z\) and \((Y \text{ AND } W)\to X\) from the previous question), which do you think is most useful? Which is most surprising?

  6. A store that sells accessories for smart phones runs a promotion on faceplates. Customers who purchase multiple faceplates from a choice of 6 different colours get a discount. The store managers, who would like to know what colours of faceplates are likely to be purchased together, collected past transactions in the file Transactions.csv. Consider the following rules:

    • {red, white} \(\to\) {green}

    • {green} \(\to\) {white}

    • {red, green} \(\to\) {white}

    • {green} \(\to\) {red}

    • {orange} \(\to\) {red}

    • {white, black} \(\to\) {yellow}

    • {black} \(\to\) {green}

    1. For each rule, compute the support, confidence, interest, lift, and conviction.

    2. Amongst the rules for which the support is positive (\(>0\)), which one has the highest lift? Confidence? Interest? Conviction?

    3. Build an additional 5-10 candidate rules (randomly), and evaluate them. Which of the 12-17 candidate rules do you think would be most useful for the store managers?

    4. How would one determine reasonable threshold values for the support, coverage, interest, and lift of rules derived from a given dataset?
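For exercise 6, the five metrics can be computed by brute force from the raw transactions. A minimal sketch, using a handful of made-up transactions in place of Transactions.csv; note that "interest" is defined in more than one way in the literature, and \(\text{conf}(A\to B) - \text{supp}(B)\) is used here:

```python
# Hypothetical transactions standing in for Transactions.csv
transactions = [
    {"red", "white", "green"},
    {"green", "white"},
    {"red", "green", "white"},
    {"orange", "red"},
    {"black", "green"},
    {"white", "black", "yellow"},
]

def supp(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def evaluate(antecedent, consequent):
    """Support, confidence, lift, interest, and conviction of A -> B."""
    s_ab = supp(antecedent | consequent)
    s_a, s_b = supp(antecedent), supp(consequent)
    conf = s_ab / s_a if s_a else 0.0
    lift = conf / s_b if s_b else float("inf")
    interest = conf - s_b  # one common definition among several
    conviction = (1 - s_b) / (1 - conf) if conf < 1 else float("inf")
    return {"support": s_ab, "confidence": conf, "lift": lift,
            "interest": interest, "conviction": conviction}

print(evaluate({"red", "white"}, {"green"}))
```

The same `evaluate` call can be run over all seven candidate rules (and any additional random candidates) once the real transaction file is loaded.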

  7. Consider the following datasets: GlobalCitiesPBI.csv, 2016collisionsfinal.csv, polls_us_election_2016.csv, HR_2016_Census_simple.csv, and Transactions.csv.

    1. Determine what the data is reporting on / what it is about / create a “data dictionary” to explain the different fields and variables in the dataset.

    2. Develop a list of questions you would like to answer about the data.

    3. Investigate variables (individual, bivariate, multivariate) through charts, distributions, variable interactions, summary statistics, etc.

    4. Do you trust the data or not? Why? If you don’t trust it, flag some potential issues with the data/specific entries.

    5. Conduct an association rule mining analysis of the datasets. Using either the brute force approach or the apriori algorithm, determine 10-20 strong association rules. Visualize them, and interpret their results.

  8. UniversalBank is looking at converting its liability customers (i.e., customers who only have deposits at the bank) into asset customers (i.e., customers who have a loan with the bank). In a previous campaign, UniversalBank was able to convert 9.6% of 5000 of its liability customers into asset customers. The marketing department would like to understand what combination of factors makes a customer more likely to accept a personal loan, in order to better design the next conversion campaign. UniversalBank’s dataset contains data on 5000 customers, including the following measurements: age, years of professional experience, yearly income (in thousands of USD), family size, value of mortgage with the bank, whether the client has a certificate of deposit with the bank, a credit card, etc. Two decision trees are built on a training subset of 3000 records to predict whether a customer is likely to accept a personal loan (1) or not (0).

    1. Explore the UniversalBank.csv dataset. Can you come up with a reasonable guess as to what each of the variables represent?

    2. How many variables are used in the construction of tree \(A\)? Of tree \(B\)?

    3. Are the following decision rules valid or not for trees \(A\) and/or \(B\)?

    • \(\text{IF}\ (\text{Income}\geq 114)\ \text{AND}\ (\text{Education}\geq 1.5) \ \text{THEN}\ (\text{Personal Loan} = 1)\)

    • \(\text{IF}\ (\text{Income}< 92)\ \text{AND}\ (\text{CCAvg}\geq 3) \ \text{AND}\ (\text{CD.Account}< 0.5)\ \text{THEN}\ (\text{Personal Loan} = 0)\)

    4. What prediction would trees \(A\) and \(B\) make for a customer with:

    • a yearly income of 94,000 USD (Income = 94),

    • 2 kids (Family = 4),

    • no certificate of deposit with the bank (CD.Account = 0),

    • a credit card interest rate of 3.2% (CCAvg = 3.2), and

    • a graduate degree in Engineering (Education = 3)?
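The decision rules of part 3 can be checked mechanically against the customer profile of part 4. A minimal sketch (the trees themselves are not reproduced here, only the two candidate rules from the exercise):

```python
# The customer profile from the exercise, as a dict
customer = {"Income": 94, "Family": 4, "CD.Account": 0,
            "CCAvg": 3.2, "Education": 3}

def rule_1(x):
    """IF Income >= 114 AND Education >= 1.5 THEN Personal Loan = 1."""
    if x["Income"] >= 114 and x["Education"] >= 1.5:
        return 1
    return None  # rule does not fire for this customer

def rule_2(x):
    """IF Income < 92 AND CCAvg >= 3 AND CD.Account < 0.5 THEN Personal Loan = 0."""
    if x["Income"] < 92 and x["CCAvg"] >= 3 and x["CD.Account"] < 0.5:
        return 0
    return None  # rule does not fire for this customer

print(rule_1(customer), rule_2(customer))  # -> None None (neither rule fires)
```

With Income = 94, neither antecedent is satisfied (94 < 114 and 94 ≥ 92), so the final predictions must be read off the full trees, not these two rules alone.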

  9. The confusion matrices for the predictions of trees \(A\) and \(B\) on the remaining 2000 testing observations are shown below.

    1. Using the appropriate matrices, compute the 9 performance evaluation metrics for each of the trees (on the testing set).

    2. If customers who would not accept a personal loan get irritated when offered a personal loan, what tree should UniversalBank’s marketing group use to help maintain good customer relations?
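The metric computations in part 1 can be organized into a single function. A minimal sketch with a hypothetical confusion matrix (the exercise's actual matrices are not reproduced here); the nine metrics below are the usual ones, but check them against the set defined in the chapter:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard performance metrics derived from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / n,
        "error rate": (fp + fn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "negative predictive value": tn / (tn + fn),
        "false positive rate": fp / (fp + tn),
        "false negative rate": fn / (fn + tp),
        "F1 score": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical confusion matrix standing in for tree A's test results
print(classification_metrics(tp=110, fp=35, fn=40, tn=1815))
```

For part 2, the false positive rate is the metric to watch: each false positive is a customer irritated by an unwanted loan offer.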

  10. Consider the Algae Bloom Dataset of Section 9.5.3. We try to build a model to predict the presence/absence of algae based on various chemical properties of river water. What is the data science motivation for such a model? After all, we can simply analyze water samples to determine if various harmful algae are present or absent. The answer is simple: chemical monitoring is cheap and easy to automate, whereas biological analysis of samples is expensive and slow. Another answer is that analyzing the samples for harmful content does not provide a better understanding of the drivers of algae blooms: it just tells us which samples contain algae.

    1. Load the data and summarize/visualize it: you will be tasked with predicting the presence/absence of algae a1 and a2.

    2. Clean the data and impute missing values, as needed.

    3. Remove 20% of the observations and save them to a validation set.

    4. Create a training/testing pair on the remaining 80% of the observations and train 2 decision trees to predict the presence/absence of algae a1 and a2, respectively. Evaluate the performance of each model. Which model performs best on your training/testing pair?

    5. Repeat step 4 on at least 20 distinct training/testing pairs. Evaluate the performance of each model, and save them.

    6. For each algae, pick the best of the models (how would you determine this?) and use it to make predictions for the readings in the validation set. Evaluate these predictions.

    7. Instead of picking the best of the 20+ models, find some way to combine the results of all the models and use the combination to make predictions for the readings in the validation set. Evaluate these predictions.

    8. Which of the resulting models of steps 6 or 7 provide the best performance? Which are easier to interpret?
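One simple way to combine the models in step 7 is majority voting over their individual predictions. A minimal sketch, with hypothetical 0/1 predictions from three trees:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions by majority vote.

    `predictions` is a list of lists: one list of 0/1 labels per model,
    all over the same validation observations."""
    combined = []
    for votes in zip(*predictions):       # one tuple of votes per observation
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from 3 trees on 5 validation readings
preds = [
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
]
print(majority_vote(preds))  # -> [1, 0, 1, 1, 0]
```

Averaging predicted class probabilities instead of hard votes is another common combination strategy, and often works better when the individual models are well calibrated.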

  11. Repeat question 10, using the same validation set as in part 3. In part 4, use the remaining 80% of the data to build a decision tree (do not split into a training/testing pair first). Use these models to make predictions for the readings in the validation set. Evaluate these predictions. Is there evidence of overfitting?

  12. Repeat question 10, using the same validation set as in part 3. In parts 4 to 7, use decision stumps (decision trees with only 1 branching point) instead of fully grown trees. Is there evidence of underfitting?

  13. The population of Canada is divided physically into provincial and territorial areas, most of which are further subdivided into health regions. The Census information (from 2016) is available for those health regions. The equivalent 2018 dataset has been clustered to produce peer groups: the result is shown here. The data is found in the file HR_2016_Census_simple.csv.

    1. Load the data and summarize/visualize it (extract the rows with a 4-digit geocode).

    2. Clean the data and impute missing values, if necessary. Scale the data and store the result in a new dataset.

    3. Run the \(k\)-means algorithm (with Euclidean distance) on the scaled data, using ALL the features, for \(k = 3, \ldots, 16\). Use the Davies-Bouldin index and the Within-SS index to determine the optimal number of clusters. Is the optimal clustering scheme plausible?
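The Within-SS part of step 3 can be illustrated with a bare-bones \(k\)-means implementation; a minimal sketch on hypothetical 2D data standing in for the scaled health-region features (in practice, a library routine would be used, together with the Davies-Bouldin index):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Bare-bones k-means; returns (labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centres = list(rng.sample(points, k))
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centre
        labels = [min(range(k), key=lambda j: dist2(p, centres[j]))
                  for p in points]
        # update step: each centre moves to the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(dist2(p, centres[l]) for p, l in zip(points, labels))
    return labels, wss

# Hypothetical data: three well-separated 2D blobs of 20 points each
rng = random.Random(42)
points = [(rng.gauss(mx, 0.3), rng.gauss(my, 0.3))
          for mx, my in [(0, 0), (3, 3), (0, 3)] for _ in range(20)]

# Elbow heuristic: Within-SS drops sharply up to the true k, then flattens
for k in range(2, 7):
    _, wss = kmeans(points, k)
    print(k, round(wss, 2))
```

On the real dataset, this loop would run over \(k = 3, \ldots, 16\), with the Davies-Bouldin index computed alongside the Within-SS at each \(k\).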

  14. Reduce the dimension of the health region dataset by running a principal component analysis (PCA) and keep the principal components that explain up to 80% of the variability in the data. Repeat step 3 of the previous question. Are the results significantly different than they were for question 13?
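The PCA reduction can be sketched via the singular value decomposition, taking "up to 80%" to mean the smallest number of components whose cumulative explained variance reaches 80%; the data below is a hypothetical stand-in for the scaled health-region features, and numpy is assumed available:

```python
import numpy as np

def pca_reduce(X, var_threshold=0.80):
    """Project X onto the fewest principal components explaining
    at least `var_threshold` of the total variance."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; squared singular values are proportional
    # to the variance captured by each component
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)
    m = np.searchsorted(np.cumsum(var_explained), var_threshold) + 1
    return Xc @ Vt[:m].T, var_explained[:m]

# Hypothetical data: 3 features with very unequal variances
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([5.0, 2.0, 0.5])
Z, ve = pca_reduce(X)
print(Z.shape, ve.sum())
```

The reduced matrix `Z` then replaces the scaled features when re-running the \(k\)-means loop of the previous question.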

  15. Run \(k\)-means on the original health regions data (question 13) and on the reduced data (question 14), for the same range of \(k\) values, but replicate the process 30+ times per value of \(k\). What are the optimal \(k\) values in the aggregate runs?

  16. Save the cluster assignments for each run with the optimal values of \(k\) found in question 15. Say that two observations \(A\) and \(B\) have similarity \(w(A,B)\in [0,1]\) if \(A\) and \(B\) lie in the same cluster in a proportion \(w(A,B)\) of the runs. What are some observations with high similarity measurements? With low similarity measurements?
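The similarity measure \(w(A,B)\) can be computed directly from the saved assignments; a minimal sketch with hypothetical cluster labels from four runs over five observations (note that only co-membership matters, not the arbitrary cluster numbers, which need not agree across runs):

```python
from itertools import combinations

# Hypothetical cluster assignments: one list of labels per run
runs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 2],
    [0, 0, 1, 2, 2],
    [2, 2, 0, 0, 1],
]

def similarity(a, b):
    """Proportion of runs in which observations a and b share a cluster."""
    return sum(r[a] == r[b] for r in runs) / len(runs)

for a, b in combinations(range(5), 2):
    print(a, b, similarity(a, b))
```

Collected into a matrix, these pairwise similarities form a consensus (co-clustering) matrix, which can itself be clustered or visualized as a heatmap.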

  17. Provide a \(k\)-means clustering scheme for the UniversalBank dataset.

  18. The remaining exercises use the Gapminder Tools (there is also an offline version).

    1. In the default configuration, we can identify some potential association rules. Using visual and ballpark estimates, evaluate the performance of the following rules:
    • Income \(> 8000\) \(\Rightarrow\) Life Expectancy \(> 70\)

    • Income \(< 8000\) AND Life Expectancy \(< 70\) \(\Rightarrow\) World Region \(=\) Africa

    2. Play around with various charts and variables and identify/evaluate 5+ additional association rules.

    3. Identify groups of similar countries, in 2018 [be sure to validate your groups using various charts].

    4. In the default configuration, follow the trajectories of Finland, Sweden, Iceland, Norway, and Denmark between 1900 and 2018. Do the countries appear to follow similar trajectories? Are there outliers or anomalous trajectories?

    5. Repeat step 4 for Brazil, Paraguay, Uruguay, Venezuela, Colombia, Peru, and Ecuador.

    6. Based on your results in steps 4 and 5, would you expect the trajectory for Argentina to be more like those of the Nordic countries or those of the South American countries? Or perhaps neither? Is your answer the same over all time horizons?