References

[1]
H. Wickham and G. Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, 2017.
[2]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2008.
[3]
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer, 2014.
[4]
C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.
[5]
C. C. Aggarwal, Data Mining: The Textbook. Cham: Springer, 2015.
[6]
C. C. Aggarwal, Ed., Data Classification: Algorithms and Applications. CRC Press, 2015.
[7]
I. Stewart, J. Cohen, and T. Pratchett, The Science of Discworld. Ebury Publishing, 2002.
[8]
[9]
[10]
R. Kabacoff, R in Action, 2nd ed. Manning, 2015.
[11]
[12]
R. D. Peng, R Programming for Data Science. Lulu.com, 2012.
[13]
R. Duursma, J. Powell, and G. Stone, A Learning Guide to R. Scribd, 2017.
[14]
J. H. Maindonald, “Using R for Data Analysis and Graphics Introduction, Code and Commentary,” 2004.
[15]
[16]
J. VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data. Sebastopol, CA: O’Reilly Media, Inc., 2016.
[17]
W. McKinney, Python for Data Analysis: Agile Tools for Real-World Data. Sebastopol, CA: O’Reilly, 2013.
[18]
J. Kazil and K. Jarmul, Data Wrangling with Python: Tips and tools to make your life easier. O’Reilly Media, 2016.
[19]
Y. Xie, C. Dervieux, and E. Riederer, R Markdown Cookbook. Boca Raton, Florida: Chapman & Hall/CRC, 2020.
[20]
D. Bertsimas and J. Tsitsiklis, Introduction to Linear Optimization, 1st ed. Athena Scientific, 1997.
[21]
D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[22]
G. Cornuéjols, “Valid inequalities for mixed integer linear programs,” Math. Program., vol. 112, no. 1, pp. 3–44, Mar. 2008.
[23]
H. P. Williams, “Model building in linear and integer programming,” in Computational mathematical programming, 1985, pp. 25–53.
[24]
C. Mar-Molinero, D. Prior, M.-M. Segovia, and F. Portillo, “On centralized resource utilization and its reallocation by using DEA,” Ann. Oper. Res., vol. 221, no. 1, pp. 273–283, 2014.
[25]
S. Lozano and G. Villa, “Centralized resource allocation using data envelopment analysis,” Journal of Productivity Analysis, vol. 22, no. 1, pp. 143–161, 2004.
[26]
E. Ghashim and P. Boily, “A Soft Introduction to Bayesian Data Analysis,” Data Science Report Series, 2020.
[27]
E. T. Jaynes, Probability Theory: the Logic of Science. Cambridge Press, 2003.
[28]
A. Kolmogorov, Foundations of The Theory of Probability. Chelsea Publishing Company, 1933.
[29]
Mathematical Association, UK, “An Aeroplane’s Guide to A Level Maths.”
[30]
Wikipedia, “List of probability distributions,” 2021.
[31]
R. E. Walpole, R. H. Myers, S. L. Myers, and K. Ye, Probability and Statistics for Engineers and Scientists, 8th ed. Pearson Education, 2007.
[32]
R. V. Hogg and E. A. Tanis, Probability and Statistical Inference, 7th ed. Pearson/Prentice Hall, 2006.
[33]
H. Sahai and M. I. Ageel, The Analysis of Variance: Fixed, Random and Mixed Models. Birkhäuser, 2000.
[34]
M. H. Kutner, C. J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models. McGraw Hill Irwin, 2004.
[35]
M. Hollander and D. A. Wolfe, Nonparametric Statistical Methods, 2nd ed. Wiley, 1999.
[36]
P. Bruce and A. Bruce, Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilly, 2017.
[37]
D. S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial (2nd ed.). Oxford Science, 2006.
[38]
M. L. Rizzo, Statistical Computing with R. CRC Press, 2007.
[39]
A. Reinhart, Statistics Done Wrong: the Woefully Complete Guide. No Starch Press, 2015.
[40]
D. S. Sivia and J. Skilling, Data analysis: A Bayesian tutorial (2nd ed.). Oxford Science, 2006.
[41]
E. W. Gibson, “The role of \(p-\)values in judging the strength of evidence and realistic replication expectations,” Statistics in Biopharmaceutical Research, vol. 13, no. 1, pp. 6–18, 2021.
[42]
Survey Methods and Practices, Catalogue no.12-587-X. Statistics Canada.
[43]
[44]
[45]
M. D. Hidiroglou and G. Gray, “A framework for measuring and reducing non-response in surveys,” Survey Methodology, vol. 19, no. 1, pp. 81–94, 1993.
[46]
A. Gower, “Questionnaire design for business surveys,” vol. 20, no. 2, pp. 125–136, 1994.
[47]
R. Latpate, J. Kshirsagar, V. K. Gupta, and G. Chandra, Advanced sampling methods. Springer Nature Singapore, 2021.
[48]
Méthodes de sondage pour les enquêtes statistiques agricoles. Rome: FAO. Développement Statistique.
[49]
S. L. Lohr, Sampling: Design and Analysis. Duxbury Press, 1999.
[50]
Consultancy UK, Types of Consultants.
[51]
[52]
[53]
Data Action Lab, IQC Blog, 2021.
[54]
Project Management Institute, PMP Reference List.
[55]
Project Management Institute, Certifications.
[56]
D. R. Hofstadter, Gödel, Escher, Bach: an Eternal Golden Braid. New York, NY: Basic Books, 1979.
[57]
[58]
[59]
Public Services and Procurement Canada, “ProServices.”
[60]
D. Maister, Managing the Professional Services Firm. Free Press, 1993.
[61]
R. Crandall, Marketing Your Services: For People Who Hate to Sell. McGraw-Hill, 2002.
[62]
D. Maister, C. Green, and R. Galford, The Trusted Advisor. Free Press, 2001.
[63]
S. M. Gerson, “A Teacher’s Guide to Technical Writing.” Kansas Curriculum Center, Washburn University.
[64]
[65]
M. H. Larock, J. C. Tressler, and C. E. Lewis, Mastering Effective English, 4th ed. Copp Clark Professional, 1980.
[66]
G. D. Gopen and J. A. Swan, “The Science of Scientific Writing,” American Scientist, vol. 78, 1990.
[67]
J. M. Williams, Style: Ten Lessons in Clarity and Grace. Pearson, 2004.
[68]
D. Hacker and N. Sommers, The Bedford Handbook, 9th ed. Bedford, 2013.
[69]
[70]
W. Whitman, “When I Heard the Learn’d Astronomer.” 1865.
[71]
Wikipedia, “Astronomy.” 2021.
[72]
The uOttawa Writing Centre, The Parts of the Sentence.
[73]
[74]
L. Torgo, Data Mining with R, 2nd ed. CRC Press, 2016.
[75]
T. H. Davenport and D. J. Patil, “Data Scientist: The Sexiest Job of the 21st Century,” Harvard Business Review, Oct. 2012.
[76]
L. Donnelly, “Robots are better than doctors at diagnosing some cancers, major study finds,” The Telegraph, May 2018.
[77]
N. Bien, P. Rajpurkar, et al., “Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet,” PLOS Medicine, vol. 15, no. 11, pp. 1–19, 2018, doi: 10.1371/journal.pmed.1002699.
[78]
[79]
Columbia University Irving Medical Center, “Data scientists find connections between birth month and health,” Newswire.com, Jun. 2015.
[80]
[81]
S. Reichman, “These AI-invented paint color names are so bad, they’re good,” Curbed, May 2017.
[82]
[83]
[84]
Indiana University, “Scientists use Instagram data to forecast top models at New York Fashion Week,” Science Daily, Sep. 2015.
[85]
J. Hiner, “How big data will solve your email problem,” ZDNet, Oct. 2013.
[86]
[87]
[88]
E. Yong, “Wait, have we really wiped out 60% of animals?” The Atlantic, Oct. 2018.
[89]
[90]
[91]
[92]
S. Ramachandran and J. Flint, “At Netflix, who wins when it’s Hollywood vs. the algorithm?” Wall Street Journal, Nov. 2018.
[93]
[94]
D. Lewis, “An AI-written novella almost won a literary prize,” Smithsonian Magazine, Mar. 2016.
[95]
[96]
T. Rikert, “A.I. hype has peaked so what’s next?” TechCrunch, Sep. 2017.
[97]
J. C. Scott, Against the grain: A deep history of the earliest states. New Haven: Yale University Press, 2017.
[98]
R. Mérou, “Conceptual map of free software.” Wikimedia, 2010.
[99]
Henning (WMDE), “UML diagram of the wikibase data model.” Wikimedia.
[100]
Wooptoo, “Entity–relationship model.” Wikimedia.
[101]
S. L. Lee and D. Baer, “20 cognitive biases that screw up your decisions,” Business Insider, Dec. 2015.
[102]
“Cognitive biases.” The Decision Lab.
[103]
R. Schutt and C. O’Neil, Doing Data Science: Straight Talk from the Front Line. O’Reilly, 2013.
[104]
“Research integrity & ethics.” Memorial University of Newfoundland.
[105]
[106]
J. Schellinck and P. Boily, “Data, automation, and ethics,” Data Science Report Series, 2020.
[107]
“Code of ethics/conducts.” Certified Analytics Professional.
[108]
“Development of national statistical systems.” United Nations, Statistics Division.
[109]
“ACM code of ethics and professional conduct.” Association for Computing Machinery.
[110]
K. Fung, “The ethics conversation we’re not having about data,” Harvard Business Review, Nov. 2015.
[111]
[112]
[113]
R. W. Paul and L. Elder, Understanding the Foundations of Ethical Reasoning, 2nd ed. Foundation for Critical Thinking, 2006.
[114]
“Centre for big data ethics, law, and policy.” Data Science Institute, University of Virginia.
[115]
“Open data.” Wikipedia.
[116]
[117]
[118]
J. S. A. Corey, The Expanse. Orbit Books, 2011–2021.
[119]
[120]
A. Gumbus and F. Grodzinsky, “Era of Big Data: Danger of discrimination,” ACM SIGCAS Computers and Society, vol. 45, no. 3, pp. 118–125, 2015.
[121]
[122]
[123]
I. Asimov, Foundation series. Gnome Press, Spectra, Doubleday, 1942–1993.
[124]
I. Stewart, “The fourth law of humanics,” Nature, vol. 535, 2016.
[125]
J. Cranshaw, R. Schwartz, J. I. Hong, and N. M. Sadeh, “The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City,” in ICWSM.
[126]
A. B. Jensen et al., “Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients,” Nature Communications, vol. 5, 2014, doi: 10.1038/ncomms5022.
[127]
K.-W. Hsu, N. Pathak, J. Srivastava, G. Tschida, and E. Bjorklund, “Data Mining Based Tax Audit Selection: A Case Study of a Pilot Project at the Minnesota Department of Revenue,” in Real World Data Mining Applications, Cham: Springer International Publishing, 2015, pp. 221–245. doi: 10.1007/978-3-319-07812-0_12.
[128]
F. R. Bach and M. I. Jordan, “Learning spectral clustering, with application to speech separation,” J. Mach. Learn. Res., vol. 7, pp. 1963–2001, Dec. 2006.
[129]
H. T. Kung and D. Vlah, “A spectral clustering approach to validating sensors via their peers in distributed sensor networks,” Int. J. Sen. Netw., vol. 8, no. 3/4, pp. 202–208, Oct. 2010, doi: 10.1504/IJSNET.2010.036195.
[130]
[131]
C. Plant et al., “Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease,” NeuroImage, vol. 50, no. 1, pp. 162–174, 2010.
[132]
S. E. Brossette, A. P. Sprague, J. M. Hardin, K. B. Waites, W. T. Jones, and S. A. Moser, “Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance,” Journal of the American Medical Informatics Association, vol. 5, no. 4, pp. 373–381, Jul. 1998, doi: 10.1136/jamia.1998.0050373.
[133]
M. Kosinski and Y. Wang, “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images,” Journal of Personality and Social Psychology, vol. 114, no. 2, pp. 246–257, Feb. 2018.
[134]
J. Taylor, “Four problems in using CRISP-DM and how to fix them,” KDnuggets.com, 2017.
[135]
[136]
D. Woods, “Bitly’s Hilary Mason on ‘What is a data scientist?’,” Forbes, Mar. 2012.
[137]
F. Provost and T. Fawcett, Data Science for Business. O’Reilly, 2015.
[138]
[139]
boot4life, “What JSON structure to use for key-value pairs.” StackOverflow, Jun. 2016.
[140]
[141]
N. Feldman, “Data Lake or Data Swamp?”, 2015.
[142]
P. Hapala et al., “Mapping the electrostatic force field of single molecules from high-resolution scanning probe images,” Nature Communications, vol. 7, no. 11560, 2016.
[143]
P. Boily, S. Davies, and J. Schellinck, Practical Data Visualization. Data Action Lab/Quadrangle, 2022.
[144]
[145]
Wikipedia, “Cluster analysis algorithms.”
[146]
R. Sutton and G. Barto, Reinforcement Learning: an Introduction. MIT Press, 2018.
[147]
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT press Cambridge, 2016.
[148]
Y. Cissokho, S. Fadel, R. Millson, R. Pourhasan, and P. Boily, “Anomaly Detection and Outlier Analysis,” Data Science Report Series, 2020.
[149]
T. Orchard and M. Woodbury, A missing information principle: Theory and applications. University of California Press, 1972.
[150]
S. Hagiwara, “Nonresponse error in survey sampling: Comparison of different imputation methods.” Honours Thesis, School of Mathematics and Statistics, Carleton University, 2012.
[151]
T. Raghunathan, J. Lepkowski, J. Van Hoewyk, and P. Solenberger, “A multivariate technique for multiply imputing missing values using a sequence of regression models,” Survey Methodology, vol. 27, no. 1, pp. 85–95, 2001.
[152]
S. van Buuren, Flexible imputation of missing data. CRC Press, 2012.
[153]
D. B. Rubin, Multiple imputation for nonresponse in surveys. Wiley, 1987.
[154]
P. Boily, “Principles of data collection,” Data Science Report Series, 2020.
[155]
[156]
[157]
O. Leduc, A. Macfie, A. Maheshwari, M. Pelletier, and P. Boily, “Feature selection and dimension reduction,” Data Science Report Series, 2020.
[158]
[159]
D. Dua and C. Graff, “Liver disorders dataset at the UCI machine learning repository.” University of California, Irvine, School of Information and Computer Sciences, 2017.
[160]
@DamianMingle, Twitter.
[161]
E. Tufte, Beautiful Evidence. Graphics Press, 2008.
[162]
T. Elms, Lexical Distance of European Languages. Etymologikon, 2008.
[163]
A. Cairo, The Functional Art. New Riders, 2013.
[164]
A. Cairo, The Truthful Art. New Riders, 2016.
[165]
N. Yau, FlowingData.
[166]
I. Meirelles, Design for Information. Rockport, 2013.
[167]
[168]
Data Action Lab Podcast, Episode 3 - Minard’s March to Moscow, 2020.
[169]
Data Action Lab, Data Analysis Short Course, 2020.
[170]
R. A. Dahl, “Cause and effect in the study of politics,” in Cause and Effect, D. Lerner, Ed. New York: Free Press, 1965, pp. 75–98.
[171]
A. B. Hill, “The environment and disease: Association or causation?” Proc R Soc Med, vol. 58, no. 5, pp. 295–300, 1965.
[172]
Z. Gemignani and C. Gemignani, Data Fluency: Empowering Your Organization with Effective Data Communication. Wiley, 2014.
[173]
Z. Gemignani and C. Gemignani, A Guide to Creating Dashboards People Love to Use. (ebook).
[174]
S. Wexler, J. Shaffer, and A. Cotgreave, The Big Book of Dashboards. Wiley, 2017.
[175]
E. Tufte, The Visual Display of Quantitative Information. Graphics Press, 2001.
[176]
C. Nussbaumer Knaflic, Storytelling with Data. Wiley, 2015.
[177]
Matillion.com, “Poor use of dashboard software.”
[178]
Geckoboard.com, “Two terrible dashboard examples.”
[179]
H. Wickham, D. Navarro, and T. Lin Pedersen, ggplot2: Elegant Graphics for Data Analysis. Springer, 2021.
[180]
H. Wickham, “A layered grammar of graphics,” Journal of Computational and Graphical Statistics, vol. 19, pp. 3–28, 2009.
[181]
[182]
H. Wickham, “Tidy data,” Journal of Statistical Software, vol. 59, no. 10, 2014.
[183]
W. Chang, R Graphics Cookbook. O’Reilly, 2013.
[184]
[185]
[186]
J. Kunigk, I. Buss, P. Wilkinson, and L. George, Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale. O’Reilly Media, 2018.
[187]
T. Malaska and J. Seidman, Foundations for Architecting Data Solutions: Managing Successful Data Projects. O’Reilly Media, 2018.
[188]
M. Kleppmann, Designing Data-Intensive applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, 2017.
[189]
[190]
A. Dutrée, “Data pipelines: What, why and which ones.” Towards Data Science, 2021.
[191]
A. Watt, Database Design. BCCampus, 2014.
[192]
[193]
“Data governance.” Wikipedia.
[194]
“DataOps.” Wikipedia.
[195]
[196]
D. Barber, Bayesian Reasoning and Machine Learning. Cambridge Press, 2012.
[197]
D. Dua and E. Karra Taniskidou, “UCI Machine Learning Repository.” Irvine, CA: University of California, School of Information and Computer Science, 2017.
[198]
S. Canada, “Athlete rebate.”
[199]
E. Siegel, Predictive analytics: The power to predict who will click, buy, lie or die. Predictive Analytics World, 2016.
[200]
E. Garcia, C. Romero, S. Ventura, and T. Calders, “Drawbacks and solutions of applying association rule mining in learning management systems,” 2007.
[201]
Wikipedia, “Association rule learning.” 2020.
[202]
E. R. Omiecinski, “Alternative interest measures for mining associations in databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 1, pp. 57–69, 2003, doi: 10.1109/TKDE.2003.1161582.
[203]
G. Piatetsky-Shapiro, “Discovery, analysis, and presentation of strong rules,” 1991.
[204]
C. C. Aggarwal and P. S. Yu, “A new framework for itemset generation,” in Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems, 1998, pp. 18–24. doi: 10.1145/275487.275490.
[205]
P.-N. Tan, V. Kumar, and J. Srivastava, “Selecting the right objective measure for association analysis,” Inf. Syst., vol. 29, no. 4, pp. 293–313, Jun. 2004, doi: 10.1016/S0306-4379(03)00072-3.
[206]
M. Hahsler and K. Hornik, “New probabilistic interest measures for association rules,” CoRR, vol. abs/0803.0966, 2008.
[207]
[208]
J. Leskovec, A. Rajamaran, and J. D. Ullman, Mining of Massive Datasets. Cambridge Press, 2014.
[209]
M. Risdal, “Exploring survival on the titanic,” Kaggle.com, 2016.
[210]
B. Kitts et al., “Click fraud detection: Adversarial pattern recognition over 5 years at microsoft,” in Annals of information systems (special issue on data mining in real-world applications), Springer, 2015, pp. 181–201. doi: 10.1007/978-3-319-07812-0.
[211]
B. Kitts, “The making of a large-scale ad server,” 2013.
[212]
S. Fefilatyev et al., “Detection of anomalous particles from deepwater horizon oil spill using SIPPER3 underwater imaging platform,” in Data mining case studies IV, proceedings of the 11th IEEE international conference on data mining, Vancouver, BC: IEEE, 2011.
[213]
B. Kitts, “Product targeting from rare events: Five years of one-to-one marketing at CPI,” Marketing Science Conference, 2005.
[214]
T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The LASSO and Generalizations. CRC Press, 2015.
[215]
O. Leduc and P. Boily, “Boosting with AdaBoost and gradient boosting,” Data Action Lab Blog, 2019.
[216]
C. F. Robert, Le choix bayésien - principes et pratique. Springer-Verlag France, 2006.
[217]
B. Efron, Large Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2010.
[218]
A. Ng and K. Soo, Eds., “Surviving a Disaster,” in Numsense!, algobeans, 2016.
[219]
D. H. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural Computation, vol. 8, no. 7, pp. 1341–1390, 1996, doi: 10.1162/neco.1996.8.7.1341.
[220]
D. H. Wolpert and W. G. Macready, “Coevolutionary free lunches,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 6, pp. 721–735, 2005, doi: 10.1109/TEVC.2005.856205.
[221]
J. Chambers and T. Hastie, Statistical Models in S. Wadsworth & Brooks/Cole, 1992.
[222]
E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN,” ACM Trans. Database Syst., vol. 42, no. 3, Jul. 2017, doi: 10.1145/3068335.
[223]
J. d’Huy, “Scientists trace society’s myths to primordial origins,” Scientific American (Online), Sep. 2016.
[224]
U. Habib, K. Hayat, and G. Zucker, “Complex building’s energy system operation patterns analysis using bag of words representation with hierarchical clustering,” Complex Adapt. Syst. Model., vol. 4, p. 8, 2016, doi: 10.1186/s40294-016-0020-0.
[225]
M. Orlowska et al., “A comparison of antioxidant, antibacterial, and anticancer activity of the selected thyme species by means of hierarchical clustering and principal component analysis,” Acta Chromatographica Acta Chromatographica, vol. 28, no. 2, pp. 207–221, 2016, doi: 10.1556/achrom.28.2016.2.7.
[226]
[227]
A. Jawad, K. Kersting, and N. Andrienko, “Where traffic meets DNA: Mobility mining using biological sequence analysis revisited,” in Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems, 2011, pp. 357–360. doi: 10.1145/2093973.2094022.
[228]
G. Schoier and G. Borruso, “Individual movements and geographical data mining. Clustering algorithms for highlighting hotspots in personal navigation routes,” in Computational science and its applications - ICCSA 2011, 2011, pp. 454–465.
[229]
[230]
B. Desgraupes, ClusterCrit: Clustering Indices. 2018.
[231]
Z. Cheng, J. Caverlee, K. Lee, and D. Z. Sui, “Exploring millions of footprints in location sharing services,” in ICWSM, 2011.
[232]
R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 7, pp. 179–188, 1936.
[233]
A. M. Raja, “Penguins dataset overview – iris alternative,” Towards Data Science, Jun. 2020.
[234]
Q. E. McCallum, Bad Data Handbook. O’Reilly, 2013.
[235]
A. K. Maheshwari, Business Intelligence and Data Mining. Business Expert Press, 2015.
[236]
B. Boehmke and B. Greenwell, Hands on Machine Learning with R. CRC Press.
[237]
H. Rosling, O. Rosling, and A. R. Rönnlund, Factfulness: Ten reasons we’re wrong about the world - and why things are better than you think. Hodder & Stoughton, 2018.
[238]
H. Rosling, The health and wealth of nations. Gapminder Foundation, 2012.
[239]
D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, 1997.
[240]
G. E. P. Box, “Use and abuse of regression,” Technometrics, vol. 8, no. 4, pp. 625–629, Nov. 1966.
[241]
T. Hastie, R. Tibshirani, and M. Wainwright, Statistical Learning with Sparsity: The LASSO and Generalizations. CRC Press, 2015.
[242]
F. Chollet, Deep Learning with Python, 1st ed. USA: Manning Publications Co., 2017.
[243]
Wikipedia, “Binary classification,” 2021.
[244]
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[245]
C. Sheppard, Tree-Based Machine Learning Algorithms: Decision Trees, Random Forests, and Boosting. CreateSpace Independent Publishing Platform, 2017.
[246]
T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel Methods in Machine Learning,” Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[247]
N. Deng, Y. Tian, and C. Zhang, Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions. CRC Press/Chapman & Hall, 2013.
[248]
3Blue1Brown, “Deep Learning.”
[249]
[250]
A. Turing, “Computing machinery and intelligence,” Mind, 1950.
[251]
Wikipedia, “Artificial intelligence,” 2020.
[252]
M. Caudill, “Neural networks primer, part 1,” AI Expert, vol. 2, no. 12, pp. 46–52, Dec. 1987.
[253]
Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.” 2014.
[254]
R. S. Sutton, “Two problems with backpropagation and other steepest-descent learning procedures for networks,” 1986.
[255]
S. Ruder, “An overview of gradient descent optimization algorithms.” 2016.
[256]
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” ICLR, 2015.
[257]
R. M. Levenson, E. A. Krupinski, V. M. Navarro, and E. A. Wasserman, “Pigeons (columba livia) as trainable observers of pathology and radiology breast cancer images,” PLOS ONE, vol. 10, no. 11, pp. 1–21, Nov. 2015, doi: 10.1371/journal.pone.0141357.
[258]
J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000.
[259]
M. Grootendorst, “9 distance measures in data science,” Towards Data Science, Feb. 2021.
[260]
M. Harmouch, “17 types of similarity and dissimilarity measures used in data science,” Towards Data Science, Mar. 2021.
[261]
P. Boily and J. Schellinck, “Machine learning 101,” Data Science Report Series, 2021.
[262]
W. Sun, Y. He, and H. Chang, “Regional characteristics of CO2 emissions from China’s power generation: Affinity propagation and refined Laspeyres decomposition,” International Journal of Global Warming, vol. 11, p. 38, Jan. 2017, doi: 10.1504/IJGW.2017.080989.
[263]
L. Wang, X. Zhou, Y. Xing, M. Yang, and C. Zhang, “Clustering ECG heartbeat using improved semi-supervised affinity propagation,” IET Software, vol. 11, Jun. 2017, doi: 10.1049/iet-sen.2016.0261.
[264]
J. Ning, L. Zhang, D. Zhang, and C. Wu, “Interactive image segmentation by maximal similarity based region merging,” Pattern Recognition, vol. 43, pp. 445–456, Feb. 2010, doi: 10.1016/j.patcog.2009.03.004.
[265]
J.-B. Sheu, “An emergency logistics distribution approach for quick response to urgent relief demand in disasters,” Transportation Research Part E: Logistics and Transportation Review, vol. 43, pp. 687–709, Nov. 2007, doi: 10.1016/j.tre.2006.04.004.
[266]
S. Hwang and J.-C. Thill, “Delineating urban housing submarkets with fuzzy clustering,” Environment and Planning B: Planning and Design, vol. 36, pp. 865–882, Sep. 2009, doi: 10.1068/b34111t.
[267]
O. Tominaga, F. Ito, T. Hanai, H. Honda, and T. Kobayashi, “Modeling of consumers’ preferences for regular coffee samples and its application to product design,” Food Science and Technology Research, vol. 8, pp. 281–285, Aug. 2002, doi: 10.3136/fstr.8.281.
[268]
Y. Murat and Z. Cakici, “An integration of different computing approaches in traffic safety analysis,” Transportation Research Procedia, vol. 22, pp. 265–274, Dec. 2017, doi: 10.1016/j.trpro.2017.03.033.
[269]
R. C. Amorim, “Feature relevance in Ward’s hierarchical clustering using the Lp norm,” J. Classif., vol. 32, no. 1, pp. 46–62, Apr. 2015, doi: 10.1007/s00357-015-9167-1.
[270]
“Ward’s method.” Wikipedia.
[271]
L. Vendramin, R. J. G. B. Campello, and E. R. Hruschka, “Relative clustering validity criteria: A comparative overview,” Stat. Anal. Data Min., vol. 3, pp. 209–235, 2010.
[272]
J. M. Lewis, M. Ackerman, and V. R. de Sa, “Human cluster evaluation and formal quality measures: A comparative study,” Cognitive Science, vol. 34, 2012.
[273]
R. Yedida, “Evaluating clusters.” Beginning with ML, 2019.
[274]
N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Dec. 2010.
[275]
U. von Luxburg, “Clustering stability: An overview,” Foundations and Trends in Machine Learning, vol. 2, no. 3, pp. 235–274, 2010, doi: 10.1561/2200000008.
[276]
E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo, “A comparison of extrinsic clustering evaluation metrics based on formal constraints,” Inf. Retr., vol. 12, no. 5, p. 613, 2009.
[277]
T. Lange, M. Braun, V. Roth, and J. Buhmann, “Stability-based model selection,” in Advances in Neural Information Processing Systems (NIPS 2002), Jun. 2003.
[278]
A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Proceedings of the 14th international conference on neural information processing systems: Natural and synthetic, 2001, pp. 849–856.
[279]
Y. Bengio, P. Vincent, and J.-F. Paiement, “Learning Eigenfunctions of Similarity: Linking Spectral Clustering and Kernel PCA,” Département d’informatique et recherche opérationnelle, Université de Montréal, 1232, 2003.
[280]
U. von Luxburg, “A Tutorial on Spectral Clustering,” Stat. Comput., vol. 17, no. 4, pp. 395–416, 2007.
[281]
F. Tung, A. Wong, and D. A. Clausi, “Enabling scalable spectral clustering for image segmentation,” Pattern Recognition, vol. 43, no. 12, pp. 4069–4076, 2010, doi: 10.1016/j.patcog.2010.06.015.
[282]
L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in Neural Information Processing Systems, 2005, vol. 17.
[283]
D. Dueck, “Affinity propagation: Clustering data by passing messages,” Ph.D. Thesis, Jan. 2009.
[284]
B. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, pp. 972–976, Mar. 2007, doi: 10.1126/science.1136800.
[285]
A. Bagnaro, F. Baltar, and G. Brownstein, “Reducing the arbitrary: Fuzzy detection of microbial ecotones and ecosystems – focus on the pelagic environment,” Environmental Microbiome, vol. 15, no. 16, 2020.
[286]
D. Gustafson and W. C. Kessel, “Fuzzy clustering with a fuzzy covariance matrix,” 1978 IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes, pp. 761–766, 1978.
[287]
L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990.
[288]
S. H. Kwon, J. Kim, and S. H. Son, “Improved cluster validity index for fuzzy clustering,” Electronics Letters, vol. 57, no. 21, pp. 792–794, 2021.
[289]
Y. Tang, F. Sun, and Z. Sun, “Improved validation index for fuzzy clustering,” in Proceedings of the 2005, american control conference, 2005., 2005, pp. 1120–1125.
[290]
X. L. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[291]
[292]
Natural Stat Trick, “Ottawa Senators @ Toronto Maple Leafs Game Log.” 2017.
[293]
Wikipedia, “Macbeth.”
[294]
Tvtropes.org, “Laconic Macbeth.”
[295]
W. Morrissette, “Scotland, PA.” 2001.
[296]
J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, no. 5500, p. 2319, 2000.
[297]
S. T. Roweis and L. K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[298]
[299]
scikit-learn.org, “Manifold learning.”
[300]
Y. LeCun, C. Cortes, and C. J. C. Burges, “The MNIST database of handwritten digits.”
[301]
scikit-learn.org, “Manifold learning on handwritten digits.”
[302]
[303]
[304]
“Eta-squared.”
[305]
[306]
[307]
Wikipedia, “Mutual information.”
[308]
Y. Sun and D. Wu, “A RELIEF based feature extraction algorithm,” Apr. 2008, pp. 188–195.
[309]
F. E. Harrell, Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer International Publishing, 2015.
[310]
[311]
Y. Goldberg, Neural Network Methods for Natural Language Processing. Morgan & Claypool, 2017.
[312]
Z. A. Zhao and H. Liu, Spectral Feature Selection for Data Mining. CRC Press, 2011.
[313]
D. Duvenaud, “Automatic Model Construction with Gaussian Processes,” PhD thesis, Computational and Biological Learning Laboratory, University of Cambridge, 2014.
[314]
T. Zhang and R. K. Ando, “Analysis of spectral kernel design based semi-supervised learning,” in Advances in neural information processing systems, 2005, vol. 18.
[315]
A. J. Smola and R. Kondor, “Kernels and regularization on graphs,” in Learning Theory and Kernel Machines, 2003, pp. 144–158.
[316]
T. Gowers, J. Barrow-Green, and I. Leader, Eds., The Princeton Companion to Mathematics. Princeton University Press, 2008.
[317]
L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint, 2018.
[318]
D. Baron, “Outlier Detection.” XXX Winter School of Astrophysics on Big Data in Astronomy, GitHub repository, 2018.
[319]
K. G. Mehrotra, C. K. Mohan, and H. Huang, Anomaly Detection Principles and Algorithms. Springer, 2017.
[320]
[321]
[322]
T. Le, M. T. Vo, B. Vo, M. Y. Lee, and S. W. Baik, “A hybrid approach using oversampling technique and cost-sensitive learning for bankruptcy prediction,” Complexity, 2019, doi: 10.1155/2019/8460934.
[323]
O. Soufan et al., “Mining chemical activity status from high-throughput screening assays,” PloS one, vol. 10, no. 12, 2015, doi: 10.1371/journal.pone.0144426.
[324]
C. C. Aggarwal, Outlier Analysis. Springer International Publishing, 2016.
[325]
“Outlier Detection: A Survey.” Technical Report TR 07-017, Department of Computer Science and Engineering, University of Minnesota, 2007.
[326]
V. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artif. Intell. Rev., vol. 22, no. 2, pp. 85–126, 2004.
[327]
C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” SIGMOD Rec., vol. 30, no. 2, pp. 37–46, 2001, doi: 10.1145/376284.375668.
[328]
P. D. Talagala, R. J. Hyndman, and K. Smith-Miles, “Anomaly detection in high dimensional data.” arXiv, 2019. doi: 10.48550/ARXIV.1908.04000.
[329]
E. Muller, I. Assent, U. Steinhausen, and T. Seidl, “OutRank: Ranking Outliers in High-Dimensional Data,” in 2008 IEEE 24th International Conference on Data Engineering Workshop, 2008, pp. 600–603. doi: 10.1109/ICDEW.2008.4498387.
[330]
S. Kandanaarachchi and R. Hyndman, “Dimension reduction for outlier detection using DOBIN.” Sep. 2019. doi: 10.13140/RG.2.2.15437.18403.
[331]
C. C. Aggarwal and S. Sathe, Outlier Ensembles: An Introduction. Springer International Publishing, 2017.
[332]
F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422.
[333]
S. Hariri, M. Carrasco Kind, and R. J. Brunner, “Extended isolation forest,” IEEE Transactions on Knowledge and Data Engineering, 2019.
[334]
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[335]
R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-Based Clustering Based on Hierarchical Density Estimates,” in Advances in Knowledge Discovery and Data Mining, 2013, pp. 160–172.
[336]
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” SIGMOD Rec., vol. 29, no. 2, pp. 93–104, 2000.
[337]
J. Zhang, M. Lou, T. W. Ling, and H. Wang, “Hos-miner: A System for Detecting Outlyting Subspaces of High-Dimensional Data,” in Proceedings of the Thirtieth International Conference on Very Large Data Bases, 2004, pp. 1265–1268.
[338]
H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, “Outlier detection in axis-parallel subspaces of high dimensional data,” in Advances in Knowledge Discovery and Data Mining, 2009, pp. 831–838.
[339]
E. Müller, I. Assent, P. Iglesias, Y. Mülle, and K. Böhm, “Outlier ranking via subspace analysis in multiple views of the data,” in 2012 IEEE 12th International Conference on Data Mining, 2012, pp. 529–538. doi: 10.1109/ICDM.2012.112.
[340]
E. Müller, M. Schiffer, and T. Seidl, “Adaptive outlierness for subspace outlier ranking,” in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010, pp. 1629–1632.
[341]
E. Muller, M. Schiffer, and T. Seidl, “Statistical selection of relevant subspace projections for outlier ranking,” in Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, 2011, pp. 434–445. doi: 10.1109/ICDE.2011.5767916.
[342]
C. Chen and L.-M. Liu, “Joint estimation of model parameters and outlier effects in time series,” Journal of the American Statistical Association, vol. 88, pp. 284–297, 1993.
[343]
[344]
D. Whitenack, Machine Learning with Go. Packt Publishing, 2017.
[345]
L. McInnes, J. Healy, and S. Astels, “How HDBSCAN works.” 2016.
[346]
J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, “Capabilities of outlier detection schemes in large datasets, framework and methodologies,” Knowl. Inf. Syst., vol. 11, no. 1, pp. 45–84, Jan. 2007.
[347]
Z. He, S. Deng, X. Xu, and J. Z. Huang, “A fast greedy algorithm for outlier mining,” in Advances in Knowledge Discovery and Data Mining, 2006, pp. 567–576.
[348]
A. Lazarević and V. Kumar, “Feature bagging for outlier detection,” in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 157–166. doi: 10.1145/1081870.1081891.
[349]
A. Archimbaud, “Détection non-supervisée d’observations atypiques en contrôle de qualité : Un survol,” Journal de la Société Française de Statistique, vol. 159, no. 3, pp. 1–39, 2018.
[350]
T. Hastie, “Leukemia dataset.”
[351]
Z. He, S. Deng, and X. Xu, “A unified subspace outlier ensemble framework for outlier detection,” in Advances in Web-Age Information Management, 2005, pp. 632–637.
[352]
S. Munzert, C. Rubba, P. Meiner, and D. Nyhuis, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, 2nd ed. Wiley Publishing, 2015.
[353]
R. Mitchell, Web Scraping with Python: Collecting Data From the Modern Web, 2nd ed. O’Reilly Media, 2018.
[354]
[355]
[356]
K. Jarmul, “Natural Language Processing Fundamentals in Python.” DataCamp.
[357]
[358]
[359]
[360]
[361]
T. Bayes, “An essay towards solving a problem in the doctrine of chances,” Phil. Trans. of the Royal Soc. of London, vol. 53, pp. 370–418, 1763.
[362]
R. T. Cox, “Probability, Frequency, and Reasonable Expectation,” American Journal of Physics, vol. 14, no. 1, 1946.
[363]
N. Silver, The Signal and the Noise. Penguin, 2012.
[364]
T. Oliphant, “A Bayesian perspective on estimating mean, variance, and standard-deviation from data,” vol. 278. All Faculty Publications, BYU, 2006.
[365]
Wikipedia, “Conjugate priors.”
[366]
J. K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.). Academic Press, 2011.
[367]
[368]
[369]
A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian Data Analysis, 3rd ed. CRC Press, 2013.
[370]
[371]
[372]
R. A. Poldrack, “Can cognitive processes be inferred from neuroimaging data?” Trends Cogn. Sci., 2006.
[373]
D. Hitchcock, Introduction to Bayesian Data Analysis (course notes). Department of Statistics, University of South Carolina, 2014.
[374]
A. K. Erlang, “The theory of probabilities and telephone conversations,” Nyt Tidsskrift for Matematik B, 1909.
[375]
R. Berry, Queueing Theory and Applications, 2nd ed. PWS/Kent Publishing, 2002.
[376]
L. Kleinrock, Queueing Systems, Volume I. Wiley, 1974.
[377]
S. M. Ross, Introduction to Probability Models, 11th ed. San Diego, CA, USA: Academic Press, 2014.
[378]
C. Newell, Applications of Queueing Theory. Springer Netherlands, 2013.
[379]
D. G. Kendall, “Stochastic Processes Occurring in the Theory of Queues and their Analysis by the Method of the Imbedded Markov Chain,” The Annals of Mathematical Statistics, vol. 24, no. 3, pp. 338–354, 1953, doi: 10.1214/aoms/1177728975.
[380]
W. L. Winston, Operations Research: Applications and Algorithms. Cengage Learning, 2022.
[381]
“Management science and the gas shortage,” Interfaces, vol. 4, no. 4, pp. 47–51, Aug. 1974.

  1. In all cases, we have attempted to properly cite and give credit where it is due. Get in touch if you find omissions!↩︎

  2. We’re not saying that we won’t be adding examples in different languages in the future, but let’s not get ahead of ourselves.↩︎

  3. In the parlance of the field, let us simply say that some of the details are left as an exercise for the reader (and can also be found in the numerous references).↩︎

  4. Most programmers do not consider R to be a programming language. If they are feeling generous, they might dub it a scripting language, at best. But it gets the job done for data analysis purposes.↩︎

  5. The proposed solution does not need to be final.↩︎

  6. Consider the change from Python 2 to Python 3 as a cautionary tale.↩︎

  7. Fair warning: some coder communities can be … let us say, not overly welcoming of neophytes. It is not unusual for the answer to a question to be some variation on “look it up in the documentation”. While this can be true in a general sense, such an answer is useless. We all know that things can be looked up in the documentation. And we all know that some users ask questions without taking the time to think about things, or in the hope that somebody else will do their work for them. It is in the best interest of learners to seek communities that make a concerted effort to be healthy and inclusive, to recognize that not every user has reached the same proficiency level. Such communities are plentiful online; do not waste any time and energy on gatekeepers.↩︎

  8. What is XQuartz and why do macOS users need it?↩︎

  9. There are 3 other such symbols, but no language needs 2 assignment operators, let alone 5, so we will not introduce them here.↩︎
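A minimal sketch of these assignment symbols in an R session (the examples are ours, not reproduced from the main text):

```r
x <- 3    # the usual right-to-left assignment
y = 3     # also assigns, with subtle differences inside function calls
3 -> z    # one of the "other" symbols: left-to-right assignment

c(x, y, z)   # 3 3 3
```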

  10. That can cause unforeseen difficulties as it is not always easy to distinguish between a real number (numeric) and an integer visually. Furthermore, the digits of a number can be represented as character strings in some cases.↩︎
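A quick way to see how R is actually storing a value (a small illustrative sketch):

```r
typeof(2)     # "double"    -- a real number (numeric), even though it prints as 2
typeof(2L)    # "integer"   -- the L suffix forces integer storage
typeof("2")   # "character" -- digits stored as a string
2 == "2"      # TRUE, because of implicit coercion -- another potential surprise
```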

  11. The read.ssd() function will only work if SAS is installed locally, however.↩︎
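For reference, a call might look as follows; this is a hedged sketch only, with a hypothetical library directory sasdata/ and dataset name mydata, and it requires both the foreign package and a working local SAS installation:

```r
library(foreign)

# read.ssd() shells out to SAS, so it fails when SAS is not installed locally
df <- read.ssd(libname = "sasdata",      # directory containing the SAS dataset (hypothetical)
               sectionnames = "mydata",  # dataset name, without the extension (hypothetical)
               sascmd = "sas")           # command used to invoke SAS
head(df)
```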

  12. This is not a very interesting function as the standard multiplication * is already defined in R, but this is just an illustration of the functionality.↩︎
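The function in question is not reproduced here; a hypothetical stand-in would look something like this:

```r
# a toy function that merely re-creates what * already does
my_multiply <- function(a, b) {
  a * b
}

my_multiply(6, 7)   # 42, same as 6 * 7
```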

  13. See [1] for everything there is to know about pipelines and tidy data.↩︎

  14. We do not explicitly state the dplyr::xyz dependency since we already had to load the dplyr package to gain access to the pipeline operator |> in the first place.↩︎

  15. Note that these examples require Python 3.5 or higher.↩︎

  16. range provides an example of an iterable. One way to think of an iterable is that it provides a mechanism for generating a sequence of elements one at a time. The benefit is that range(100000), for example, does not take up much computation time since no actual element is generated until it is iterated over.↩︎

  17. The function is anonymous because it has no name.↩︎

  18. The number of observations can also be specified in the head() method.↩︎

  19. There are other means; see R Interface to Python and Five ways to work seamlessly between R and Python in the same project for more information.↩︎

  20. Events can be represented graphically using Venn diagrams – mutually exclusive events are those which do not have a common intersection.↩︎

  21. This is a purely mathematical definition, but it agrees with the intuitive notion of independence in simple examples.↩︎

  22. Is it clear what is meant by “independent tosses”?↩︎

  23. What are some realistic values of \(p\)?↩︎

  24. There is nothing to that effect in the problem statement, so we have to make another set of assumptions.↩︎

  25. But why would we install a module which we know to be unreliable in the first place?↩︎

  26. For the purpose of these notes, a discrete set is one in which all points are isolated: \(\mathbb{N}\) and finite sets are discrete, but \(\mathbb{Q}\) and \(\mathbb{R}\) are not.↩︎

  27. Such as # of defects on a production line over a \(1\) hr period, # of customers that arrive at a teller over a \(15\) min interval, etc.↩︎

  28. Although it would still be a good idea to learn how to read and use them.↩︎

  29. In theory, this cannot be the true model as this would imply that some of the wait times could be negative, but it may nevertheless be an acceptable assumption in practice.↩︎

  30. The statement from the previous footnote applies here as well – we will assume that this is understood from this point onward.↩︎

  31. This level of precision is usually not necessary – it is often sufficient to simply present the interval estimate: \(a\in (1.64,1.65)\)↩︎

  32. The binomial probabilities are not typically available in textbooks (or online) for \(n=36\), although they could be computed directly in R, such as with pbinom(12,36,0.5)=0.0326.↩︎
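As a sanity check, the quoted value can be reproduced in one line of R:

```r
# P(X <= 12) for X ~ Binomial(n = 36, p = 0.5)
pbinom(12, size = 36, prob = 0.5)   # approximately 0.0326, as quoted above
```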

  33. Note that the covariance could be negative, unlike the variance.↩︎

  34. If the scores did arise from a normal distribution, the \(\approx\) would be replaced by a \(=\).↩︎

  35. How would we verify that these distributions indeed have the right characteristics? How would we determine the appropriate parameters in the first place?↩︎

  36. Like the CLT, this is a limiting result.↩︎

  37. The probability density function of \(t(\nu)\) is \[f(x)=\frac{\Gamma(\nu/2+1/2)}{\sqrt{\pi \nu}\Gamma(\nu/2)(1+x^2/\nu)^{\nu/2+1/2}}.\]↩︎

  38. In statistical parlance, we say that 1 degree of freedom is lost when we use the sample to estimate the sample mean.↩︎

  39. Outlier analysis (and anomaly detection) is its own discipline – an overview is provided in Module @(ADOA).↩︎

  40. In theory, this definition only applies to normally distributed data, but it is often used as a first pass for outlier analysis even when the data is not normally distributed.↩︎

  41. In general, upper case letters are reserved for a general sample, and lower case letters for a specifically observed sample.↩︎

  42. This less than intuitive interpretation of the confidence interval is one of the disadvantages of using the frequentist approach; the analogous concept in Bayesian statistics is called the credible interval, which agrees with our naïve expectation of a confidence interval as saying something about how certain we are that the true parameter is in the interval [26], [40].↩︎

  43. Sampling strategies can also help, but this is a topic for another module.↩︎

  44. Remember, when \(\sigma\) is known (and \(n\) is large enough), we already know from the CLT that \(Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\) is approximately \(\mathcal{N}(0,1).\)↩︎
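A small simulation illustrates the claim (a sketch with an arbitrary skewed population and sample size):

```r
set.seed(0)
n <- 50                # sample size (arbitrary)
mu <- 1; sigma <- 1    # mean and standard deviation of an Exp(1) population

# 10,000 standardized sample means from a skewed (exponential) population
z <- replicate(10000, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))

c(mean(z), sd(z))            # close to 0 and 1
hist(z, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE)  # the N(0,1) density matches the histogram closely
```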

  45. The crisis concerns the prevalence of positive findings that are contradicted in subsequent studies [41].↩︎

  46. “Even more extreme”, in this case, means further to the left, so that \(p\mbox{-value}=P(Z\leq z_0)=\Phi(z_0),\) where \(z_0\) is the observed value for the \(Z\)-test statistic.↩︎
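In R, with a hypothetical observed statistic z0, this left-tail p-value is simply:

```r
z0 <- -2.1     # hypothetical observed value of the Z-test statistic
pnorm(z0)      # P(Z <= z0) = Phi(z0)
```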

  47. In order to avoid the controversy surrounding the crisis of replication?↩︎

  48. Which, it is worth recalling, is not the same as accepting the null hypothesis.↩︎

  49. That is to say, the treatment explains part of the difference in the observed group means.↩︎

  50. As the spread about the group means is fairly large (relatively-speaking), we suspect that the treatment-based model on its own does not capture all the variability in the data.↩︎

  51. If a difference is apparent and we cannot conclude that the variances are constant across groups, we need to apply a variance stabilising transformation, such as a logarithmic transformation or square-root transformation before proceeding.↩︎
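A quick illustration with simulated data (the lognormal choice below is only meant to produce group spreads that grow with the group means):

```r
set.seed(1)

# three groups whose spread increases with their mean
g <- factor(rep(c("A", "B", "C"), each = 40))
y <- rlnorm(120, meanlog = rep(c(1, 2, 3), each = 40), sdlog = 0.5)

tapply(y, g, sd)        # group standard deviations are far from constant
tapply(log(y), g, sd)   # after a log transformation, they are comparable
```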

  52. The medication may have strong side-effects which cannot be ignored.↩︎

  53. The difference may be due to the difficulty/high cost of data collection for some units excluded from the study population.↩︎

  54. Fancy footwork might be required to overcome the challenges presented by the guidelines, but that is par for the course.↩︎

  55. Be careful not to confuse the unit \(u_j\) with its response value \(y_j\).↩︎

  56. Will this always be the case?↩︎

  57. Recall that \(s^2\) is a biased estimator of \(\sigma^2\) in a SRS.↩︎

  58. The interval is “valid”, but it is perhaps too wide to be of practical use. We will discuss ways to improve the prediction in future sections.↩︎

  59. It is evidently not the only one, as \(\overline{y}_{\text{SRS}}\) is also such an estimator.↩︎

  60. We will continue the StS estimation procedure, for illustration purposes, but in practice, this is the stage at which we would require a different stratification or another sampling plan altogether.↩︎

  61. In general, we do not stratify with respect to the variable of interest, but with the help of auxiliary variables that are linked to the variable of interest.↩︎

  62. This corresponds to a tighter (smaller) C.I.↩︎

  63. As we have noticed several times, the confidence interval can of course change depending on the sample taken.↩︎

  64. Warning: Even if formal manipulations can still be performed, the estimate may not be valid if the relationship between the variables \(X\) and \(Y\) is not linear.↩︎

  65. I know, I know.↩︎

  66. There are, to be sure, important differences: quantitative consultants do not have to be data people, and the relationship between employers/stakeholders and employees (a position held by quite a few data scientists) is distinct in nature from that between client and consultant, but there are enough similarities for the analogy to be useful. Failing that, it could be a good idea for data analysts and data scientists to get a sense for what motivates the consultants that might be brought in by their employers.↩︎

  67. Typically, the available time is quite short.↩︎

  68. It definitely was for the author of this document.↩︎

  69. Many newly-minted consultants and data scientists have not had enough experience with effective team work, and they are likely to underestimate the challenges that usually arise from such an endeavour.↩︎

  70. Note that individuals can play more than one role on a team.↩︎

  71. They may also need to shield the team from clients/stakeholders.↩︎

  72. Marketing is analogous to dating in this manner – you have to put yourself out there.↩︎

  73. Exactly what constitutes illegitimate behaviour is not always easy to determine, and may vary from one client to the next, but lies and misrepresentations are big no-nos.↩︎

  74. It is recommended that consultants stay up-to-date on these technologies; a principled stand against a new tech may garner support in an echo chamber, but it can also mark you as out-of-touch with a younger and more general audience.↩︎

  75. Note that if you are going to base an article off of a project, you should make sure to obtain client permission first.↩︎

  76. At the very least, consider wearing slacks/skirt, dress shirt, belt, dress shoes. After the first meeting, you can adjust as necessary.↩︎

  77. Ask for permission before recording anything.↩︎

  78. Nobody we have ever met, at least.↩︎

  79. The military imagery is intentional.↩︎

  80. Never call it a client error!↩︎

  81. In dating terms: will they still respect themselves in the morning?↩︎

  82. WARNING: nobody here is likely to be a lawyer. Get legal advice from actual lawyers, please.↩︎

  83. Note that, in Canada at least, the specifics of contracting and insurance depend on the jurisdictions in which the client and/or the consultants operate and in which the product/service is delivered.↩︎

  84. It is infinitely preferable to realize this before the contract is signed; the client is under no obligation to accommodate requests for extensions after an agreement has been reached.↩︎

  85. Implicit assumptions made at various stages, either by the consultant, the client, or both. Implicit assumptions are not necessarily invalid – problems arise when they are not shared by all parties (a gap which may only reliably be discovered by attempting to gather explicit information).↩︎

  86. See [51], [57], [58] and the entirety of your degree(s) for more information … as well as all the other modules in this book.↩︎

  87. And not a moment too soon, if you ask us.↩︎

  88. Let it be said one last time: the best academic or theoretical solution may not be an acceptable solution in practice.↩︎

  89. Code that does not work as it should when it should does not look very good on analysts and consultants.↩︎

  90. Surprisingly, this is a step that some consultants have a difficult time doing – a possible explanation of this bizarre phenomenon can be found in the accompanying video. ↩︎

  91. There is no right or wrong answer here – remember the dating analogy: consultants have agency.↩︎

  92. Fair warning: this process could be quite painful for the consultant/analyst’s ego. Introspection is one thing when it is done with the team; being criticized by the client can prove quite unpleasant, even when it is not done with malice.↩︎

  93. Names and identifying details have been removed to preserve privacy, but note the extent to which the dating analogy remains applicable.↩︎

  94. Some basic prep work can still be conducted, however, but not at the expense of projects that have officially been agreed to.↩︎

  95. Take the time to document attempts at reaching the client (email, phone calls, supervisors, etc.); this could come in handy at a later stage.↩︎

  96. That way, the client feels like they are doing something, and they may stop interrupting the team with unreasonable requests.↩︎

  97. We are not talking about miscommunication or honest mistakes, here – some clients have a track record of abusing consultants.↩︎

  98. These vehicles require a lot of administrative set-up on the part of the consultants, in Canada, at least [59].↩︎

  99. We’re not sure why that is the case, to be honest – if an organization does not trust its internal experts, they are not hiring the right employees, and that is entirely on them.↩︎

  100. There is nothing wrong with clients asking for the timeline to be revisited, and if the consultants can accommodate the new deadlines (in terms of resource availability), they should consider doing so. But the clients should not assume that a change is forthcoming just because the client’s deadlines have changed.↩︎

  101. Always in a polite manner, of course.↩︎

  102. Validation protocols should be in place, at any rate.↩︎

  103. Most consulting work is unsuitable for publication, in our experience.↩︎

  104. The dating analogy rears its head again: there are plenty of fish in the sea. All else being equal, clients prefer their consultants to be friendly rather than annoying.↩︎

  105. When we were students, there were barely any business applications for machine learning, for instance.↩︎

  106. This could be a gross generalization, but I cannot find any other reasonable explanation for the reticence that math/stats people have to engage in BD.↩︎

  107. The analytical work still has to be conducted properly, however!↩︎

  108. Crucially, this is a 2-way street: consultants also should be seeking clients they can trust.↩︎

  109. Consultants providing what has been agreed upon.↩︎

  110. As long as these originate with the consultant; when it is the client that asks for more, then there is the danger of scope creep.↩︎

  111. Flexibility is the consultant’s ally, however: there are instances where it makes more sense for the consultant to walk away (subject to contractual obligations, of course).↩︎

  112. Again with the dating analogy.↩︎

  113. Do we need to say it?↩︎

  114. Unless it has already been established that the consultant is away for a longer time period, which is allowed, of course – health and family first, always!↩︎

  115. Obviously, the rules differ from one language to the other.↩︎

  116. Ironic, we know.↩︎

  117. There are parallels with fashion and gastronomy: sometimes we need to wear a fancy suit for a special meal, sometimes we need a t-shirt and jeans, and a poutine.↩︎

  118. In practice, more complex databases are used.↩︎

  119. Or ‘On’ and ‘Off’, ‘TRUE’ and ‘FALSE’.↩︎

  120. Note that it also happens with small, well-organized, and easily contained projects. It happens all the time, basically.↩︎

  121. “Every model is wrong; some models are useful.” George Box.↩︎

  122. We are obviously not implying that these individuals have no ethical principles or are unethical; rather, that the opportunity to establish what these principles might be, in relation with their research, may never have presented itself.↩︎

  123. This is not to say that ethical issues have miraculously disappeared – Volkswagen, Whole Foods Markets, General Motors, Cambridge Analytica, and Ashley Madison, to name but a few of the big data science and data analysis players, have all recently been implicated in ethical lapses [110]. More dubious examples can be found in [111], [112].↩︎

  124. Truth be told, choosing wisely is probably the most difficult aspect of a data science project.↩︎

  125. How long does it take Netflix to figure out that you no longer like action movies and want to watch comedies instead, say? How long does it take Facebook to recognize that you and your spouse have separated and that you do not wish to see old pictures of them in your feed?↩︎

  126. Questions can also be asked in an unsupervised manner, see [4], [137], among others, and Quantitative Methods, briefly.↩︎

  127. Unless we’re talking about quantum physics and then all bets are off – nobody has the slightest idea why things happen the way they do, down there.↩︎

  128. According to the adage, “data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.” (C. Stoll, attributed).↩︎

  129. We could facetiously describe ontologies as “data models on steroids.”↩︎

  130. “Times change, and we change with them.” C. Huberinus↩︎

  131. What does that make the other components?↩︎

  132. A similar approach underlies most of modern text mining, natural language processing, and categorical anomaly detection. Information usually gets lost in the process, which explains why meaningful categorical analyses tend to stay fairly simple.↩︎

  133. An equation for predicting weight from height could help identifying individuals who are possibly overweight (or underweight), say.↩︎
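For instance (with simulated heights and weights, since no dataset accompanies this note), individuals with large positive residuals are heavier than the fitted equation predicts:

```r
set.seed(42)
height <- rnorm(200, mean = 170, sd = 10)            # cm (simulated)
weight <- -100 + 1.0 * height + rnorm(200, sd = 8)   # kg (simulated)

fit <- lm(weight ~ height)
coef(fit)

# individuals whose weight is well above what the equation predicts
which(resid(fit) > 2 * sd(resid(fit)))
```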

  134. In the first situation, the observations form a time series.↩︎

  135. For instance, the canonical equation \(\mathbf{X}^{\!\top}\mathbf{X}\mathbf{\beta}=\mathbf{X}^{\!\top}\mathbf{Y}\) of linear regression cannot be solved when some observations are missing, as \(\mathbf{X}^{\!\top}\mathbf{X}\) is then not defined.↩︎
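
    A minimal R sketch (on simulated data) of the issue: a single missing value propagates into \(\mathbf{X}^{\!\top}\mathbf{X}\), so the normal equations cannot be solved directly; lm() sidesteps the problem by dropping incomplete rows by default.

    ``` r
    set.seed(0)
    X <- cbind(1, matrix(rnorm(20), ncol = 2))  # design matrix with intercept
    Y <- rnorm(10)
    X[3, 2] <- NA                               # one missing predictor value
    t(X) %*% X                                  # NAs in cross-products involving column 2
    # solve(t(X) %*% X, t(X) %*% Y)             # cannot be computed as-is
    coef(lm(Y ~ X[, 2] + X[, 3]))               # na.action = na.omit drops row 3
    ```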

  136. Imputation methods work best under MCAR or MAR, but keep in mind that they all tend to produce biased estimates.↩︎

  137. And such a fantastic person – in spite of her superior intellect, she is adored by all of her classmates, thanks to her sunny disposition and willingness to help at all times. If only all students were like Mary Sue…↩︎

  138. Or to simply re-enter the final grades by comparing with the physical papers…↩︎

  139. “There ain’t no such thing as a free lunch” – there is no guarantee that a method that works best for a dataset works even reasonably well for another.↩︎

  140. Outlying observations may be anomalous along any of the individual variables, or in combination.↩︎

  141. Anomaly detection points towards interesting questions for analysts and subject matter experts: in this case, why is there such a large discrepancy in the two populations?↩︎

  142. This stems partly from the fact that once the “anomalous” observations have been removed from the dataset, previously “regular” observations can become anomalous in turn in the smaller dataset; it is not clear when that runaway train will stop.↩︎

  143. Supervised models are built to minimize a cost function; in default settings, it is often the case that the mis-classification cost is assumed to be symmetrical, which can lead to technically correct but useless solutions. For instance, the vast majority (99.999+%) of air passengers emphatically do not bring weapons with them on flights; a model that predicts that no passenger is attempting to smuggle a weapon on board a flight would be 99.999+% accurate, but it would miss the point completely.↩︎

  144. Note that normality of the underlying data is an assumption for most tests; how robust these tests are against departures from this assumption depends on the situation.↩︎

  145. The default setting only lists a limited number of categorical levels – the summary documentation will explain how to increase the number of levels that are displayed.↩︎

  146. In a real-life setting, we should **definitely** verify that this assumption is valid.↩︎

  147. We know it’s not ’cause we looked it up. It’s one of the skills we learned in grade school.↩︎

  148. Please contact the authors if you discover missing or misattributed references.↩︎

  149. This is certainly the case with the Canada Revenue Agency’s My Account service, for instance.↩︎

  150. Not only because it’s bad practice, but also because the tasks may fail due to technical difficulties.↩︎

  151. It would be easy for us to jump on the anti-spreadmart bandwagon (to be honest, we mostly agree with the sentiment), but we are not prepared to claim that Excel should NEVER, EVER be used, under any circumstance.↩︎

  152. Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.↩︎

  153. Taking into account the size of the dataset.↩︎

  154. After filing, the amount they would receive in benefits (child benefits, GST/HST credits, etc.) is larger than what they would have to pay in taxes.↩︎

  155. Pipelines with minimal delays.↩︎

  156. A data pipeline SLA (service-level agreement) is a contract between a client and the provider of a data service that is incorporated into the client pipeline.↩︎

  157. This increases efficiency, scalability and re-usability (see Big Data and Parallel Computing for a more in-depth discussion).↩︎

  158. “A data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives. The key focus areas of data governance include availability, usability, consistency, data integrity and data security, standard compliance and includes establishing processes to ensure effective data management throughout the enterprise such as accountability for the adverse effects of poor data quality and ensuring that the data which an enterprise has can be used by the entire organization.” [193]↩︎

  159. “DataOps is a set of practices, processes and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.” [194]↩︎

  160. Most databases use a structured query language (SQL) for writing and querying data. SQL statements include: create/drop/alter table; select, insert, update, delete; where, like, order by, group by, count, having; join.↩︎
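
    To make the list concrete, here is a hedged sketch that runs a few of these statements from R on an in-memory SQLite database (assuming the DBI and RSQLite packages are available; the sales table is invented for illustration).

    ``` r
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbExecute(con, "CREATE TABLE sales (region TEXT, amount REAL)")
    dbExecute(con, "INSERT INTO sales VALUES ('East', 100), ('East', 250), ('West', 80)")
    dbGetQuery(con, "SELECT region, COUNT(*) AS n, SUM(amount) AS total
                       FROM sales
                      WHERE amount > 50
                      GROUP BY region
                     HAVING total > 100
                      ORDER BY total DESC")
    dbDisconnect(con)
    ```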

  161. “Database normalization is a technique for creating database tables with suitable columns and keys by decomposing a large table into smaller logical units. The process also considers the demands of the environment in which the database resides. Normalization is an iterative process. Commonly, normalizing a database occurs through a series of tests. Each subsequent step decomposes tables into more manageable information, making the overall database logical and easier to work with.” [195]↩︎

  162. This would be akin to looking for a needle in the world’s largest haystack!↩︎

  163. This was written in August 2022; that list is liable to have changed quite a lot since then.↩︎

  164. In the sense of considering all business requirements, such as federated data sources, need for scale, critical implications of real-time data ingestion or transformation, online feature engineering, handling upgrades, monitoring, etc.↩︎

  165. Note that this is not the same thing as asking whether we should design such algorithms.↩︎

  166. Note that this is not the same as understanding why a mushroom is poisonous or edible – the data alone cannot provide an answer to that question.↩︎

  167. A mycologist could perhaps deduce the answer from these features alone, but she would be using her experience with fungi to make a prediction, and so would not be looking at the features in a vacuum.↩︎

  168. It would have, had Amanita muscaria’s habitat been ‘leaves’.↩︎

  169. The marketing team is banking on the fact that customers are unlikely to shop around to get the best deal on hot dogs AND buns, which may or may not be a valid assumption.↩︎

  170. There will be times when an interest of 0.11 in a rule would be considered a smashing success; a lift of 15 would not be considered that significant but a support of 2% would be, and so forth.↩︎

  171. The final trajectories were validated using the full sampling procedure.↩︎

  172. Again, with feeling: correlation does not imply causation.↩︎

  173. Value estimation (regression) is similar to classification, except that the target variable is numerical instead of categorical.↩︎

  174. ID3 would never be used in a deployment setting, but it will serve to illustrate a number of classification concepts.↩︎

  175. Classical performance evaluation metrics can easily be fooled; if out of two classes one of the instances is only represented in 0.01% of the instances, predicting the non-rare class will yield correct predictions roughly 99.99% of the time, missing the point of the exercise altogether.↩︎
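
    A quick simulated illustration of the point: with a severely imbalanced response, the trivial “always predict the majority class” rule scores near-perfect accuracy while detecting nothing.

    ``` r
    set.seed(1)
    y    <- sample(c("rare", "common"), size = 1e5, replace = TRUE,
                   prob = c(0.0001, 0.9999))
    pred <- rep("common", length(y))
    mean(pred == y)                    # accuracy is roughly 0.9999...
    sum(pred == "rare" & y == "rare")  # ... but no rare case is ever caught
    ```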

  176. The relative small size of the dataset should give data analysts pause for thought, at the very least.↩︎

  177. Is it possible to look at Figure 11.23 without assigning labels or trying to understand what type of customers were likely to be young and have medium income? Older and wealthier?↩︎

  178. The order in which the data is presented can play a role, as can starting configurations.↩︎

  179. To the point that the standard joke is that “it’s not necessary to be a gardener to become a data analyst, but it helps”.↩︎

  180. Note that the iris dataset has started being phased out in favour of the penguin dataset [233], for reasons that do not solely have to do with its overuse (hint: take a look at the name of the journal that published Fisher’s paper).↩︎

  181. This threshold is difficult to establish exactly, however.↩︎

  182. We could argue that the data was simply not representative – using a training set with redheads would yield a rule that would make better predictions. But “over-reporting/overconfidence” (which manifest themselves with the use of significant digits) is also part of the problem.↩︎

  183. “It’s the best data we have!” does not mean that it is the right data, or even good data.↩︎

  184. For instance, can we use a model that predicts whether a borrower will default on a mortgage or not to also predict whether a borrower will default on a car loan or not? The problem is compounded by the fact that there might be some link between mortgage defaults and car loan defaults, but the original model does not necessarily take this into account.↩︎

  185. The package is called rpart, the function… also rpart().↩︎
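
    For instance, a minimal call on the kyphosis dataset that ships with the package:

    ``` r
    library(rpart)
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
    print(fit)             # text representation of the fitted tree
    plot(fit); text(fit)   # quick-and-dirty display
    ```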

  186. There are other types, such as semi-supervised or reinforcement learning, but these are topics for future modules.↩︎

  187. The response variable \(\mathbf{Y}\) that was segregated away from \(\mathbf{X}\) in the supervised learning case could now be one of the variables in \(\mathbf{X}\).↩︎

  188. Why?↩︎

  189. In particular, if \(\widehat{Y}=f(\vec{X})\), then \(\widehat{Y}\approx Y=f(\vec{X})+\varepsilon\).↩︎

  190. The proportion must be large enough to bring the variance down.↩︎

  191. In this context, “parametric” means that assumptions are made about the form of the regression function \(f\); “non-parametric” means that no such assumptions are made.↩︎

  192. We will revisit this concept at a later stage.↩︎

  193. In reality, machine learning is simply applied optimization; the proof of this important result is outside the scope of this document (but see [220], [239] for details).↩︎

  194. Failure to do so means that the model can at best be used to describe the training dataset (which might still be a valuable contribution).↩︎

  195. Although it would be surprising if the performance on the test data performance is any good if the performance on the training data is middling. We shall see at a later stage that the training/testing paradigm can also help with problems related to overfitting.↩︎

  196. New test observations may end up assuming the same values as some of the training observations, but that is an accident of sampling or due to the reality of the scenario under consideration.↩︎

  197. Note that \(\mathbf{X}^{\!\top}\mathbf{X}\) is a \(p\times p\) matrix, which makes the inversion relatively easy to compute even when \(n\) is large.↩︎

  198. That is, when we impose structure on the learners.↩︎

  199. If \(Y\) represents the total monetary value in a piggy bank, \(X_1\) the number of coins, and \(X_2\) the number of pennies, what is likely to be the sign of \(\beta_2\) in the model \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon\)? Are \(X_1\) and \(X_2\) correlated? What would the interpretation look like, in this case?↩︎

  200. Compare with the Bayesian notion of a credible interval (see Module 18).↩︎

  201. These are distributions whose probability density functions satisfy \[f(\mathbf{x}\mid \vec{\theta})=h(\mathbf{x})g(\vec{\theta})\exp(\vec{\phi}(\vec{\theta})\cdot \vec{T}(\mathbf{x})).\] This includes the normal, binomial, Poisson, and Gamma distributions, etc. These are all distributions with conjugate priors (see Module 18).↩︎

  202. For orthonormal covariates \(\mathbf{X}^{\!\top}\mathbf{X}=I_p\), we have, in fact: \[\widehat{\beta}_{\textrm{RR},j}=\frac{\widehat{\beta}_{\textrm{OLS},j}}{1+N\lambda}.\]↩︎

  203. For orthonormal covariates, we have \[\widehat{\beta}_{\textrm{BS},j}=\begin{cases} 0 & \text{if $|\widehat{\beta}_{\textrm{OLS},j}|<\sqrt{N\lambda}$} \\ \widehat{\beta}_{\textrm{OLS},j} & \text{if $|\widehat{\beta}_{\textrm{OLS},j}|\geq\sqrt{N\lambda}$}\end{cases}\]↩︎

  204. For orthonormal covariates, we have \[\widehat{\beta}_{\textrm{L},j}=\widehat{\beta}_{\textrm{OLS},j}\cdot \max \left(0,1-\frac{N\lambda}{|\widehat{\beta}_{\textrm{OLS},j}|}\right).\]↩︎
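
    A small numerical sketch comparing the three closed-form rules above (ridge shrinkage, best-subset hard thresholding, lasso soft thresholding) on hypothetical OLS coefficients, with \(N\lambda=1\):

    ``` r
    beta_ols <- seq(-3, 3, by = 0.5)   # hypothetical OLS estimates
    Nlam     <- 1                      # hypothetical value of N * lambda
    ridge <- beta_ols / (1 + Nlam)
    bsub  <- ifelse(abs(beta_ols) >= sqrt(Nlam), beta_ols, 0)
    lasso <- beta_ols * pmax(0, 1 - Nlam / abs(beta_ols))
    round(cbind(beta_ols, ridge, bsub, lasso), 2)
    ```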

  205. Some methods make direct adjustments to the training error rate in order to estimate the test error (e.g., Mallows’ \(C_p\) statistic, \(R^2_a\), AIC, BIC, etc.).↩︎

  206. It can also provide a basis for model selection.↩︎

  207. For a regression model, there are many options but we typically use \[E_k=\sum_{i\in \mathcal{C}_k} \frac{(y_i-\widehat{y}_i)^2}{n_k}.\]↩︎
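
    A short simulated sketch of this computation, with \(E_k\) evaluated on each of 5 folds for a simple linear fit:

    ``` r
    set.seed(2)
    n <- 100
    x <- runif(n); y <- 2 + 3 * x + rnorm(n, sd = 0.5)
    fold <- sample(rep(1:5, length.out = n))       # random fold assignment
    E <- sapply(1:5, function(k) {
      fit  <- lm(y ~ x, subset = (fold != k))      # train on the other folds
      yhat <- predict(fit, newdata = data.frame(x = x[fold == k]))
      mean((y[fold == k] - yhat)^2)                # E_k = sum (y_i - yhat_i)^2 / n_k
    })
    E; mean(E)                                     # per-fold errors and their average
    ```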

  208. The estimate is usually biased, anyway.↩︎

  209. Note that the estimates for \(\beta_0\), \(\beta_1\), and \(\text{MSE}_\text{Te}\) are likely to be correlated from one fold to the next, respectively, since the respective training sets share a fair number of observations.↩︎

  210. Note that this is not as straightforward as one might think, so caution is advised.↩︎

  211. For instance, sampling with replacement at the observation level would not preserve the covariance structure of time series data.↩︎

  212. In practice, linear models have distinct advantages over more sophisticated models, mainly in the areas of superior interpretability and (frequently) appropriate predictive performance (especially for linearly separable data). These “old faithful” models will still be there if fancier deep learning models fail analysts in the future.↩︎

  213. We cannot use \(\text{SSRes}\) or \(R^2\) as metrics in this last step, as we would always select \(\mathcal{M}_p\) since \(\text{SSRes}\) decreases monotonically with \(k\) and \(R^2\) increases monotonically with \(k\). Low \(\text{SSRes}\)/high \(R^2\) are associated with a low training error, whereas the other metrics attempt to say something about the test error, which is what we are after: after all, a model is good if it makes good predictions!↩︎

  214. We are assuming that all models are OLS models, but subset selection algorithms can be used for other families of supervised learning methods; all that is required are appropriate training error estimates for step 2b and test error estimates for step 3.↩︎

  215. “When presented with competing hypotheses about the same prediction, one should select the solution with the fewest assumptions.”↩︎

  216. Thus a 95% \(\text{C.I.}\) can be built just as with polynomial and other regressions.↩︎

  217. Use colnames() or str() to list all the variables.↩︎
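
    For instance, with a built-in dataset:

    ``` r
    colnames(mtcars)   # variable names only
    str(mtcars)        # names, types, and a preview of the values
    ```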

  218. If \(h\) represents a straight line, say, the penalty term would be zero.↩︎

  219. The code that would allow for a different random sample every time the code is run has been commented out in the following code box.↩︎

  220. See Module 13.↩︎

  221. The curse of dimensionality is also in play when \(p\) becomes too large.↩︎

  222. The notion of proximity depends on the distance metric in use; the Euclidean case is the most common, but it does not have to be that one.↩︎

  223. Their number is a measure of a model’s complexity.↩︎

  224. Remember however that we have not evaluated the performance of the models on a testing set \(\text{Te}\); we have only described some of its behaviour on the training set \(\text{Tr}\).↩︎

  225. There is another way in which OLS could fail, but it has nothing to do with the OLS assumptions per se. When the set of qualitative responses contains more than \(2\) levels (such as \(\mathcal{C}=\{\text{low},\text{medium},\text{high}\}\), for instance), the response is usually encoded using numerals to facilitate the implementation of the analysis: \[Y=\begin{cases}0 & \text{if low} \\ 1 & \text{if medium} \\ 2 & \text{if high} \end{cases}\] This encoding suggests an ordering and a scale between the levels (for instance, the difference between “high” and “medium” is equal to the difference between “medium” and “low”, and half as large as the difference between “high” and “low”). OLS is not appropriate in this context.↩︎

  226. The probit transformation uses \(g_P(y^*)=\Phi(y^*)\), where \(\Phi\) is the cumulative distribution function of \(\mathcal{N}(0,1)\).↩︎

  227. Any other predictor distribution could be used if it is more appropriate for \(\text{Tr}\), and we could assume that the standard deviations or the means (or both) are identical across classes.↩︎

  228. \(p\) parameters for each \(\hat{\boldsymbol{\mu}}_k\) and \(1+2+\cdots+p\) parameters for \(\hat{\boldsymbol{\Sigma}}\).↩︎

  229. \(p\) parameters for each \(\hat{\boldsymbol{\mu}}_k\) and \(1+2+\cdots+p\) parameters for each \(\hat{\boldsymbol{\Sigma}}_k\).↩︎

  230. Recall that all supervised learning tasks are optimization problems.↩︎

  231. Multiple splitting criteria are used in practice, such as insisting that all final nodes contain 10 or fewer observations, etc.↩︎

  232. This is similar to the bias-variance trade-off or the regularization framework: a good tree balances considerations of fit and complexity.↩︎

  233. The tree is not unique, obviously, but any other tree with separators parallel to the axes will only be marginally better, at best.↩︎

  234. To be sure, we could create an intricate decision tree with \(>2^2=4\) separating lines, but that is undesirable for a well-fitted tree.↩︎

  235. Perfect separation would lead to overfitting.↩︎

  236. \(F(\mathbf{x})\) for points on \(H_{\boldsymbol{\beta},\beta_0}\).↩︎

  237. Technically speaking, we do not need to invoke the representer theorem in the linearly separable case. At any rate, the result is out-of-scope for this document.↩︎

  238. By analogy with positive definite square matrices, this means that \(\sum_{i,j=1}^Nc_ic_jK(\mathbf{x}_i,\mathbf{x}_j)\geq 0\) for all \(\mathbf{x}_i\in \mathbb{R}^p\) and \(c_j\in \mathbb{R}\).↩︎

  239. This might seem to go against reduction strategies used to counter the curse of dimensionality; the added dimensions are needed to “unfurl” the data, so to speak.↩︎

  240. In Section 13.5, we argue that it is usually preferable to train a variety of models, rather than just the one.↩︎

  241. The actual values of \(T(\mathbf{x};\boldsymbol{\alpha})\) have no intrinsic meaning, other than their relative ordering.↩︎

  242. In essence, a neural network is a function.↩︎

  243. Although grayscale images have only a single colour channel and could thus be stored in 2D tensors, by convention image tensors are always 3D, with a one-dimensional colour channel for grayscale images.↩︎

  244. The gradient is the derivative of a tensor operation; it generalizes the notion of the derivative to functions of multidimensional inputs.↩︎

  245. A beautiful animation (created by A. Radford) compares the performance of different optimization algorithms and shows that the methods usually take different paths to reach the minimum.↩︎

  246. Nobody disputes the validity of Bayes’ Theorem, and it has proven to be a useful component in various models and algorithms, such as email spam filters, and the following example, but the use of Bayesian statistics is controversial in many quarters.↩︎

  247. In other problems, the predictors could be continuous rather than discrete, in which case we would use continuous distributions instead; even in discrete case, the multinomial assumption might not be appropriate.↩︎

  248. Low variance methods, in comparison, are those for which the results, structure, predictions, etc. remain roughly similar when using different training sets, such as OLS when \(N/p\gg 1\), and are less likely to benefit from the use of ensemble learning.↩︎

  249. The AdaBoost code on the Two-Moons dataset was lifted from an online source whose location cannot be found at the moment.↩︎

  250. Formally, a kernel is a symmetric (semi-)positive definite operator \(K:\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}_0^+\). By analogy with positive definite square matrices, this means that \(\sum_{i,j=1}^Nc_ic_jK(\mathbf{x}_i,\mathbf{x}_j)\geq 0\) for all \(\mathbf{x}_i\in \mathbb{R}^p\) and all \(c_j\in \mathbb{R}_+\), and \(K(\mathbf{x},\mathbf{w})=K(\mathbf{w},\mathbf{x})\) for all \(\mathbf{x},\mathbf{w}\in \mathbb{R}^p\).↩︎

  251. Think free-ranging robots, roughly speaking.↩︎

  252. We agree that this might be a straw-man definition of “post-modernist subjectivity”, but perhaps it is not that much of one, in the end; all things being equal, we lean more toward the objective side of things, in nature and in data analysis.↩︎

  253. In this section, we borrow heavily from [3].↩︎

  254. Computing the number of such partitions in general cannot be done by elementary means, but it is easy to show that the number is bounded above by \(k^n\) (each of the \(n\) observations is assigned to one of the \(k\) clusters).↩︎

  255. Unfortunately, the clustering results depend very strongly on the initial randomization – a “poor” selection can yield arbitrarily “bad” (sub-optimal) results; \(k-\)means\(++\) selects the initial centroids so as to maximize the chance that they will be well-spread in the dataset (which also speeds up the run-time).↩︎
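
    Base R’s kmeans() does not implement \(k-\)means\(++\), but the sensitivity to initialization described above can be mitigated with multiple random starts; a hedged sketch on simulated data:

    ``` r
    set.seed(3)
    X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 4), ncol = 2))
    km1  <- kmeans(X, centers = 3, nstart = 1)    # single random initialization
    km25 <- kmeans(X, centers = 3, nstart = 25)   # best of 25 initializations
    c(km1$tot.withinss, km25$tot.withinss)        # multi-start typically does no worse
    ```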

  256. The results might look good on a 2-dimensional representation of the data, but how do we know it could not look better?↩︎

  257. With \(n\) observations, there are \(1+\cdots+(n-1)=\frac{(n-1)n}{2}\) such pairs.↩︎

  258. Note that each object has multiple dimensions, or attributes available for comparison.↩︎

  259. While we cannot forget that they are not actual apples, we will assume that this is understood and simply refer to the objects as fruit, or apples.↩︎

  260. An important consideration, from a general data science perspective, is whether the signature vector provides a sufficient description of the associated object or whether it is too crude to be of use. This is usually difficult to ascertain prior to obtaining analysis results, and comparing them to the “reality” of the underlying system (see Modules 6 and 7 for details).↩︎

  261. Keep in mind that different similarity measures may yield various results, in some cases showing the two apples to be similar, in others to be dissimilar.↩︎

  262. While the moniker “distance” harkens back to the notion of Euclidean (physical) distance between points in space, it is important to remember that the measurements refer to the distance between the associated signature vectors, which do not necessarily correspond to their respective physical locations.↩︎

  263. Or in the case of soft clustering, assign each instance a “probability” of belonging to each cluster.↩︎

  264. The similarity matrix is typically required at both stages.↩︎

  265. The specifics of that function are not germane to the current discussion and so are omitted.↩︎

  266. “Clustering validation” suggests that there is an ideal clustering result against which to compare the various algorithmic outcomes, and all that is needed is for analysts to determine how much the outcomes depart from the ideal result. “Cluster quality” is a better way to refer to the process.↩︎

  267. Given that all of them supposedly provide context-free assessments of clustering quality, that is problematic (although emblematic of unsupervised endeavours).↩︎

  268. The formula for \(\text{RI}(\mathcal{A},\mathcal{B})\) reminds one of the definition of accuracy, a performance evaluation measure for (binary) classifiers.↩︎
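
    The analogy can be made concrete: the Rand index is the proportion of observation pairs on which the two clusterings agree (a toy computation with made-up labels):

    ``` r
    A <- c(1, 1, 2, 2, 3)              # clustering A
    B <- c(1, 1, 2, 3, 3)              # clustering B
    pairs  <- combn(length(A), 2)      # all (n-1)n/2 pairs of observations
    same_A <- A[pairs[1, ]] == A[pairs[2, ]]
    same_B <- B[pairs[1, ]] == B[pairs[2, ]]
    mean(same_A == same_B)             # RI(A, B)
    ```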

  269. In a nutshell, the expected value of \(\text{RI}(\mathcal{A},\mathcal{B})\) for independent, random clusterings \(\mathcal{A}\) and \(\mathcal{B}\) is not 0 [274].↩︎

  270. Which it is emphatically not, it bears repeating.↩︎

  271. In the 4-cluster case, half a cluster seems to have been mis-assigned, for instance.↩︎

  272. These concepts are covered in just enough depth to provide an intuition about the algorithm.↩︎

  273. This cannot be the entire story, however, as we can minimize the total weight of broken edges by simply … not cutting any edges. Indeed, there are other approaches: Normalized Cut (actually used in practice), Ratio Cut, Min-Max Cut, etc.↩︎

  274. The spectral MinCut solution is not guaranteed to be the true MinCut solution, but it usually is close enough to be an acceptable approximation.↩︎

  275. For more information about this abstraction, which actually relates a variant of Kernel PCA to spectral clustering, consult [279].↩︎

  276. Since the product of symmetric matrices is not necessarily symmetric.↩︎

  277. This is not the same as the minimum cut which represents the cut that minimizes the number of edges separating two vertices, but instead represents the minimum ratio of edges across the cut divided by the number of vertices in the smaller half of the partition.↩︎

  278. DBSCAN can also fit within that framework, by picking a similarity method based on the radius that allows the graph to separate into different components. Then the multiplicity of \(\lambda_0=0\) in the Laplacian gives the number of graph components, and these can be further clustered, as above.↩︎

  279. We borrow extensively from Deng and Han’s Probabilistic Models for Clustering chapter in [4].↩︎

  280. This notation can be generalized to fuzzy clusters: the cluster signature of \(\mathbf{x}_j\) is \[\mathbf{z}_j\in [0,1]^k,\quad \|\mathbf{z}_j\|_2=1;\] if \(\mathbf{z}_j=(0,0,\tfrac{1}{\sqrt{2}},\tfrac{1}{\sqrt{2}},0),\) say, then we would interpret \(\mathbf{x}_j\) as belonging equally to clusters \(C_3\) and \(C_4\) or as having probability \(1/2\) of belonging to either \(C_3\) or \(C_4\).↩︎

  281. The mclust vignette contains more information.↩︎
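
    A minimal call, assuming the package is installed (the covariance model and the number of components are selected via BIC):

    ``` r
    library(mclust)
    fit <- Mclust(faithful)               # built-in 2-variable dataset
    summary(fit)                          # selected model and number of clusters
    plot(fit, what = "classification")
    ```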

  282. As candidate exemplars are themselves observations, we can also compute self-responsibility: \(r(k,k) \leftarrow s(k,k)-\max_{k'\neq k} \{s(k,k')\}.\)↩︎

  283. The centroid of the \(\ell\)th cluster is the weighted average of ALL observations by the degree to which they belong to cluster \(\ell\).↩︎

  284. One major challenge with hypergraph partitioning is that a hyperedge can be “broken” by a partitioning in many different ways, not all of which are qualitatively equivalent. Most hypergraph partitioning algorithms use a constant penalty for breaking a hyperedge.↩︎

  285. The distribution of the membership of different instances to the meta-partitions can be used to determine its meta-cluster membership, or soft assignment probability.↩︎

  286. This simple assumption is rather old-fashioned and would be disputed by many in the age of hockey analytics, but let it stand for now.↩︎

  287. Unfortunately for this lifelong Sens fan, it most definitely would…↩︎

  288. This section also serves as an introduction to Text Analysis and Text Mining.↩︎

  289. An entire field of statistical endeavour – statistical survey sampling – has been developed to quantify the extent to which the sample is representative of the population, see Survey Sampling Methods.↩︎

  290. For instance, if we are interested in predicting the number of passengers per flight leaving YOW (Macdonald-Cartier International Airport) and the total population of passengers is sampled, then the sampled number of passengers per flight is necessarily below the actual number of passengers per flight. Estimation methods exist to overcome these issues.↩︎

  291. The situation may not be as stark if the observations are not i.i.d., but the principle remains the same – in high-dimensional spaces, it is harder for observations to be near one another than it is so in low-dimensional spaces.↩︎

  292. Although there are scenarios where it could be those “small” axes that are more interesting – such as is the case with the “pancake stack” problem.↩︎

  293. If some of the eigenvalues are 0, then \(r<p\), and vice-versa, implying that the data was embedded in an \(r-\)dimensional manifold to begin with.↩︎

  294. Which we assume encompasses all of this work’s readership…↩︎

  295. This error reconstruction approach to PCA yields the same results as the covariance approach of the previous section [2].↩︎

  296. These kernels also appear in support vector machines (see Section 13.4.2).↩︎

  297. Excluding \(\mathbf{x}_i\) itself.↩︎

  298. As with LLE, the edges of \(\mathcal{G}\) can be obtained by finding the \(k\) nearest neighbours of each node, or by selecting all points within some fixed radius \(\varepsilon\).↩︎

  299. The first component in the similarity metric measures how likely it is that \(\mathbf{x}_i\) would choose \(\mathbf{x}_j\) as its neighbour if neighbours were sampled from a Gaussian centered at \(\mathbf{x}_i\), for all \(i,j\).↩︎

  300. This usually requires there to be a value to predict, against which the features can be evaluated for relevance; we will discuss this further in Regression and Value Estimation and Spotlight on Classification.↩︎

  301. Either a threshold on the ranking or on the ranking metric value itself.↩︎

  302. This can be quite difficult to determine.↩︎

  303. As filtering is a pre-processing step, proper analysis would also require building a model using this subset of features.↩︎

  304. For instance, for a \(p-\)distance \(\delta\), set \[H^{\delta}(x_{i,j})=\arg\min_{\pi_j(\mathbf{z})}\left\{\delta(\mathbf{x}_i,\mathbf{z})\mid \text{class}(\mathbf{x}_i)=\text{class}(\mathbf{z})\right\}\] and \[M^{\delta}(x_{i,j})=\arg\min_{\pi_j(\mathbf{z})}\left\{\delta(\mathbf{x}_i,\mathbf{z})\mid \text{class}(\mathbf{x}_i)\neq \text{class}(\mathbf{z})\right\}.\]↩︎

  305. Matrix factorization techniques have applications to other data analytic tasks; notably, they can be used to impute missing values and to build recommender systems.↩︎

  306. Each singular value is the principal square root of the corresponding eigenvalue of the covariance matrix \(\mathbf{X}^{\!\top}\mathbf{X}\) (see Section 15.2.3).↩︎
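
    A quick numerical check of this relationship on a simulated, column-centred matrix:

    ``` r
    set.seed(4)
    X   <- scale(matrix(rnorm(50), ncol = 5), center = TRUE, scale = FALSE)
    sv  <- svd(X)$d                    # singular values of X
    eig <- eigen(t(X) %*% X)$values    # eigenvalues of X^T X
    all.equal(sv^2, eig)               # TRUE, up to numerical tolerance
    ```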

  307. Sparse vectors whose entries are 0 or 1, based on the identity of the words and POS tags under consideration.↩︎

  308. “Ye shall know a word by the company it keeps”, as the distributional semantics saying goes. The term “kumipwam” is not found in any English dictionary, but its probable meaning as “a small beach/sand lizard” could be inferred from its presence in sentences such as “Elowyn saw a tiny scaly kumipwam digging a hole on the beach”. It is easy to come up with examples where the context is ambiguous, but on the whole the contextual approach has proven itself to be mostly reliable.↩︎

  309. The problem of selecting \(M\) is tackled as it is in PCA regression.↩︎

  310. Or that their interactions are negligible.↩︎

  311. We have encountered some of these concepts in Section 14.4.2.↩︎

  312. One can think of this as the “reach” of each point.↩︎

  313. For image processing, this kernel is often used with \(\alpha=c=1\).↩︎

  314. It is often used in high-dimensional applications such as text mining.↩︎

  315. In certain formulations, the entries of the adjacency matrix \(A\) are instead defined to take on the value 1 or 0, depending on whether the similarity between the corresponding observations is greater than (or smaller than) some pre-determined threshold \(\tau\).↩︎

  316. Remember, the eigenvectors act as functions in this viewpoint. For a given eigenvalue \(\lambda_j\), the contour value at each point \(\mathbf{x}_i\) is the value of the associated eigenvector \(\xi_j\) in the \(i^{\text{th}}\) position, namely \(\xi_{j,i}\). For any point \(\mathbf{x}\) not in the dataset, the contour value is given by averaging the \(\xi_{j,k}\) of the observations \(\mathbf{x}_k\) near \(\mathbf{x}\), inversely weighted by the distances \(\|\mathbf{x}_k-\mathbf{x}\|\).↩︎

  317. As a reminder, the eigenvalues themselves are ordered in increasing sequence: for the current example, \[\lambda_{1}=0 \leq \lambda_{2}=1.30 \times 10^{-2}\leq \lambda_{3}= 3.94\times 10^{-2} \leq \cdots\leq\lambda_{20}=2.95\leq\cdots\]↩︎

  318. In the remainder of this section, the subscript is dropped. Note that \(q\) is assumed, not found by the process.↩︎

  319. Careful: the correct Python package to install is umap-learn, not umap.↩︎

  320. Outlying observations may be anomalous along any of the individual variables, or in combinations of variables.↩︎

  321. Which, by the way, should always be seen as a welcomed development.↩︎

  322. Note that normality of the underlying data is an assumption for most tests; how robust these tests are against departures from this assumption depends on the situation.↩︎

  323. Before carrying out seasonal adjustment, it is important to identify and pre-adjust for structural breaks (using the Chow test, for instance), as their presence can give rise to severe distortions in the estimation of the Trend and Seasonal effects. Seasonal breaks occur when the usual seasonal activity level of a particular time reporting unit changes in subsequent years. Trend breaks occur when the trend in a data series is lowered or raised for a prolonged period, either temporarily or permanently. Sources of these breaks include changes in government policies, strike actions, exceptional events, inclement weather, etc.↩︎

  324. X12 is implemented in SAS and R, among other platforms.↩︎

  325. The simplest way to determine whether to use multiplicative or additive decomposition is by graphing the time series. If the size of the seasonal variation increases/decreases over time, multiplicative decomposition should be used; if the seasonal variation appears constant over time, additive decomposition should be used instead. A pseudo-additive model should be used when the data exhibits the characteristics of a multiplicative series, but parameter values are close to zero.↩︎

  326. Nevertheless, the analyst for whom the full picture is important might want to further evaluate the algorithm with the help of the Matthews Correlation Coefficient [320] or the specificity \(s=\frac{\text{TN}}{\text{FP}+\text{TN}}\).↩︎
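
    For reference, both quantities computed from a hypothetical confusion matrix:

    ``` r
    TP <- 40; FP <- 10; FN <- 5; TN <- 945   # hypothetical counts
    specificity <- TN / (FP + TN)
    mcc <- (TP * TN - FP * FN) /
      sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    c(specificity = specificity, MCC = mcc)
    ```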

  327. This is not the case for the parameters in general clustering algorithms: if the elements of \(D\) are \(n-\)dimensional, the only restriction is that \(m\geq n+1\) (larger values of \(m\) allow for better noise identification).↩︎

  328. While we are on the topic, regression with a categorical response (with more than two levels) is called multinomial logistic regression.↩︎

  329. What does this assume, if anything at all, about the features’ independence?↩︎

  330. Strictly speaking, the AVF score would be minimized when each of the observation’s features’ levels occurs zero times in the dataset, but then … the observation would not actually be in the dataset.↩︎
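
    A bare-bones sketch of an AVF-style score (using level proportions rather than raw counts) on a made-up categorical dataset; low scores flag candidate anomalies:

    ``` r
    df <- data.frame(colour = c("red", "red", "blue", "green", "red"),
                     shape  = c("square", "round", "round", "round", "square"))
    freq <- sapply(df, function(col) as.numeric(table(col)[col]) / length(col))
    avf  <- rowMeans(freq)        # average frequency of each observation's levels
    df[order(avf), ]              # rarest combinations first
    ```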

  331. The available methods are all methods that we have not discussed: HDoutliers() from the package HDoutliers, FastPCS() from the package FastPCS, mvBACON() from robustX, adjOutlyingness() and covMcd() from robustbase, and DetectDeviatingCells() from cellWise.↩︎

  332. It is EXTREMELY IMPORTANT that these flaws not simply be swept under the carpet; they need to be addressed, and the analysis outcomes that result must be presented or reported on with an appropriate caveat.↩︎

  333. The R equivalent is rvest; we will not describe how to use it, but you are strongly encouraged to read up on this versatile tool and to use it in the Exercises.↩︎
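
    A minimal rvest sketch (the URL is a placeholder, and the selectors depend entirely on the page being scraped):

    ``` r
    library(rvest)
    page   <- read_html("https://example.com/some-table-page")   # placeholder URL
    tables <- page |> html_elements("table") |> html_table()     # parse HTML tables
    str(tables, max.level = 1)
    ```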

  334. A full list of R API libraries can be found here.↩︎

  335. Wikipedia is a commonly-used source of data on various topics (in a first pass, at the very least), but it should probably not be your ONLY source of information.↩︎

  336. Such as would be used in mathematical reasoning.↩︎

  337. Modern Bayesian statistics is still based on formulating probability distributions to express uncertainty about unknown quantities. These can be underlying parameters of a system (induction) or future observations (prediction). Bayesian statistics is a system for describing epistemological uncertainty using the mathematical language of probability; Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result with a probability distribution on the parameters of the model and on unobserved quantities (such as predictions).↩︎

  338. The integral of these priors over the positive quadrant is infinite.↩︎

  339. We use a different seed, so the charts are slightly different, but the main ideas hold.↩︎

  340. Would we expect there to be more bills in circulation, given these observations, in the brittle case or the simple case?↩︎

  341. We use a different seed, so the charts are slightly different, but the main ideas hold.↩︎

  342. We will work with the logarithms of all quantities, so that the likelihood is a sum and not a product as would usually be the case.↩︎

  343. The algorithm may be used to sample from any integrable function.↩︎

  344. In the worst case scenario, \(M\) would have to be smaller than the total amount of wealth available to humanity throughout history, although in practice \(M\) should be substantially smaller. Obviously, a different argument will need to be made in the case \(M=\infty\).↩︎