# References

*R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. O’Reilly Media, 2017.

*The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd ed. Springer, 2008.

*An Introduction to Statistical Learning: With Applications in R*. Springer, 2014.

*Data Clustering: Algorithms and Applications*. CRC Press, 2014.

*Data Mining: The Textbook*. Cham: Springer, 2015.

*Data Classification: Algorithms and Applications*. CRC Press, 2015.

*The Science of Discworld*. Ebury Publishing, 2002.

*R in Action*, 2nd ed. Manning, 2015.

*Using R and RStudio for Data Management, Statistical Analysis, and Graphics*, 2nd ed. Taylor & Francis, 2015.

*R Programming for Data Science*. Lulu.com, 2012.

*A Learning Guide to R*. Scribd, 2017.

*Python Data Science Handbook : Essential tools for working with data*. Sebastopol, CA: O’Reilly Media, Inc, 2016.

*Python for Data Analysis : Agile tools for real-world data*. Sebastopol, CA: O’Reilly, 2013.

*Data Wrangling with Python: Tips and tools to make your life easier*. O’Reilly Media, 2016.

*R Markdown Cookbook*. Boca Raton, Florida: Chapman & Hall/CRC, 2020.

*Introduction to Linear Optimization*, 1st ed. Athena Scientific, 1997.

*Nonlinear Programming*. Athena Scientific, 1999.

*Math. Program.*, vol. 112, no. 1, pp. 3–44, Mar. 2008.

*Computational mathematical programming*, 1985, pp. 25–53.

*Ann. Oper. Res.*, vol. 221, no. 1, pp. 273–283, 2014.

*Journal of Productivity Analysis*, vol. 22, no. 1, pp. 143–161, 2004.

*Data Science Report Series*, 2020.

*Probability Theory: the Logic of Science*. Cambridge Press, 2003.

*Foundations of The Theory of Probability*. Chelsea Publishing Company, 1933.

*Probability and Statistics for Engineers and Scientists*, 8th ed. Pearson Education, 2007.

*Probability and Statistical Inference*, 7th ed. Pearson/Prentice Hall, 2006.

*The Analysis of Variance: Fixed, Random and Mixed Models*. Birkhäuser, 2000.

*Applied Linear Statistical Models*. McGraw Hill Irwin, 2004.

*Nonparametric Statistical Methods*, 2nd ed. Wiley, 1999.

*Practical Statistics for Data Scientists: 50 Essential Concepts*. O’Reilly, 2017.

*Data Analysis: A Bayesian Tutorial (2nd ed.)*. Oxford Science, 2006.

*Statistical Computing with R*. CRC Press, 2007.

*Statistics Done Wrong: the Woefully Complete Guide*. No Starch Press, 2015.

*Data analysis: A Bayesian tutorial (2nd ed.)*. Oxford Science, 2006.

*Statistics in Biopharmaceutical Research*, vol. 13, no. 1, pp. 6–18, 2021.

*Survey Methods and Practices, Catalogue no.12-587-X*. Statistics Canada.

*Survey Methodology*, vol. 19, no. 1, pp. 81–94, 1993.

*Advanced sampling methods*. Springer Nature Singapore, 2021.

*Méthodes de sondage pour les enquêtes statistiques agricoles*. Rome: FAO. Développement Statistique.

*Sampling: Design and Analysis*. Duxbury Press, 1999.

*Gödel, Escher, Bach: an Eternal Golden Braid*. New York, NY: Basic Books, 1979.

*Managing the Professional Services Firm*. Free Press, 1993.

*Marketing Your Services: For People Who Hate to Sell*. McGraw-Hill, 2002.

*The Trusted Advisor*. Free Press, 2001.

*Mastering Effective English*, 4th ed. Copp Clark Professional, 1980.

*American Scientist*, vol. 78, 1990.

*Style: Ten Lessons in Clarity and Grace*. Pearson, 2004.

*The Bedford Handbook*, 9th ed. Bedford, 2013.

*Data Mining with R*, 2nd ed. CRC Press, 2016.

*Harvard Business Review*, Oct. 2012.

*The Telegraph*, May 2018.

*PLOS Medicine*, vol. 15, no. 11, pp. 1–19, 2018, doi: 10.1371/journal.pmed.1002699.

*Slashdot.com*, Oct. 2018.

*Newswire.com*, Jun. 2015.

*Curbed*, May 2017.

*MIT Technology Review*, Dec. 2018.

*Inverse*, Jul. 2018.

*Science Daily*, Sep. 2015.

*ZDNet*, Oct. 2013.

*ABC Science*, Oct. 2018.

*Washington Post*, Nov. 2018.

*The Atlantic*, Oct. 2018.

*Reuters*, Oct. 2018.

*New York Times*, Dec. 2018.

*Wall Street Journal*, Nov. 2018.

*South China Morning Post*, May 2017.

*Smithsonian Magazine*, Mar. 2016.

*CNET*, Sep. 2017.

*TechCrunch*, Sep. 2017.

*Against the grain: A deep history of the earliest states*. New Haven: Yale University Press, 2017.

*Business Insider*, Dec. 2015.

*Doing Data Science: Straight Talk from the Front Line*. O’Reilly, 2013.

*Data Science Report Series*, 2020.

*Harvard Business Review*, Nov. 2015.

*Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown, 2016.

*The Nation*, Sep. 2013.

*Understanding the Foundations of Ethical Reasoning*, 2nd ed. Foundation for Critical Thinking, 2006.

*The Transparent Society: Will Technology Force Us to Choose Between Privacy and Freedom?* Perseus, 1998.

*The Expanse*. Orbit Books, 2011–2021.

*ACM SIGCAS Computers and Society*, vol. 45, no. 3, pp. 118–125, 2015.

*The Independent*, Apr. 2017.

*The Root*, 2016.

*Foundation series*. Gnome Press, Spectra, Doubleday, 1942–1993.

*Nature*, vol. 535, 2016.

*ICWSM*.

*et al.*, “Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients,” *Nature Communications*, vol. 5, 2014, doi: 10.1038/ncomms5022.

*Real World Data Mining Applications*, Cham: Springer International Publishing, 2015, pp. 221–245. doi: 10.1007/978-3-319-07812-0_12.

*J. Mach. Learn. Res.*, vol. 7, pp. 1963–2001, Dec. 2006.

*Int. J. Sen. Netw.*, vol. 8, no. 3/4, pp. 202–208, Oct. 2010, doi: 10.1504/IJSNET.2010.036195.

*ICADIWT*, 2014, pp. 207–212.

*et al.*, “Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease,” *NeuroImage*, vol. 50, no. 1, pp. 162–174, 2010.

*Journal of the American Medical Informatics Association*, vol. 5, no. 4, pp. 373–381, Jul. 1998, doi: 10.1136/jamia.1998.0050373.

*Journal of Personality and Social Psychology*, vol. 114, no. 2, pp. 246–257, Feb. 2018.

*KDnuggets.com*, 2017.

*Variance Explained*, Jan. 2018.

*Forbes*, Mar. 2012.

*Data Science for Business*. O’Reilly, 2015.

*et al.*, “An improved ontological representation of dendritic cells as a paradigm for all cell types,” *BMC Bioinformatics*, 2009.

*et al.*, “Mapping the electrostatic force field of single molecules from high-resolution scanning probe images,” *Nature Communications*, vol. 7, no. 11560, 2016.

*Practical Data Visualization*. Data Action Lab/Quadrangle, 2022.

*Reinforcement Learning: an Introduction*. MIT Press, 2018.

*Deep Learning*. MIT Press, 2016.

*Data Science Report Series*, 2020.

*A missing information principle: Theory and applications*. University of California Press, 1972.

*Survey Methodology*, vol. 27, no. 1, pp. 85–95, 2001.

*Flexible imputation of missing data*. CRC Press, 2012.

*Multiple imputation for nonresponse in surveys*. Wiley, 1987.

*Data Science Report Series*, 2020.

*Data Science Report Series*, 2007.

*Data Science Report Series*, 2020.

*Beautiful Evidence*. Graphics Press, 2008.

*Lexical Distance of European Languages*. Etymologikon, 2008.

*The Functional Art*. New Riders, 2013.

*The Truthful Art*. New Riders, 2016.

*FlowingData*.

*Design for Information*. Rockport, 2013.

*List of Physical Visualizations and Related Artifacts*.

*Cause and Effect*, D. Lerner, Ed. New York: Free Press, 1965, pp. 75–98.

*Proc R Soc Med*, vol. 58, no. 5, pp. 295–300, 1965.

*Data Fluency: Empowering Your Organization with Effective Data Communication*. Wiley, 2014.

*A Guide to Creating Dashboards People Love to Use*. (ebook).

*The Big Book of Dashboards*. Wiley, 2017.

*The Visual Display of Quantitative Information*. Graphics Press, 2001.

*Storytelling with Data*. Wiley, 2015.

*ggplot2: Elegant Graphics for Data Analysis*. Springer, 2021.

*Journal of Computational and Graphical Statistics*, no. 19, pp. 3–28, 2009.

*Journal of Statistical Software*, vol. 59, no. 10, 2014.

*R Graphics Cookbook*. O’Reilly, 2013.

*Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale*. O’Reilly Media, 2018.

*Foundations for Architecting Data Solutions: Managing Successful Data Projects*. O’Reilly Media, 2018.

*Designing Data-Intensive applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems*. O’Reilly Media, 2017.

*Database Design*. BCCampus, 2014.

*Bayesian Reasoning and Machine Learning*. Cambridge Press, 2012.

*Predictive analytics: The power to predict who will click, buy, lie or die*. Predictive Analytics World, 2016.

*IEEE Transactions on Knowledge and Data Engineering*, vol. 15, no. 1, pp. 57–69, 2003, doi: 10.1109/TKDE.2003.1161582.

*Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems*, 1998, pp. 18–24. doi: 10.1145/275487.275490.

*Inf. Syst.*, vol. 29, no. 4, pp. 293–313, Jun. 2004, doi: 10.1016/S0306-4379(03)00072-3.

*CoRR*, vol. abs/0803.0966, 2008.

*Towards Data Science*, Oct. 2020.

*Mining of Massive Datasets*. Cambridge Press, 2014.

*Kaggle.com*, 2016.

*et al.*, “Click fraud detection: Adversarial pattern recognition over 5 years at Microsoft,” in *Annals of Information Systems (special issue on data mining in real-world applications)*, Springer, 2015, pp. 181–201. doi: 10.1007/978-3-319-07812-0.

*et al.*, “Detection of anomalous particles from Deepwater Horizon oil spill using SIPPER3 underwater imaging platform,” in *Data Mining Case Studies IV, Proceedings of the 11th IEEE International Conference on Data Mining*, Vancouver, BC: IEEE, 2011.

*Marketing Science Conference*, 2005.

*Statistical Learning with Sparsity: The LASSO and Generalizations*. CRC Press, 2015.

*Data Action Lab Blog*, 2019.

*Le choix bayésien - principes et pratique*. Springer-Verlag France, 2006.

*Large Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction*. Cambridge University Press, 2010.

*Surviving a Disaster*, in *Numsense!* algobeans, 2016.

*Neural Computation*, vol. 8, no. 7, pp. 1341–1390, 1996, doi: 10.1162/neco.1996.8.7.1341.

*IEEE Transactions on Evolutionary Computation*, vol. 9, no. 6, pp. 721–735, 2005, doi: 10.1109/TEVC.2005.856205.

*Statistical Models in S*. Wadsworth & Brooks/Cole, 1992.

*ACM Trans. Database Syst.*, vol. 42, no. 3, Jul. 2017, doi: 10.1145/3068335.

*Scientific American (Online)*, Sep. 2016.

*Complex Adapt. Syst. Model.*, vol. 4, p. 8, 2016, doi: 10.1186/s40294-016-0020-0.

*et al.*, “A comparison of antioxidant, antibacterial, and anticancer activity of the selected thyme species by means of hierarchical clustering and principal component analysis,” *Acta Chromatographica*, vol. 28, no. 2, pp. 207–221, 2016, doi: 10.1556/achrom.28.2016.2.7.

*ICADIWT*, 2014, pp. 207–212.

*Proceedings of the 19th ACM SIGSPATIAL international conference on advances in geographic information systems*, 2011, pp. 357–360. doi: 10.1145/2093973.2094022.

*Computational science and its applications - ICCSA 2011*, 2011, pp. 454–465.

*ClusterCrit: Clustering Indices*. 2018.

*ICWSM*, 2011.

*Annals of Eugenics*, vol. 7, no. 7, pp. 179–188, 1936.

*Towards Data Science*, Jun. 2020.

*Bad Data Handbook*. O’Reilly, 2013.

*Business Intelligence and Data Mining*. Business Expert Press, 2015.

*Hands on Machine Learning with R*. CRC Press.

*Factfulness: Ten reasons we’re wrong about the world - and why things are better than you think*. Hodder & Stoughton, 2018.

*The health and wealth of nations*. Gapminder Foundation, 2012.

*IEEE Transactions on Evolutionary Computation*, 1997.

*Journal of Technometrics*, vol. 8, no. 4, pp. 625–629, Nov. 1966.

*Statistical Learning with Sparsity : the LASSO and Generalizations*. CRC Press, 2015.

*Deep Learning with Python*, 1st ed. USA: Manning Publications Co., 2017.

*Journal of Artificial Intelligence Research*, vol. 16, pp. 321–357, 2002.

*Tree-Based Machine Learning Algorithms: Decision Trees, Random Forests, and Boosting*. CreateSpace Independent Publishing Platform, 2017.

*Annals of Statistics*, vol. 36, no. 3, pp. 1171–1220, 2008.

*Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions*. CRC Press/Chapman & Hall, 2013.

*Mind*, 1950.

*AI Expert*, vol. 2, no. 12, pp. 46–52, Dec. 1987.

*ICLR*, 2015.

*PLOS ONE*, vol. 10, no. 11, pp. 1–21, Nov. 2015, doi: 10.1371/journal.pone.0141357.

*Annals of Statistics*, vol. 28, p. 2000, 1998.

*Towards Data Science*, Feb. 2021.

*Towards Data Science*, Mar. 2021.

*Data Science Report Series*, 2021.

*International Journal of Global Warming*, vol. 11, p. 38, Jan. 2017, doi: 10.1504/IJGW.2017.080989.

*IET Software*, vol. 11, Jun. 2017, doi: 10.1049/iet-sen.2016.0261.

*Pattern Recognition*, vol. 43, pp. 445–456, Feb. 2010, doi: 10.1016/j.patcog.2009.03.004.

*Transportation Research Part E: Logistics and Transportation Review*, vol. 43, pp. 687–709, Nov. 2007, doi: 10.1016/j.tre.2006.04.004.

*Environment and Planning B: Planning and Design*, vol. 36, pp. 865–882, Sep. 2009, doi: 10.1068/b34111t.

*Food Science and Technology Research*, vol. 8, pp. 281–285, Aug. 2002, doi: 10.3136/fstr.8.281.

*Transportation Research Procedia*, vol. 22, pp. 265–274, Dec. 2017, doi: 10.1016/j.trpro.2017.03.033.

*J. Classif.*, vol. 32, no. 1, pp. 46–62, Apr. 2015, doi: 10.1007/s00357-015-9167-1.

*Stat. Anal. Data Min.*, vol. 3, pp. 209–235, 2010.

*Cognitive Science*, vol. 34, 2012.

*J. Mach. Learn. Res.*, vol. 11, pp. 2837–2854, Dec. 2010.

*Foundations and Trends in Machine Learning*, vol. 2, no. 3, pp. 235–274, 2010, doi: 10.1561/2200000008.

*Inf. Retr.*, vol. 12, no. 5, p. 613, 2009.

*Advances in Neural Information Processing Systems (NIPS 2002): 2002*, Jun. 2003.

*Proceedings of the 14th international conference on neural information processing systems: Natural and synthetic*, 2001, pp. 849–856.

*Stat. Comput.*, vol. 17, no. 4, pp. 395–416, 2007.

*Pattern Recognition*, vol. 43, no. 12, pp. 4069–4076, 2010, doi: 10.1016/j.patcog.2010.06.015.

*Advances in Neural Information Processing Systems*, 2005, vol. 17.

*Ph.D. Thesis*, Jan. 2009.

*Science (New York, N.Y.)*, vol. 315, pp. 972–976, Mar. 2007, doi: 10.1126/science.1136800.

*Environmental Microbiome*, vol. 15, no. 16, 2020.

*1978 IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes*, pp. 761–766, 1978.

*Finding Groups in Data: An Introduction to Cluster Analysis*. John Wiley, 1990.

*Electronics Letters*, vol. 57, no. 21, pp. 792–794, 2021.

*Proceedings of the 2005 American Control Conference*, 2005, pp. 1120–1125.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 13, no. 8, pp. 841–847, 1991.

*Science*, vol. 290, no. 5500, p. 2319, 2000.

*Science*, vol. 290, no. 5500, pp. 2323–2326, 2000.

*Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis*. Springer International Publishing, 2015.

*towardsdatascience.com*, 2018.

*Neural Network Methods for Natural Language Processing*. Morgan & Claypool, 2017.

*Spectral Feature Selection for Data Mining*. CRC Press, 2011.

*Advances in neural information processing systems*, 2005, vol. 18.

*Learning Theory and Kernel Machines*, 2003, pp. 144–158.

*The Princeton Companion to Mathematics*. Princeton University Press, 2008.

*arXiv preprint*, 2018.

*Anomaly Detection Principles and Algorithms*. Springer, 2017.

*Complexity*, 2019, doi: 10.1155/2019/8460934.

*et al.*, “Mining chemical activity status from high-throughput screening assays,” *PLoS ONE*, vol. 10, no. 12, 2015, doi: 10.1371/journal.pone.0144426.

*Outlier Analysis*. Springer International Publishing, 2016.

*Artif. Intell. Rev.*, vol. 22, no. 2, pp. 85–126, 2004.

*SIGMOD Rec.*, vol. 30, no. 2, pp. 37–46, 2001, doi: 10.1145/376284.375668.

*2008 IEEE 24th International Conference on Data Engineering Workshop*, 2008, pp. 600–603. doi: 10.1109/ICDEW.2008.4498387.

*Outlier Ensembles: An Introduction*. Springer International Publishing, 2017.

*Proceedings of the Eighth IEEE International Conference on Data Mining*, 2008, pp. 413–422.

*IEEE Transactions on Knowledge and Data Engineering*, 2019.

*Proceedings of the Second International Conference on Knowledge Discovery and Data Mining*, 1996, pp. 226–231.

*Advances in Knowledge Discovery and Data Mining*, 2013, pp. 160–172.

*SIGMOD Rec.*, vol. 29, no. 2, pp. 93–104, 2000.

*Proceedings of the Thirtieth International Conference on Very Large Data Bases*, 2004, pp. 1265–1268.

*Advances in Knowledge Discovery and Data Mining*, 2009, pp. 831–838.

*2012 IEEE 12th International Conference on Data Mining*, 2012, pp. 529–538. doi: 10.1109/ICDM.2012.112.

*Proceedings of the 19th ACM International Conference on Information and Knowledge Management*, 2010, pp. 1629–1632.

*Proceedings of the 2011 IEEE 27th International Conference on Data Engineering*, 2011, pp. 434–445. doi: 10.1109/ICDE.2011.5767916.

*Journal of the American Statistical Association*, vol. 88, pp. 284–297, 1993.

*Machine Learning with Go*. Packt Publishing, 2017.

*Knowl. Inf. Syst.*, vol. 11, no. 1, pp. 45–84, Jan. 2007.

*Advances in Knowledge Discovery and Data Mining*, 2006, pp. 567–576.

*Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining*, 2005, pp. 157–166. doi: 10.1145/1081870.1081891.

*Journal de la Société Française de Statistique*, vol. 159, no. 3, pp. 1–39, 2018.

*Advances in Web-Age Information Management*, 2005, pp. 632–637.

*Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining*, 2nd ed. Wiley Publishing, 2015.

*Web Scraping with Python: Collecting Data From the Modern Web*, 2nd ed. O’Reilly Media, 2018.

*Phil. Trans. of the Royal Soc. of London*, vol. 53, pp. 370–418, 1763.

*American Journal of Physics*, vol. 14, no. 1, 1946.

*The Signal and the Noise*. Penguin, 2012.

*Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.)*. Academic Press, 2011.

*Bayesian data analysis (3rd ed.)*. CRC Press, 2013.

*Trends Cogn.Sci.*, 2006.

*Introduction to Bayesian Data Analysis (course notes)*. Department of Statistics, University of South Carolina, 2014.

*Nyt Tidsskrift for Matematik B*, 1909.

*Queueing Theory and Applications*, 2nd ed. PWS/Kent Publishing, 2002.

*Queueing Systems, Volume I*. Wiley, 1974.

*Introduction to Probability Models*, 11th ed. San Diego, CA, USA: Academic Press, 2014.

*Applications of Queueing Theory*. Springer Netherlands, 2013.

*The Annals of Mathematical Statistics*, vol. 24, no. 3, pp. 338–354, 1953, doi: 10.1214/aoms/1177728975.

*Operations Research: Applications and Algorithms*. Cengage Learning, 2022.

*Interfaces*, vol. 4, no. 4, pp. 47–51, Aug. 1974.

In all cases, we have attempted to properly cite and give credit where it is due. Get in touch if you find omissions!↩︎

We’re not saying that we won’t be adding examples in different languages in the future, but let’s not get ahead of ourselves.↩︎

In the parlance of the field, let us simply say that some of the details are left as an exercise for the reader (and can also be found in the numerous references).↩︎

Most programmers do not consider `R` to be a programming language. If they are feeling generous, they might dub it a scripting language, at best. But it gets the job done for data analysis purposes.↩︎

The proposed solution does not need to be final.↩︎

Consider the change from `Python` 2 to `Python` 3 as a cautionary tale.↩︎

**Fair warning:** some coder communities can be … let us say, not overly welcoming of neophytes. It is not unusual for the answer to a question to be some variation on “look it up in the documentation”. While this can be true in a general sense, such an answer is useless. We all know that things can be looked up in the documentation. And we all know that some users ask questions without taking the time to think about things, or in the hope that somebody else will do their work for them. It is in the best interest of learners to seek communities that make a concerted effort to be healthy and inclusive, and to recognize that not every user has reached the same proficiency level. Such communities are plentiful online; do not waste any time and energy on gatekeepers.↩︎

There are 3 other such symbols, but no language needs 5 assignment operators (let alone 2), so we will not introduce them here.↩︎

That can cause unforeseen difficulties, as it is not always easy to distinguish visually between a real number (*numeric*) and an *integer*. Furthermore, the digits of a number can be represented as character strings in some cases.↩︎

The `read.ssd()` function will only work if SAS is installed locally, however.↩︎

This is not a very interesting function, as the standard multiplication `*` is already defined in `R`, but it serves as an illustration of the functionality.↩︎

See [1] for everything there is to know about pipelines and tidy data.↩︎

We do not explicitly state the `dplyr::xyz` dependency since we already had to load the `dplyr` package to gain access to the pipeline operator `|>` in the first place.↩︎

Note that these examples require `Python 3.5` or higher.↩︎

`range` provides an example of an iterable. One way to think of an iterable is that it provides a mechanism for generating a sequence of elements one at a time. The benefit is that `range(100000)`, for example, does not take up much computation time since no actual element is generated until it is iterated over.↩︎

The function is anonymous because it has no name.↩︎
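This laziness is easy to observe directly; a minimal sketch in Python (the sizes and bounds below are arbitrary choices):

```python
import sys

# A range object only stores start, stop, and step; creating one does not
# materialize its elements, so its memory footprint is independent of length.
small = range(10)
large = range(100_000)
same_size = sys.getsizeof(small) == sys.getsizeof(large)

# Elements are generated one at a time, only when the range is iterated over.
total = sum(large)
```

In CPython, `same_size` holds: the hundred-thousand-element range occupies no more memory than the ten-element one until it is actually consumed.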

The number of observations can also be specified in the `head()` method.↩︎

There are other means; see *R Interface to Python* and *Five ways to work seamlessly between R and Python in the same project*, for instance, for more information.↩︎

Events can be represented graphically using Venn diagrams – mutually exclusive events are those which do not have a common intersection.↩︎

This is a purely mathematical definition, but it agrees with the intuitive notion of independence in simple examples.↩︎

Is it clear what is meant by “independent tosses”?↩︎

What are some realistic values of \(p\)?↩︎

There is nothing to that effect in the problem statement, so we have to make another set of assumptions.↩︎

But why would we install a module which we know to be unreliable in the first place?↩︎

For the purpose of these notes, a discrete set is one in which all points are **isolated**: \(\mathbb{N}\) and finite sets are discrete, but \(\mathbb{Q}\) and \(\mathbb{R}\) are not.↩︎

Such as the # of defects on a production line over a \(1\) hr period, the # of customers that arrive at a teller over a \(15\) min interval, etc.↩︎
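Counts of this type are typically modelled with a Poisson distribution; a minimal sketch in Python (the rate of \(2\) defects per hour is an invented value, used purely for illustration):

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), straight from the definition."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.0                        # hypothetical: 2 defects/hr, on average
p_none = poisson_pmf(0, lam)     # chance of a defect-free hour, e^{-2}
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))
```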

Although it would still be a good idea to learn how to read and use them.↩︎

In theory, this cannot be the true model as this would imply that some of the wait times could be negative, but it may nevertheless be an acceptable assumption in practice.↩︎

The statement from the previous footnote applies here as well – we will assume that this is understood from this point onward.↩︎

This level of precision is usually not necessary – it is often sufficient to simply present the interval estimate: \(a\in (1.64,1.65)\)↩︎

The binomial probabilities are not typically available in textbooks (or online) for \(n=36\), although they could be computed directly in `R`, such as with `pbinom(12,36,0.5)=0.0326`.↩︎

Note that the covariance could be negative, unlike the variance.↩︎
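The same probability can also be computed exactly from first principles; a sketch in Python (mirroring, rather than calling, the `R` function quoted above):

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed from the definition."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

prob = binom_cdf(12, 36, 0.5)   # the n = 36 case discussed in this footnote
```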

If the scores did arise from a normal distribution, the \(\approx\) would be replaced by a \(=\).↩︎

How would we verify that these distributions indeed have the right characteristics? How would we determine the appropriate parameters in the first place?↩︎

Like the CLT, this is a **limiting** result.↩︎

The probability density function of \(t(\nu)\) is \[f(x)=\frac{\Gamma(\nu/2+1/2)}{\sqrt{\pi \nu}\,\Gamma(\nu/2)(1+x^2/\nu)^{\nu/2+1/2}}.\]↩︎
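As a sanity check, this density can be coded directly and integrated numerically; a sketch (the choice \(\nu=5\) and the integration grid are arbitrary):

```python
from math import gamma, pi, sqrt

def t_pdf(x: float, nu: float) -> float:
    """Density of the Student t distribution with nu degrees of freedom."""
    return gamma(nu / 2 + 0.5) / (
        sqrt(pi * nu) * gamma(nu / 2) * (1 + x ** 2 / nu) ** (nu / 2 + 0.5)
    )

# Crude midpoint rule on [-50, 50]; the mass in the tails beyond is negligible.
nu, h = 5, 0.01
mass = sum(t_pdf(-50 + (i + 0.5) * h, nu) * h for i in range(10_000))
```

The total mass comes out to \(1\) (to within the integration error), as any density's must.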

In statistical parlance, we say that 1 **degree of freedom** is lost when we use the sample to estimate the population mean.↩︎

Outlier analysis (and anomaly detection) is its own discipline – an overview is provided in Module @(ADOA).↩︎

In theory, this definition only applies to **normally distributed** data, but it is often used as a first pass for outlier analysis even when the data is not normally distributed.↩︎

In general, upper case letters are reserved for a general sample, and lower case letters for a specifically observed sample.↩︎
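For concreteness – assuming the definition in question is the common rule that flags observations lying more than three standard deviations from the mean (our reading; the rule itself is not restated here) – a first-pass sketch:

```python
from statistics import mean, stdev

def three_sigma_outliers(data):
    """Flag observations more than 3 sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > 3 * s]

# Invented data: 29 well-behaved measurements and one aberrant value.
sample = [10.0] * 29 + [100.0]
flagged = three_sigma_outliers(sample)
```

Note that the flag depends on the sample mean and standard deviation, both of which are themselves distorted by any outliers present – another reason this is only a first pass.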

This less-than-intuitive interpretation of the confidence interval is one of the disadvantages of using the frequentist approach; the analogous concept in Bayesian statistics is called the **credible interval**, which agrees with our naïve expectation of a confidence interval as saying something about how certain we are that the true parameter is in the interval [26], [40].↩︎

Sampling strategies can also help, but this is a topic for another module.↩︎

Remember, when \(\sigma\) is known (and \(n\) is large enough), we already know from the CLT that \(Z=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\) is approximately \(\mathcal{N}(0,1).\)↩︎
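This claim is easy to check by simulation; a sketch (the \(\mathcal{U}(0,1)\) population, the sample size, and the number of replicates are all arbitrary choices):

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # for reproducibility

# Population: Uniform(0, 1), for which mu = 1/2 and sigma = 1/sqrt(12) are known.
mu, sigma, n = 0.5, 1 / sqrt(12), 100

# Standardize 2000 sample means; by the CLT, these should look roughly N(0, 1).
zs = [
    (mean(random.random() for _ in range(n)) - mu) / (sigma / sqrt(n))
    for _ in range(2000)
]
```

The empirical mean of `zs` lands near 0 and its standard deviation near 1, as the CLT predicts.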

The crisis concerns the prevalence of positive findings that are contradicted in subsequent studies [41].↩︎

“Even more extreme”, in this case, means further to the left, so that \(p\mbox{-value}=P(Z\leq z_0)=\Phi(z_0),\) where \(z_0\) is the observed value for the \(Z\)-test statistic.↩︎

In order to avoid the controversy surrounding the crisis of replication?↩︎

Which, it is worth recalling, is not the same as accepting the null hypothesis.↩︎

That is to say, the treatment explains part of the difference in the observed group means.↩︎

As the spread about the group means is fairly large (relatively-speaking), we suspect that the treatment-based model on its own does not capture all the variability in the data.↩︎

If a difference is apparent and we cannot conclude that the variances are constant across groups, we need to apply a **variance stabilising transformation**, such as a **logarithmic transformation** or **square-root transformation**, before proceeding.↩︎

The medication may have strong side-effects which cannot be ignored.↩︎

The difference may be due to the **difficulty**/**high cost** of data collection for some units excluded from the study population.↩︎

Fancy footwork might be required to overcome the challenges presented by the guidelines, but that is par for the course.↩︎

Be careful not to confuse the unit \(u_j\) with its response value \(y_j\).↩︎

Will this always be the case?↩︎

Recall that \(s^2\) is a biased estimator of \(\sigma^2\) in an SRS.↩︎

The interval is “valid”, but it is perhaps too wide to be of practical use. We will discuss ways to improve the prediction in future sections.↩︎

It is evidently not the only one, as \(\overline{y}_{\text{SRS}}\) is also such an estimator.↩︎

We will continue the StS estimation procedure, for illustration purposes, but in practice, this is the stage at which we would require a different stratification or another sampling plan altogether.↩︎

In general, we do not stratify with respect to the variable of interest, but with the help of auxiliary variables that are linked to the variable of interest.↩︎

This corresponds to a tighter (smaller) C.I.↩︎

As we have noticed several times, the confidence interval can of course change depending on the sample taken.↩︎

**Warning:** Even if formal manipulations can still be performed, the estimate may not be valid **if the relationship between the variables \(X\) and \(Y\) is not linear**.↩︎

I know, I know.↩︎

There are, to be sure, important differences: quantitative consultants do not have to be data people, and the relationship between employers/stakeholders and employees (a position held by quite a few data scientists) is of a different nature from that between client and consultant, but there are enough similarities for the analogy to be useful. Failing that, it could be a good idea for data analysts and data scientists to get a sense for what motivates the consultants that might be brought in by their employers.↩︎

Typically, the available time is quite short.↩︎

It definitely was for the author of this document.↩︎

Many newly-minted consultants and data scientists have not had enough experience with **effective teamwork**, and they are likely to underestimate the challenges that usually arise from such an endeavour.↩︎

Note that individuals can play more than one role on a team.↩︎

They may also need to shield the team from clients/stakeholders.↩︎

Marketing is analogous to dating in this manner – **you have to put yourself out there**.↩︎

Exactly what constitutes illegitimate behaviour is not always easy to determine, and may vary from one client to the next, but lies and misrepresentations are big no-nos.↩︎

It is recommended that consultants **stay up-to-date** on these technologies; a principled stand against a new tech may garner support in an echo chamber, but it can also mark you as **out-of-touch** with a younger and more general audience.↩︎

Note that if you are going to base an article on a project, you should make sure to obtain **client permission** first.↩︎

At the very least, consider wearing slacks/skirt, dress shirt, belt, and dress shoes. After the first meeting, you can adjust as necessary.↩︎

Ask for permission before recording anything.↩︎

Nobody we have ever met, at least.↩︎

The military imagery is intentional.↩︎

Never call it a client error!↩︎

In dating terms: will they still respect themselves in the morning?↩︎

WARNING: nobody here is likely to be a lawyer. Get legal advice from actual lawyers, please.↩︎

Note that, in Canada at least, the specifics of contracting and insurance depend on the jurisdictions in which the client and/or the consultants operate and in which the product/service is delivered.↩︎

It is infinitely preferable to realize this **before** the contract is signed; the client is under no obligation to accommodate requests for extensions after an agreement has been reached.↩︎

**Implicit** assumptions made at various stages, either by the consultant, the client, or both. Implicit assumptions are not necessarily invalid – problems arise when they are not shared by all parties (a gap which may only reliably be discovered by attempting to gather explicit information).↩︎

See [51], [57], [58] and the entirety of your degree(s) for more information … as well as all the other modules in this book.↩︎

And not a moment too soon, if you ask us.↩︎

Let it be said one last time: the best academic or theoretical solution may not be an acceptable solution in practice.↩︎

Code that does not work as it should when it should does not look very good on analysts and consultants.↩︎

Surprisingly, this is a step that some consultants have a difficult time doing – a possible explanation of this bizarre phenomenon can be found in the accompanying video. ↩︎

There is no right or wrong answer here – remember the dating analogy: consultants have **agency**.↩︎

Fair warning: this process could be quite painful for the consultant/analyst’s ego. Introspection is one thing when it is done with the team; being criticized by the client can prove quite unpleasant, even when it is not done with malice.↩︎

Names and identifying details have been removed to preserve privacy, but note the extent to which the dating analogy remains applicable.↩︎

Some basic prep work can still be conducted, however, but not at the expense of projects that have officially been agreed to.↩︎

Take the time to document attempts at reaching the client (email, phone calls, supervisors, etc.); this could come in handy at a later stage.↩︎

That way, the client feels like they are doing something, and they may stop interrupting the team with unreasonable requests.↩︎

We are not talking about miscommunication or honest mistakes, here – some clients have a track record of abusing consultants.↩︎

These vehicles require a lot of administrative set-up on the part of the consultants, in Canada, at least [59].↩︎

We’re not sure why that is the case, to be honest – if an organization does not trust its internal experts, they are not hiring the right employees, and that is entirely on them.↩︎

There is nothing wrong with clients asking for the timeline to be revisited, and if the consultants can accommodate the new deadlines (in terms of resource availability), they should consider doing so. But the clients should not assume that a change is forthcoming just because the client’s deadlines have changed.↩︎

Always in a polite manner, of course.↩︎

Validation protocols should be in place, at any rate.↩︎

Most consulting work is unsuitable for publication, in our experience.↩︎

The dating analogy rears its head again: there are plenty of fish in the sea. All else being equal, clients prefer their consultants to be friendly rather than annoying.↩︎

When we were students, there were barely any business applications for machine learning, for instance.↩︎

This could be a gross generalization, but I cannot find any other reasonable explanation for the reticence that math/stats people have to engage in BD.↩︎

The analytical work still has to be conducted properly, however!↩︎

Crucially, this is a 2-way street: consultants also should be seeking clients they can trust.↩︎

Consultants providing what has been agreed upon.↩︎

As long as these originate with the consultant; when it is the client that asks for more, then there is the danger of scope creep.↩︎

Flexibility is the consultant’s ally, however: there are instances where it makes more sense for the consultant to walk away (subject to contractual obligations, of course).↩︎

Again with the dating analogy.↩︎

Do we need to say it?↩︎

Unless it has already been established that the consultant is away for a longer time period, which is allowed, of course – health and family first, always!↩︎

Obviously, the rules differ from one language to the other.↩︎

Ironic, we know.↩︎

There are parallels with fashion and gastronomy: sometimes we need to wear a fancy suit for a special meal, sometimes we need a t-shirt and jeans, and a poutine.↩︎

In practice, more complex **databases** are used.↩︎

Or ‘On’ and ‘Off’, ‘TRUE’ and ‘FALSE’.↩︎

Note that it also happens with small, well-organized, and easily contained projects. It happens all the time, basically.↩︎

“Every model is wrong; some models are useful.” – *George Box*↩︎

We are obviously not implying that these individuals have no ethical principles or are unethical; rather, that the opportunity to establish what these principles might be, in relation with their research, may never have presented itself.↩︎

This is not to say that ethical issues have miraculously disappeared – Volkswagen, Whole Foods Markets, General Motors, Cambridge Analytica, and Ashley Madison, to name but a few of the big data science and data analysis players, have all recently been implicated in ethical lapses [110]. More dubious examples can be found in [111], [112].↩︎

Truth be told, choosing wisely is probably the most **difficult** aspect of a data science project.↩︎

How long does it take Netflix to figure out that you no longer like action movies and want to watch comedies instead, say? How long does it take Facebook to recognize that you and your spouse have separated and that you do not wish to see old pictures of them in your feed?↩︎

Questions can also be asked in an **unsupervised** manner, see [4], [137], among others, and Quantitative Methods, briefly.↩︎

Unless we’re talking about quantum physics, and then all bets are off – nobody has the slightest idea why things happen the way they do, down there.↩︎

According to the adage, “data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.” (C. Stoll, attributed).↩︎

We could facetiously describe ontologies as “data models on steroids.”↩︎

“Times change, and we change with them.” – *C. Huberinus*↩︎

What does that make the other components?↩︎

A similar approach underlies most of modern text mining, natural language processing, and categorical anomaly detection. Information usually gets lost in the process, which explains why meaningful categorical analyses tend to stay fairly simple.↩︎

An equation for predicting weight from height could help identify individuals who are possibly overweight (or underweight), say.↩︎

In the first situation, the observations form a **time series**.↩︎

For instance, the canonical equation \(\mathbf{X}^{\!\top}\mathbf{X}\mathbf{\beta}=\mathbf{X}^{\!\top}\mathbf{Y}\) of linear regression cannot be solved as \(\mathbf{X}^{\!\top}\mathbf{X}\) is not defined if some observations are missing.↩︎
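To make the missing-data footnote concrete, here is a minimal numpy sketch (data simulated for illustration only): the normal equations \(\mathbf{X}^{\!\top}\mathbf{X}\mathbf{\beta}=\mathbf{X}^{\!\top}\mathbf{Y}\) are solvable on complete data, but a single missing entry propagates NaN through \(\mathbf{X}^{\!\top}\mathbf{X}\).

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(20), rng.normal(size=20)])  # intercept + 1 predictor
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + rng.normal(scale=0.1, size=20)

# Solving the canonical (normal) equations X^T X beta = X^T Y on complete data
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# A single missing entry makes X^T X undefined wherever that row is involved
X_missing = X.copy()
X_missing[3, 1] = np.nan
gram = X_missing.T @ X_missing
print(np.isnan(gram).any())  # True
```

This is why imputation (or row deletion) must happen before the estimation step.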

Imputation methods work best under MCAR or MAR, but keep in mind that they all tend to produce **biased estimates**.↩︎

And such a fantastic person – in spite of her superior intellect, she is adored by all of her classmates, thanks to her sunny disposition and willingness to help at all times. If only all students were like Mary Sue…↩︎

Or to simply re-enter the final grades by comparing with the physical papers…↩︎

“There ain’t no such thing as a free lunch” – there is no guarantee that a method that works best for a dataset works even reasonably well for another.↩︎

Outlying observations may be anomalous along any of the individual variables, or in combination.↩︎

Anomaly detection points towards interesting questions for analysts and subject matter experts: in this case, why is there such a large discrepancy in the two populations?↩︎

This stems partly from the fact that once the “anomalous” observations have been removed from the dataset, previously “regular” observations can become anomalous in turn in the smaller dataset; it is not clear when that runaway train will stop.↩︎

Supervised models are built to minimize a cost function; in default settings, it is often the case that the mis-classification cost is assumed to be symmetrical, which can lead to technically correct but useless solutions. For instance, the vast majority (99.999+%) of air passengers emphatically do not bring weapons with them on flights; a model that predicts that no passenger is attempting to smuggle a weapon on board a flight would be 99.999+% accurate, but it would miss the point completely.↩︎

Note that **normality** of the underlying data is an assumption for most tests; how robust these tests are against departures from this assumption depends on the situation.↩︎

The default setting only lists a limited number of categorical levels – the `summary` documentation will explain how to increase the number of levels that are displayed.↩︎

In a real-life setting, we should **definitely** verify that this assumption is valid.↩︎

We know it’s not ’cause we looked it up. It’s one of the skills we learned in grade school.↩︎

Please contact the authors if you discover missing or misattributed references.↩︎

This is certainly the case with the *Canada Revenue Agency’s My Account* service, for instance.↩︎

Not only because it’s bad practice, but also because the tasks may fail due to technical difficulties.↩︎

It would be easy for us to jump on the anti-spreadmart bandwagon (to be honest, we mostly agree with the sentiment), but we are not prepared to claim that Excel should NEVER, EVER be used, under any circumstance.↩︎

Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.↩︎

Taking into account the size of the dataset.↩︎

After filing, the amount they would receive in benefits (child benefits, GST/HST credits, etc.) is larger than what they would have to pay in taxes.↩︎

Pipelines with minimal delays.↩︎

A data pipeline SLA is a contract between a client and the provider of a data service that is incorporated into the client pipeline.↩︎

This increases efficiency, scalability and re-usability (see Big Data and Parallel Computing for a more in-depth discussion).↩︎

“A data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives. The key focus areas of data governance include availability, usability, consistency, data integrity and data security, standard compliance and includes establishing processes to ensure effective data management throughout the enterprise such as accountability for the adverse effects of poor data quality and ensuring that the data which an enterprise has can be used by the entire organization.” [193]↩︎

“DataOps is a set of practices, processes and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations.” [194]↩︎

Most databases use a **structured query language** (SQL) for writing and querying data. SQL statements include: create/drop/alter table; select, insert, update, delete; where, like, order by, group by, count, having; join.↩︎

“Database normalization is a technique for creating database tables with suitable columns and keys by decomposing a large table into smaller logical units. The process also considers the demands of the environment in which the database resides. Normalization is an iterative process. Commonly, normalizing a database occurs through a series of tests. Each subsequent step decomposes tables into more manageable information, making the overall database logical and easier to work with.” [195]↩︎
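As a small illustration of the statements listed above, the in-memory SQLite sketch below (table and column names invented for the example) exercises create table, insert, select, group by, having, and order by:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "a", 10.0), (2, "a", 5.0), (3, "b", 7.5)])
# GROUP BY / HAVING: customers whose total amount exceeds 10
rows = cur.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer HAVING total > 10 ORDER BY total DESC"
).fetchall()
print(rows)  # [('a', 15.0)]
con.close()
```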

This would be akin to looking for a needle in the world’s largest haystack!↩︎

This was written in August 2022; that list is liable to have changed quite a lot since then.↩︎

In the sense of considering all business requirements, such as federated data sources, need for scale, critical implications of real-time data ingestion or transformation, online feature engineering, handling upgrades, monitoring, etc.↩︎

Note that this is not the same thing as asking whether we *should* design such algorithms.↩︎

Note that this is not the same as understanding **why** a mushroom is poisonous or edible – the data alone cannot provide an answer to that question.↩︎

A mycologist could perhaps deduce the answer from these features alone, but she would be using her experience with fungi to make a prediction, and so would not be looking at the features in a *vacuum*.↩︎

It would have, had **Amanita muscaria**’s habitat been ‘leaves’.↩︎

The marketing team is banking on the fact that customers are unlikely to shop around to get the best deal on hot dogs AND buns, which may or may not be a valid assumption.↩︎

There will be times when an interest of 0.11 in a rule would be considered a smashing success; a lift of 15 would not be considered that significant but a support of 2% would be, and so forth.↩︎

The final trajectories were validated using the full sampling procedure.↩︎

Again, with feeling: **correlation does not imply causation.**↩︎

**Value estimation** (regression) is similar to classification, except that the target variable is numerical instead of categorical.↩︎

ID3 would never be used in a deployment setting, but it will serve to illustrate a number of classification concepts.↩︎

Classical performance evaluation metrics can easily be fooled; if one of two classes is only represented in 0.01% of the instances, predicting the non-rare class will yield correct predictions roughly 99.99% of the time, missing the point of the exercise altogether.↩︎

The relatively small size of the dataset should give data analysts pause for thought, at the very least.↩︎

Is it possible to look at Figure 11.23 without assigning labels or trying to understand what type of customers were likely to be young and have medium income? Older and wealthier?↩︎

The order in which the data is presented can play a role, as can starting configurations.↩︎

To the point that the standard joke is that “it’s not necessary to be a gardener to become a data analyst, but it helps”.↩︎

Note that the iris dataset has started being phased out in favour of the penguin dataset [233], for reasons that do not solely have to do with its overuse (hint: take a look at the name of the journal that published Fisher’s paper).↩︎

This threshold is difficult to establish exactly, however.↩︎

We could argue that the data was simply not representative – using a training set with redheads would yield a rule that would make better predictions. But “over-reporting/overconfidence” (which manifest themselves with the use of significant digits) is also part of the problem.↩︎

“It’s the best data we have!” does not mean that it is the right data, or even good data.↩︎

For instance, can we use a model that predicts whether a borrower will default on a mortgage or not to also predict whether a borrower will default on a car loan or not? The problem is compounded by the fact that there might be some link between mortgage defaults and car loan defaults, but the original model does not necessarily take this into account.↩︎

The package is called `rpart`, the function… also `rpart()`.↩︎

There are other types, such as semi-supervised or reinforcement learning, but these are topics for future modules.↩︎

The response variable \(\mathbf{Y}\) that was segregated away from \(\mathbf{X}\) in the supervised learning case could now be one of the variables in \(\mathbf{X}\).↩︎

Why?↩︎

In particular, if \(\widehat{Y}=f(\vec{X})\), then \(\widehat{Y}\approx Y=f(\vec{X})+\varepsilon\).↩︎

The proportion must be large enough to bring the variance down.↩︎

In this context, “parametric” means that assumptions are made about the form of the regression function \(f\); “non-parametric” means that no such assumptions are made.↩︎

We will revisit this concept at a later stage.↩︎

In reality, machine learning is simply applied optimization; the proof of this important result is outside the scope of this document (but see [220], [239] for details).↩︎

Failure to do so means that the model can at best be used to describe the training dataset (which might still be a valuable contribution).↩︎

Although it would be surprising if the performance on the test data is any good when the performance on the training data is middling. We shall see at a later stage that the training/testing paradigm can also help with problems related to overfitting.↩︎

New test observations may end up assuming the same values as some of the training observations, but that is an accident of sampling or due to the reality of the scenario under consideration.↩︎

Note that \(\mathbf{X}^{\!\top}\mathbf{X}\) is a \(p\times p\) matrix, which makes the inversion relatively easy to compute even when \(n\) is large.↩︎

That is, when we impose structure on the learners.↩︎

If \(Y\) represents the total monetary value in a piggy bank, \(X_1\) the number of coins, and \(X_2\) the number of pennies, what is likely to be the sign of \(\beta_2\) in the model \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon\)? Are \(X_1\) and \(X_2\) correlated? What would the interpretation look like, in this case?↩︎
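The piggy-bank question can be checked numerically. In the simulated sketch below (coin counts drawn at random, with only pennies and quarters for simplicity), the total value satisfies \(Y=0.25X_1-0.24X_2\) exactly, so the fitted coefficient on the number of pennies is negative even though pennies add value:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
pennies = rng.integers(0, 50, n)
quarters = rng.integers(0, 50, n)

X1 = pennies + quarters               # total number of coins
X2 = pennies                          # number of pennies
Y = 0.01 * pennies + 0.25 * quarters  # total value in dollars

X = np.column_stack([np.ones(n), X1, X2])
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
print(beta[2] < 0)  # True: holding the coin count fixed, more pennies -> less value
```

Note also that \(X_1\) and \(X_2\) are positively correlated by construction, which complicates the naive "pennies are worth money, so the sign must be positive" interpretation.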

Compare with the Bayesian notion of a **credible interval** (see Module 18).↩︎

These are distributions whose probability density functions satisfy \[f(\mathbf{x}\mid \vec{\theta})=h(\mathbf{x})g(\vec{\theta})\exp(\vec{\phi}(\vec{\theta})\cdot \vec{T}(\mathbf{x})).\] This includes the normal, binomial, Poisson, Gamma distributions, etc. These are all distributions with **conjugate priors** (see Module 18).↩︎

For **orthonormal covariates** (\(\mathbf{X}^{\!\top}\mathbf{X}=I_p\)), we have, in fact: \[\widehat{\beta}_{\textrm{RR},j}=\frac{\widehat{\beta}_{\textrm{OLS},j}}{1+N\lambda}.\]↩︎

For orthonormal covariates, we have \[\widehat{\beta}_{\textrm{BS},j}=\begin{cases} 0 & \text{if $|\widehat{\beta}_{\textrm{OLS},j}|<\sqrt{N\lambda}$} \\ \widehat{\beta}_{\textrm{OLS},j} & \text{if $|\widehat{\beta}_{\textrm{OLS},j}|\geq\sqrt{N\lambda}$}\end{cases}\]↩︎

For orthonormal covariates, we have \[\widehat{\beta}_{\textrm{L},j}=\widehat{\beta}_{\textrm{OLS},j}\cdot \max \left(0,1-\frac{N\lambda}{|\widehat{\beta}_{\textrm{OLS},j}|}\right).\]↩︎
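These closed forms are easy to implement. The sketch below (assuming orthonormal covariates, with `t` standing in for \(N\lambda\) and nonzero OLS coefficients) applies the ridge shrinkage and the lasso soft-thresholding rules to a vector of OLS estimates:

```python
import numpy as np

def ridge_orthonormal(beta_ols, t):
    """Ridge shrinks every coefficient by the same factor 1/(1 + t)."""
    return beta_ols / (1.0 + t)

def lasso_orthonormal(beta_ols, t):
    """Soft-thresholding: small coefficients are zeroed out entirely,
    large ones are shrunk toward 0 by the constant t."""
    return beta_ols * np.maximum(0.0, 1.0 - t / np.abs(beta_ols))

beta = np.array([3.0, -0.4, 1.2])
print(ridge_orthonormal(beta, 0.5))
print(lasso_orthonormal(beta, 0.5))
```

Unlike ridge, the lasso rule sets the second coefficient exactly to zero, which is why the lasso performs variable selection.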

Some methods make direct adjustments to the training error rate in order to estimate the test error (e.g., Mallows’ \(C_p\) statistic, \(R^2_a\), AIC, BIC, etc.)↩︎

It can also provide a basis for model selection.↩︎

For a regression model, there are many options but we typically use \[E_k=\sum_{i\in \mathcal{C}_k} \frac{(y_i-\widehat{y}_i)^2}{n_k}.\]↩︎

The estimate is usually biased, anyway.↩︎

Note that the estimates for \(\beta_0\), \(\beta_1\), and \(\text{MSE}_\text{Te}\) are likely to be correlated from one fold to the next, respectively, since the respective training sets share a fair number of observations.↩︎

Note that this is not as straightforward as one might think, so caution is advised.↩︎

For instance, sampling with replacement at the observation level would not preserve the covariance structure of time series data.↩︎

In practice, linear models have distinct advantages over more sophisticated models, mainly superior interpretability and (frequently) appropriate predictive performance (especially for linearly separable data). These “old faithful” models will still be there if fancy deep learning models fail analysts in the future.↩︎

We cannot use \(\text{SSRes}\) or \(R^2\) as metrics in this last step, as we would always select \(\mathcal{M}_p\) since \(\text{SSRes}\) decreases monotonically with \(k\) and \(R^2\) increases monotonically with \(k\). Low \(\text{SSRes}\)/high \(R^2\) are associated with a low training error, whereas the other metrics attempt to say something about the test error, which is what we are after: after all, a model is good if it makes good predictions!↩︎

We are assuming that all models are OLS models, but subset selection algorithms can be used for other families of supervised learning methods; all that is required are appropriate training error estimates for step 2b and test error estimates for step 3.↩︎

“When presented with competing hypotheses about the same prediction, one should select the solution with the fewest assumptions.”↩︎

Thus a 95% \(\text{C.I.}\) can be built just as with polynomial and other regressions.↩︎

Use `colnames()` or `str()` to list all the variables.↩︎

If \(h\) represents a straight line, say, the penalty term would be zero.↩︎

The code that would allow for a different random sample every time the code is run has been commented out in the following code box.↩︎

The curse of dimensionality is also in play when \(p\) becomes too large.↩︎

The notion of proximity depends on the distance metric in use; the Euclidean case is the most common, but it does not have to be that one.↩︎

Their number is a measure of a model’s **complexity**.↩︎

Remember, however, that we have not evaluated the performance of the models on a testing set \(\text{Te}\); we have only described some of their behaviour on the training set \(\text{Tr}\).↩︎

There is another way in which OLS could fail, but it has nothing to do with the OLS assumptions *per se*. When the set of qualitative responses contains more than \(2\) levels (such as \(\mathcal{C}=\{\text{low},\text{medium},\text{high}\}\), for instance), the response is usually encoded using numerals to facilitate the implementation of the analysis: \[Y=\begin{cases}0 & \text{if low} \\ 1 & \text{if medium} \\ 2 & \text{if high} \end{cases}\] This encoding suggests an **ordering** and a **scale** between the levels (for instance, the difference between “high” and “medium” is equal to the difference between “medium” and “low”, and half as large as the difference between “high” and “low”). OLS is not appropriate in this context.↩︎

The probit transformation uses \(g_P(y^*)=\Phi(y^*)\), where \(\Phi\) is the cumulative distribution function of \(\mathcal{N}(0,1)\).↩︎
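One standard workaround for the artificial ordering imposed by numeral encoding, sketched below with toy data (level names invented for the example), is to use indicator (one-hot) columns instead, which impose neither an ordering nor a scale on the levels:

```python
import numpy as np

levels = ["low", "medium", "high"]
y = ["low", "high", "medium", "medium", "low"]

# Ordinal encoding imposes an order and a scale: low=0, medium=1, high=2
ordinal = np.array([levels.index(v) for v in y])

# One-hot encoding: one indicator column per level, no implied scale
one_hot = np.zeros((len(y), len(levels)), dtype=int)
one_hot[np.arange(len(y)), ordinal] = 1
print(one_hot)
```

Each row now contains a single 1 marking its level, so no arithmetic relationship between levels is baked into the design matrix.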

Any other predictor distribution could be used if it is more appropriate for \(\text{Tr}\), and we could assume that the standard deviations or the means (or both) are identical across classes.↩︎

\(p\) parameters for each \(\hat{\boldsymbol{\mu}}_k\) and \(1+2+\cdots+p\) parameters for \(\hat{\boldsymbol{\Sigma}}\).↩︎

\(p\) parameters for each \(\hat{\boldsymbol{\mu}}_k\) and \(1+2+\cdots+p\) parameters for each \(\hat{\boldsymbol{\Sigma}}_k\).↩︎

Recall that all supervised learning tasks are optimization problems.↩︎

Multiple stopping criteria are used in practice, such as insisting that all final nodes contain 10 or fewer observations, etc.↩︎

This is similar to the bias-variance trade-off or the regularization framework: a good tree balances considerations of fit and complexity.↩︎

The tree is not unique, obviously, but any other tree with separators parallel to the axes will only be marginally better, at best.↩︎

To be sure, we could create an intricate decision tree with \(>2^2=4\) separating lines, but that is undesirable for a well-fitted tree.↩︎

Perfect separation would lead to overfitting.↩︎

\(F(\mathbf{x})=0\) for points on \(H_{\boldsymbol{\beta},\beta_0}\).↩︎

Technically speaking, we do not need to invoke the representer theorem in the linearly separable case. At any rate, the result is out-of-scope for this document.↩︎

By analogy with positive definite square matrices, this means that \(\sum_{i,j=1}^Nc_ic_jK(\mathbf{x}_i,\mathbf{x}_j)\geq 0\) for all \(\mathbf{x}_i\in \mathbb{R}^p\) and \(c_j\in \mathbb{R}\).↩︎

This might seem to go against reduction strategies used to counter the curse of dimensionality; the added dimensions are needed to “unfurl” the data, so to speak.↩︎

In Section 13.5, we argue that it is usually preferable to train a variety of models, rather than just the one.↩︎

The actual values of \(T(\mathbf{x};\boldsymbol{\alpha})\) have no intrinsic meaning, other than their relative ordering.↩︎

In essence, a neural network is a **function**.↩︎

Although grayscale images have only a single colour channel and could thus be stored in 2D tensors, by convention image tensors are always 3D, with a one-dimensional colour channel for grayscale images.↩︎

The gradient is the derivative of a tensor operation; it generalizes the notion of the derivative to functions of multidimensional inputs.↩︎

A beautiful animation (created by A. Radford) compares the performance of different optimization algorithms and shows that the methods usually take different paths to reach the minimum.↩︎

Nobody disputes the validity of Bayes’ Theorem, and it has proven to be a useful component in various models and algorithms, such as email spam filters and the following example, but the **use** of Bayesian statistics is controversial in many quarters.↩︎

In other problems, the predictors could be continuous rather than discrete, in which case we would use continuous distributions instead; even in the discrete case, the multinomial assumption might not be appropriate.↩︎

Low variance methods, in comparison, are those for which the results, structure, predictions, etc. remain roughly similar when using different training sets, such as OLS when \(N/p\gg 1\), and are less likely to benefit from the use of ensemble learning.↩︎

The AdaBoost code on the Two-Moons dataset was lifted from an online source whose location cannot be found at the moment.↩︎

Formally, a **kernel** is a symmetric (semi-)positive definite operator \(K:\mathbb{R}^p\times \mathbb{R}^p\to \mathbb{R}_0^+\). By analogy with positive definite square matrices, this means that \(\sum_{i,j=1}^Nc_ic_jK(\mathbf{x}_i,\mathbf{x}_j)\geq 0\) for all \(\mathbf{x}_i\in \mathbb{R}^p\) and all \(c_j\in \mathbb{R}\), and \(K(\mathbf{x},\mathbf{w})=K(\mathbf{w},\mathbf{x})\) for all \(\mathbf{x},\mathbf{w}\in \mathbb{R}^p\).↩︎

Think free-ranging robots, roughly speaking.↩︎
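The defining property can be checked numerically on a given sample: the Gram matrix of a Gaussian kernel evaluated at simulated points (bandwidth fixed to 1 for illustration) is symmetric with non-negative eigenvalues, up to floating-point error. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))  # 15 simulated observations in R^3

# Gaussian (RBF) kernel K(x, w) = exp(-||x - w||^2 / 2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2)

eigvals = np.linalg.eigvalsh(K)  # symmetric matrix -> real eigenvalues
print(eigvals.min() >= -1e-10)   # True: numerically (semi-)positive definite
```

Since the quadratic form \(\mathbf{c}^{\!\top}K\mathbf{c}\) is non-negative exactly when all eigenvalues of \(K\) are non-negative, this is a quick sanity check for a candidate kernel.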

We agree that this might be a straw-man definition of “post-modernist subjectivity”, but perhaps it is not that much of one, in the end; all things being equal, we lean more toward the objective side of things, in nature and in data analysis.↩︎

Computing the number of such partitions in general cannot be done by elementary means, but it is easy to show that the number is bounded above by \(k^n\).↩︎
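For small \(n\) and \(k\), the count can be brute-forced: since every partition into \(k\) clusters arises from at least one of the \(k^n\) labelled assignments, enumerating and canonicalizing the assignments both counts the partitions and confirms the bound. A minimal sketch:

```python
from itertools import product

def n_partitions(n, k):
    """Count partitions of {0,...,n-1} into exactly k non-empty,
    unlabelled clusters by brute force over all k^n labelled assignments."""
    seen = set()
    for labels in product(range(k), repeat=n):
        if len(set(labels)) == k:  # exactly k non-empty clusters
            # canonical form: relabel clusters by order of first appearance
            order = {}
            canon = tuple(order.setdefault(l, len(order)) for l in labels)
            seen.add(canon)
    return len(seen)

print(n_partitions(4, 2))  # 7
```

The value returned is the Stirling number of the second kind, which indeed never exceeds \(k^n\).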

Unfortunately, the clustering results depend very strongly on the initial randomization – a “poor” selection can yield arbitrarily “bad” (sub-optimal) results; \(k-\)means\(++\) selects the initial centroids so as to maximize the chance that they will be well-spread in the dataset (which also speeds up the run-time).↩︎

The results might look good on a 2-dimensional representation of the data, but how do we know it could not look better?↩︎

With \(n\) observations, there are \(1+\cdots+(n-1)=\frac{(n-1)n}{2}\) such pairs.↩︎

Note that each object has multiple dimensions, or attributes available for comparison.↩︎

While we cannot forget that they are not actual apples, we will assume that this is understood and simply refer to the objects as fruit, or apples.↩︎

An important consideration, from a general data science perspective, is whether the signature vector provides a **sufficient description** of the associated object or whether it is too crude to be of use. This is usually difficult to ascertain prior to obtaining analysis results, and comparing them to the “reality” of the underlying system (see Modules 6 and 7 for details).↩︎

Keep in mind that different similarity measures may yield different results, in some cases showing the two apples to be similar, in others to be dissimilar.↩︎

While the moniker “distance” harkens back to the notion of Euclidean (physical) distance between points in space, it is important to remember that the measurements refer to the distance between the associated signature vectors, which do not necessarily correspond to their respective physical locations.↩︎

Or in the case of soft clustering, assign each instance a “probability” of belonging to each cluster.↩︎

The similarity matrix is typically required at both stages.↩︎

The specifics of that function are not germane to the current discussion and so are omitted.↩︎

“Clustering validation” suggests that there is an ideal clustering result against which to compare the various algorithmic outcomes, and all that is needed is for analysts to determine how much the outcomes depart from the ideal result. “Cluster quality” is a better way to refer to the process.↩︎

Given that all of them supposedly provide context-free assessments of clustering quality, that is problematic (although emblematic of unsupervised endeavours).↩︎

The formula for \(\text{RI}(\mathcal{A},\mathcal{B})\) reminds one of the definition of accuracy, a performance evaluation measure for (binary) classifiers.↩︎
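The analogy can be made concrete: the Rand index is an “accuracy” computed over pairs of observations, counting the pairs on which two clusterings agree (both together, or both apart). A minimal pair-counting sketch, with no adjustment for chance:

```python
from itertools import combinations

def rand_index(A, B):
    """Fraction of observation pairs on which clusterings A and B agree:
    either both place the pair in the same cluster, or both place it apart."""
    pairs = list(combinations(range(len(A)), 2))
    agree = sum((A[i] == A[j]) == (B[i] == B[j]) for i, j in pairs)
    return agree / len(pairs)

# Identical partitions up to relabelling of clusters score 1.0
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the labels only enter through same-cluster/different-cluster comparisons, the index is invariant to relabelling, as any clustering comparison measure must be.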

In a nutshell, the expected value of \(\text{RI}(\mathcal{A},\mathcal{B})\) for independent, random clusterings \(\mathcal{A}\) and \(\mathcal{B}\) is not 0 [274].↩︎

Which it is emphatically not, it bears repeating.↩︎

In the 4-cluster case, half a cluster seems to have been mis-assigned, for instance.↩︎

These concepts are covered in just enough depth to provide an intuition about the algorithm.↩︎

This cannot be the entire story, however, as we can minimize the total weight of broken edges by simply … not cutting any edges. Indeed, there are other approaches: **Normalized Cut** (actually used in practice), Ratio Cut, Min-Max Cut, etc.↩︎

The spectral MinCut solution is not guaranteed to be the true MinCut solution, but it usually is close enough to be an acceptable approximation.↩︎

For more information about this abstraction, which actually relates a variant of Kernel PCA to spectral clustering, consult [279].↩︎

Since the product of symmetric matrices is not necessarily symmetric.↩︎

This is not the same as the minimum cut which represents the cut that minimizes the number of edges separating two vertices, but instead represents the minimum ratio of edges across the cut divided by the number of vertices in the smaller half of the partition.↩︎

DBSCAN can also fit within that framework, by picking a similarity method based on the radius that allows the graph to separate into different components. Then the multiplicity of \(\lambda_0=0\) in the Laplacian gives the number of graph components, and these can be further clustered, as above.↩︎

We borrow extensively from Deng and Han’s *Probabilistic Models for Clustering* chapter in [4].↩︎

This notation can be generalized to **fuzzy clusters**: the cluster signature of \(\mathbf{x}_j\) is \[\mathbf{z}_j\in [0,1]^k,\quad \|\mathbf{z}_j\|_2=1;\] if \(\mathbf{z}_j=(0,0,\tfrac{1}{\sqrt{2}},\tfrac{1}{\sqrt{2}},0),\) say, then we would interpret \(\mathbf{x}_j\) as belonging equally to clusters \(C_3\) and \(C_4\), or as having probability \(1/2\) of belonging to either \(C_3\) or \(C_4\).↩︎

The `mclust` vignette contains more information.↩︎

As candidate exemplars are themselves observations, we can also compute **self-responsibility**: \(r(k,k) \leftarrow s(k,k)-\max_{k'\neq k} \{s(k,k')\}.\)↩︎

The centroid of the \(\ell\)th cluster is the weighted average of ALL observations by the degree to which they belong to cluster \(\ell\).↩︎

One major challenge with hypergraph partitioning is that a hyperedge can be “broken” by a partitioning in many different ways, not all of which are qualitatively equivalent. Most hypergraph partitioning algorithms use a constant penalty for breaking a hyperedge.↩︎

The distribution of the membership of different instances to the meta-partitions can be used to determine its meta-cluster membership, or soft assignment probability.↩︎

This simple assumption is rather old-fashioned and would be disputed by many in the age of hockey analytics, but let it stand for now.↩︎

Unfortunately for this lifelong Sens fan, it most definitely would…↩︎

This section also serves as an introduction to Text Analysis and Text Mining.↩︎

An entire field of statistical endeavour – **statistical survey sampling** – has been developed to quantify the extent to which the sample is representative of the population; see Survey Sampling Methods.↩︎

For instance, if we are interested in predicting the number of passengers per flight leaving YOW (Macdonald-Cartier International Airport) and the total population of passengers is sampled, then the sampled number of passengers per flight is necessarily below the actual number of passengers per flight. Estimation methods exist to overcome these issues.↩︎

The situation may not be as stark if the observations are not i.i.d., but the principle remains the same – in high-dimensional spaces, it is harder for observations to be near one another than it is so in low-dimensional spaces.↩︎

Although there are scenarios where it could be those “small” axes that are more interesting – such as is the case with the “pancake stack” problem.↩︎

If some of the eigenvalues are 0, \(r<p\), and *vice-versa*, implying that the data was embedded in an \(r-\)dimensional manifold to begin with.↩︎

Which we assume encompasses all of this work’s readership…↩︎

This **error reconstruction** approach to PCA yields the same results as the **covariance** approach of the previous section [2].↩︎

These kernels also appear in **support vector machines** (see Section 13.4.2).↩︎

Excluding \(\mathbf{x}_i\) itself.↩︎

As with LLE, the edges of \(\mathcal{G}\) can be obtained by finding the \(k\) nearest neighbours of each node, or by selecting all points within some fixed radius \(\varepsilon\).↩︎

The first component in the similarity metric measures how likely it is that \(\mathbf{x}_i\) would choose \(\mathbf{x}_j\) as its neighbour if neighbours were sampled from a Gaussian centered at \(\mathbf{x}_i\), for all \(i,j\).↩︎

This usually requires there to be a value to predict, against which the features can be evaluated for relevance; we will discuss this further in Regression and Value Estimation and Spotlight on Classification.↩︎

Either a threshold on the ranking or on the ranking metric value itself.↩︎

This can be quite difficult to determine.↩︎

As filtering is a **pre-processing step**, proper analysis would also require building a model using this subset of features.↩︎

For instance, for a \(p-\)distance \(\delta\), set \[H^{\delta}(x_{i,j})=\arg\min_{\pi_j(\mathbf{z})}\left\{\delta(\mathbf{x}_i,\mathbf{z})\mid \text{class}(\mathbf{x}_i)=\text{class}(\mathbf{z})\right\}\] and \[M^{\delta}(x_{i,j})=\arg\min_{\pi_j(\mathbf{z})}\left\{\delta(\mathbf{x}_i,\mathbf{z})\mid \text{class}(\mathbf{x}_i)\neq \text{class}(\mathbf{z})\right\}.\]↩︎
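A minimal `Python` sketch of the nearest-hit/nearest-miss construction, with \(\delta\) taken to be the Euclidean distance on a hypothetical toy dataset (for simplicity, the sketch returns the indices of the minimizers rather than the projections \(\pi_j\)):

```python
import numpy as np

# hypothetical toy data: two tight clusters, one per class
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y = np.array([0, 0, 1, 1])

def nearest_hit_miss(i):
    d = np.linalg.norm(X - X[i], axis=1)          # delta = Euclidean distance
    d[i] = np.inf                                 # exclude x_i itself
    same = (y == y[i])
    hit = np.argmin(np.where(same, d, np.inf))    # closest same-class point
    miss = np.argmin(np.where(~same, d, np.inf))  # closest other-class point
    return int(hit), int(miss)

print(nearest_hit_miss(0))  # (1, 2)
```

The hit/miss indices can then be fed to any Relief-style relevance score for feature \(j\).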

Matrix factorization techniques have applications to other data analytic tasks; notably, they can be used to impute missing values and to build recommender systems.↩︎

Each **singular value** is the principal square root of the corresponding eigenvalue of the covariance matrix \(\mathbf{X}^{\!\top}\mathbf{X}\) (see Section 15.2.3).↩︎

Sparse vectors whose entries are 0 or 1, based on the identity of the words and POS tags under consideration.↩︎
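This relationship is easy to verify numerically; a small sketch on toy data (assumed column-centred, so that \(\mathbf{X}^{\!\top}\mathbf{X}\) plays the role of the unscaled covariance matrix):

```python
import numpy as np

# toy matrix, centred column-wise
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
X = X - X.mean(axis=0)

singular_values = np.linalg.svd(X, compute_uv=False)   # descending order
eigenvalues = np.linalg.eigvalsh(X.T @ X)[::-1]        # flip to descending

# the singular values are the principal square roots of the eigenvalues
print(np.allclose(singular_values, np.sqrt(eigenvalues)))  # True
```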

“Ye shall know a word by the company it keeps”, as the **distributional semantics** saying goes. The term “kumipwam” is not found in any English dictionary, but its probable meaning as “a small beach/sand lizard” could be inferred from its presence in sentences such as “Elowyn saw a tiny scaly kumipwam digging a hole on the beach”. It is easy to come up with examples where the context is ambiguous, but on the whole the contextual approach has proven itself to be mostly reliable.↩︎

The problem of selecting \(M\) is tackled as it is in PCA regression.↩︎

Or that their interactions are negligible.↩︎

We have encountered some of these concepts in Section 14.4.2.↩︎

One can think of this as the “reach” of each point.↩︎

For image processing, this kernel is often used with \(\alpha=c=1\).↩︎

It is often used in high-dimensional applications such as text mining.↩︎

In certain formulations, the entries of the adjacency matrix \(A\) are instead defined to take on the value 1 or 0, depending on whether the similarity between the corresponding observations is greater than (or smaller than) some pre-determined threshold \(\tau\).↩︎

Remember, the eigenvectors act as functions in this viewpoint. For a given eigenvalue \(\lambda_j\), the contour value at each point \(\mathbf{x}_i\) is the value of the associated eigenvector \(\xi_j\) in the \(i^{\text{th}}\) position, namely \(\xi_{j,i}\). For any point \(\mathbf{x}\) not in the dataset, the contour value is given by averaging the \(\xi_{j,k}\) of the observations \(\mathbf{x}_k\) near \(\mathbf{x}\), inversely weighted by the distances \(\|\mathbf{x}_k-\mathbf{x}\|\).↩︎

As a reminder, the eigenvalues themselves are ordered in increasing sequence: for the current example, \[\begin{aligned} \lambda_{1}=0 \leq \lambda_{2}&=1.30 \times 10^{-2}\leq \lambda_{3}= 3.94\times 10^{-2} \leq \cdots\leq\lambda_{20}=2.95\leq\cdots\end{aligned}\]↩︎
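A small sketch of the thresholded-adjacency variant, using a hypothetical similarity matrix and a hypothetical threshold \(\tau=0.25\); the Laplacian \(L=D-A\) is then formed, and its eigenvalues come out in increasing order, the smallest always being 0:

```python
import numpy as np

# hypothetical pairwise similarities between 3 observations
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
tau = 0.25  # hypothetical threshold

# A_ij = 1 when the similarity exceeds tau (no self-loops)
A = ((S > tau) & ~np.eye(3, dtype=bool)).astype(float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A

eigenvalues = np.linalg.eigvalsh(L)     # increasing order, as in the text
print(np.isclose(eigenvalues[0], 0.0))  # True
```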

In the remainder of this section, the subscript is dropped. Note that \(q\) is assumed, not found by the process.↩︎

Careful: the correct `Python` package to install is `umap-learn`, not `umap`.↩︎

Outlying observations may be anomalous along any of the individual variables, or in combinations of variables.↩︎

Which, by the way, should always be seen as a welcome development.↩︎

Note that **normality** of the underlying data is an assumption for most tests; how robust these tests are against departures from this assumption depends on the situation.↩︎

Before carrying out seasonal adjustment, it is important to identify and pre-adjust for structural breaks (using the Chow test, for instance), as their presence can give rise to severe distortions in the estimation of the Trend and Seasonal effects. Seasonal breaks occur when the usual seasonal activity level of a particular time reporting unit changes in subsequent years. Trend breaks occur when the trend in a data series is lowered or raised for a prolonged period, either temporarily or permanently. Sources of these breaks include changes in government policies, strike actions, exceptional events, inclement weather, etc.↩︎

X12 is implemented in SAS and R, among other platforms.↩︎

The simplest way to determine whether to use multiplicative or additive decomposition is to graph the time series. If the size of the seasonal variation increases or decreases over time, multiplicative decomposition should be used; if the seasonal variation seems to be constant over time, an additive model should be used instead. A pseudo-additive model should be used when the data exhibits the characteristics of a multiplicative series, but some of its values are close to (or equal to) zero.↩︎
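The graphical rule of thumb can also be mimicked numerically; in this hypothetical `numpy` sketch, a synthetic series whose seasonal swing grows with the trend is detrended with a 12-term moving average, and the residual amplitudes point to the multiplicative model:

```python
import numpy as np

# hypothetical monthly series: seasonal swing grows with the trend
t = np.arange(120)
trend = 100 + 2 * t
season = 1 + 0.2 * np.sin(2 * np.pi * t / 12)
y = trend * season

# a 12-term moving average estimates the trend component
ma = np.convolve(y, np.ones(12) / 12, mode="valid")
core = y[6:6 + ma.size]            # align the series with the average

add_resid = core - ma              # additive residual:       y - T
mult_resid = core / ma             # multiplicative residual: y / T

# seasonal amplitude early vs. late in the series
r_add = np.std(add_resid[:24]) / np.std(add_resid[-24:])
r_mult = np.std(mult_resid[:24]) / np.std(mult_resid[-24:])

# the multiplicative residual keeps a roughly constant amplitude
# (ratio near 1); the additive residual's amplitude grows with the trend
print(round(float(r_mult), 2), round(float(r_add), 2))
```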

Nevertheless, the analyst for whom the full picture is important might want to further evaluate the algorithm with the help of the **Matthews Correlation Coefficient** [320] or the **specificity** \(s=\frac{\text{TN}}{\text{FP}+\text{TN}}\).↩︎

This is not the case for the parameters in general clustering algorithms: if the elements of \(D\) are \(n-\)dimensional, the only restriction is that \(m\geq n+1\) (larger values of \(m\) allow for better noise identification).↩︎
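Both quantities are straightforward to compute from the confusion matrix; a sketch with hypothetical counts:

```python
import math

# hypothetical confusion-matrix counts
TP, FP, FN, TN = 40, 10, 5, 45

specificity = TN / (FP + TN)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(round(specificity, 3))  # 0.818
print(round(mcc, 3))          # 0.704
```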

While we are on the topic, regression on categorical variables is called **multinomial logistic regression**.↩︎

What does this assume, if anything at all, about the features’ independence?↩︎

Strictly speaking, the AVF score would be minimized when each of the observation’s features’ levels occurs zero times in the dataset, but then … the observation would not *actually* be in the dataset.↩︎

The available methods are all methods that we have not discussed: `HDoutliers()` from the package `HDoutliers`, `FastPCS()` from the package `FastPCS`, `mvBACON()` from `robustX`, `adjOutlyingness()` and `covMcd()` from `robustbase`, and `DetectDeviatingCells()` from `cellWise`.↩︎

It is **EXTREMELY IMPORTANT** that these flaws not simply be swept under the carpet; they need to be addressed, and the analysis outcomes that result must be presented or reported on with an appropriate *caveat*.↩︎

The `R` equivalent is `rvest`; we will not describe how to use it, but you are **strongly encouraged** to read up on this versatile tool and to use it in the Exercises.↩︎

Wikipedia is a commonly-used source of data on various topics (in a first pass, at the very least), but it should probably not be your ONLY source of information.↩︎
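Incidentally, the AVF score discussed a few notes back is simple to compute; a minimal sketch on a hypothetical categorical dataset, where each observation’s score is the mean frequency of its attribute values and low scores flag potential outliers:

```python
from collections import Counter

# hypothetical categorical observations (colour, size)
data = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "large"),
]

# per-attribute frequency tables
counts = [Counter(col) for col in zip(*data)]

def avf(obs):
    # mean frequency of the observation's attribute values
    return sum(counts[j][v] for j, v in enumerate(obs)) / len(obs)

scores = [avf(obs) for obs in data]
print(scores)  # [2.5, 2.5, 2.5, 1.5] -- the (blue, large) row is flagged
```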

Such as would be used in mathematical reasoning.↩︎

Modern Bayesian statistics is still based on formulating probability distributions to express uncertainty about unknown quantities. These can be underlying parameters of a system (induction) or future observations (prediction). **Bayesian statistics** is a system for describing epistemological uncertainty using the mathematical language of probability; **Bayesian inference** is the process of fitting a probability model to a set of data and summarizing the result with a probability distribution on the parameters of the model and on unobserved quantities (such as predictions).↩︎

The integral of these priors over the positive quadrant is infinite.↩︎

We use a different seed, so the charts are slightly different, but the main ideas hold.↩︎

Would we expect there to be more bills in circulation, given these observations, in the brittle case or the simple case?↩︎

We use a different seed, so the charts are slightly different, but the main ideas hold.↩︎

We will work with the logarithms of all quantities, so that the likelihood is a sum and not a product as would usually be the case.↩︎

The algorithm may be used to sample from any integrable function.↩︎
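A minimal sketch of the accept/reject step, using a hypothetical target density \(f(x)=2x\) on \([0,1]\), a \(\text{Uniform}(0,1)\) proposal \(g\), and the bound \(M=2\) (so that \(f(x)\leq M\,g(x)\) everywhere):

```python
import random

random.seed(42)
M = 2.0  # envelope constant: f(x) <= M * g(x) on [0, 1]

def f(x):
    # hypothetical target density on [0, 1]
    return 2.0 * x

def draw():
    while True:
        x = random.random()    # candidate from the proposal g
        u = random.random()    # uniform acceptance variable
        if u <= f(x) / M:      # accept with probability f(x) / (M g(x))
            return x

sample = [draw() for _ in range(20000)]
print(sum(sample) / len(sample))   # close to E[X] = 2/3 under f
```

The smaller \(M\) is (while still dominating \(f\)), the higher the acceptance rate; here it is \(1/M = 1/2\).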

In the worst case scenario, \(M\) would have to be smaller than the total amount of wealth available to humanity throughout history, although in practice \(M\) should be substantially smaller. Obviously, a different argument will need to be made in the case \(M=\infty\).↩︎