## 11.3 Association Rules Mining

Correlation isn’t causation. But it’s a big hint. [E. Tufte]

### 11.3.1 Overview

**Association rules discovery** is a type of unsupervised learning that finds **connections** among the attributes and levels (and combinations thereof) of a dataset’s observations. For instance, we might analyze a (hypothetical) dataset on the physical activities and purchasing habits of North Americans and discover that

- runners who are also triathletes (the **premise**) tend to drive Subarus, drink microbrews, and use smart phones (the **conclusion**), or
- individuals who have purchased home gym equipment are unlikely to be using it 1 year later, say.

But the presence of a **correlation** between the premise and the conclusion does not necessarily imply the existence of a **causal relationship** between them. It is rather difficult to “demonstrate” causation *via* data analysis; in practice, decision-makers pragmatically (and often erroneously) focus on the second half of Tufte’s rejoinder, which basically asserts that “there’s no smoke without fire.”

Case in point, while being a triathlete does not cause one to drive a Subaru, Subaru Canada thinks that the connection is strong enough to offer to reimburse the registration fee at an IRONMAN 70.3 competition (since at least 2018)! [198]

#### Market Basket Analysis

Association rules discovery is also known as **market basket analysis** after its original application, in which supermarkets record the contents of shopping carts (the **baskets**) at check-outs to determine which items are frequently purchased together.

For instance, while bread and milk might often be purchased together, that is unlikely to be of interest to supermarkets given the frequency of market baskets containing milk **or** bread (in the mathematical sense of “or”).

Knowing that a customer has purchased bread does provide some information regarding whether they also purchased milk, but the individual probability that each item is found, separately, in the basket is so high to begin with that this insight is unlikely to be useful.

If 70% of baskets contain milk and 90% contain bread, say, we would expect \[90\%\times 70\%=63\%\] of all baskets to contain milk **and** bread, should the presence of one in the basket be **totally independent** of the presence of the other.

If we then observe that 72% of baskets contain both items (a roughly 1.14-fold increase on the expected proportion, assuming there is no link), we would conclude that there is at best a **weak correlation** between the purchase of milk and the purchase of bread.

Sausages and hot dog buns, on the other hand, which we might suspect are not purchased as frequently as milk and bread, might still be purchased as a pair more often than one would expect given the frequency of baskets containing sausages **or** buns.

If 10% of baskets contain sausages, and 5% contain buns, say, we would expect that \[10\% \times 5\% = 0.5\%\] of all baskets would contain sausages **and** buns, should the presence of one in the basket be **totally independent** of the presence of the other.

If we then observe that 4% of baskets contain both items (an 8-fold increase on the expected proportion, assuming there is no link), we would obviously conclude that there is a **strong correlation** between the purchase of sausages and the purchase of hot dog buns.
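This arithmetic is easy to sketch in a few lines (a minimal illustration using the proportions quoted above; the variable names are ours):

```python
# Observed basket proportions from the sausages/buns example above.
p_sausages, p_buns = 0.10, 0.05   # individual item frequencies
p_both = 0.04                     # observed joint frequency

# Under independence, the joint frequency would be the product of the marginals.
expected = p_sausages * p_buns    # 0.005, i.e. 0.5% of baskets

# Ratio of observed to expected frequency (the "lift" of the pair).
lift = p_both / expected          # 8-fold increase: strong positive association

print(f"expected: {expected:.3%}, observed: {p_both:.1%}, lift: {lift:.1f}")
```

The same computation with the milk/bread numbers (0.72 observed vs. 0.63 expected) yields a lift of only about 1.14, which is why that pairing is far less interesting.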

It is not too difficult to see how this information could potentially be used to help supermarkets turn a profit: announcing or advertising a sale on sausages while **simultaneously** (and quietly) raising the price of buns could have the effect of bringing in a higher number of customers into the store, increasing the sale volume for both items while keeping the combined price of the two items constant.^{169}

A (possibly) apocryphal story shows the limitations of association rules: a supermarket found an association rule linking the purchase of beer and diapers and consequently moved its beer display closer to its diapers display, having confused correlation and causation.

Purchasing diapers does not cause one to purchase beer (or *vice-versa*); it could simply be that parents of newborns have little time to visit public houses and bars, and whatever drinking they do will be done at home. Who knows? Whatever the case, rumour has it that the experiment was neither popular nor successful.

#### Applications

Typical uses include:

- finding **related concepts** in text documents – looking for pairs (triplets, etc.) of words that represent a joint concept: {San Jose, Sharks}, {Michelle, Obama}, etc.;
- detecting **plagiarism** – looking for specific sentences that appear in multiple documents, or for documents that share specific sentences;
- identifying **biomarkers** – searching for diseases that are frequently associated with a set of biomarkers;
- making predictions and decisions based on association rules (there are pitfalls here);
- altering circumstances or environment to take advantage of these correlations (suspected causal effect);
- using connections to modify the likelihood of certain outcomes (see immediately above);
- imputing missing data, text autofill and autocorrect, etc.

Other uses and examples can be found in [132], [199], [200].

#### Causation and Correlation

Association rules can automate **hypothesis discovery**, but one must remain correlation-savvy (which is less prevalent among quantitative specialists than one might hope, in our experience).

If attributes \(A\) and \(B\) are shown to be correlated in a dataset, there are four possibilities:

- \(A\) and \(B\) are correlated entirely by chance in this particular dataset;
- \(A\) is a relabeling of \(B\) (or *vice-versa*);
- \(A\) causes \(B\) (or *vice-versa*), or
- some combination of attributes \(C_1,\ldots,C_n\) (which may not be available in the dataset) causes both \(A\) and \(B\).

Siegel [199] illustrates the confusion that can arise with a number of real-life examples:

- Walmart has found that sales of strawberry Pop-Tarts increase about seven-fold in the days preceding the arrival of a hurricane;
- Xerox employees engaged in front-line service and sales-based positions who use Chrome and Firefox browsers perform better on employment assessment metrics and tend to stay with the company longer, or
- University of Cambridge researchers found that liking “Curly Fries” on Facebook is predictive of high intelligence.

It can be tempting to try to **explain** these results (again, from [199]): perhaps

- when faced with a coming disaster, people stock up on comfort or nonperishable foods;
- the fact that an employee takes the time to install another browser shows that they are an informed individual and that they care about their productivity, or
- an intelligent person liked this Facebook page first, and her friends saw it, and liked it too, and since intelligent people have intelligent friends (?), the likes spread among people who are intelligent.

While these explanations *might* very well be the right ones (although probably not in the last case), there is **nothing in the data** that supports them. Association rules discovery **finds** interesting rules, but it does not explain them. **The point cannot be over-emphasized**: correlation does not imply causation.

Analysts and consultants might not have much control over the matter, but they should do whatever is in their power so that the following headlines do not see the light of day:

- “Pop-Tarts” get hurricane victims back on their feet;
- Using Chrome or Firefox improves employee performance, or
- Eating curly fries makes you more intelligent.

#### Definitions

A rule \(X\to Y\) is a statement of the form “if \(X\) (the **premise**) then \(Y\) (the **conclusion**)”, built from any logical combination of a dataset’s attributes.

In practice, a rule **does not need to be true for all observations** in the dataset – there could be instances where the premise is satisfied but the conclusion is not.

In fact, some of the “best” rules are those which are only accurate 10% of the time, as opposed to rules which are only accurate 5% of the time, say. As always, **it depends on the context**. To determine a rule’s strength, we compute various rule metrics, such as the:

- **support**, which measures the frequency at which a rule occurs in a dataset – low support values indicate rules that rarely occur;
- **confidence**, which measures the reliability of the rule: how often does the conclusion occur in the data given that the premises have occurred – rules with high confidence are “truer”, in some sense;
- **interest**, which measures the difference between its confidence and the relative frequency of its conclusion – rules with high absolute interest are … more interesting than rules with small absolute interest;
- **lift**, which measures the increase in the frequency of the conclusion which can be explained by the premises – in a rule with a high lift (\(>1\)), the conclusion occurs more frequently than it would if it were independent of the premises;
- **conviction** [201], **all-confidence** [202], **leverage** [203], **collective strength** [204], and many others [205], [206].

In a dataset with \(N\) observations, let \(\textrm{Freq}(A)\in \{0,1,\ldots,N\}\) represent the count of the dataset’s observations for which property \(A\) holds. This is all the information that is required to compute a rule’s evaluation metrics: \[\begin{aligned} \textrm{Support}(X\to Y)&=\frac{\textrm{Freq}(X\cap Y)}{N}\in[0,1] \\ \textrm{Confidence}(X\to Y)&=\frac{\textrm{Freq}(X\cap Y)}{\textrm{Freq}(X)}\in[0,1] \\ \textrm{Interest}(X\to Y)&=\textrm{Confidence}(X\to Y) - \frac{\textrm{Freq}(Y)}{N} \in [-1,1] \\ \textrm{Lift}(X\to Y) &=\frac{N^2\cdot \textrm{Support}(X\to Y)}{\textrm{Freq}(X)\cdot \textrm{Freq}(Y)} \in [0,N] \\ \textrm{Conviction}(X\to Y)&=\frac{1-\textrm{Freq}(Y)/N}{1-\textrm{Confidence}(X\to Y)}\geq 0\end{aligned}\]
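These definitions translate directly into code. Below is a minimal sketch (the function name `rule_metrics` and the demonstration counts are ours, not from the text):

```python
def rule_metrics(freq_x, freq_y, freq_xy, n):
    """Evaluation metrics for a rule X -> Y, computed from raw counts."""
    support = freq_xy / n
    confidence = freq_xy / freq_x
    interest = confidence - freq_y / n
    lift = n**2 * support / (freq_x * freq_y)
    # Conviction is undefined when confidence == 1 (division by zero).
    conviction = (1 - freq_y / n) / (1 - confidence)
    return {"support": support, "confidence": confidence,
            "interest": interest, "lift": lift, "conviction": conviction}

# Hypothetical counts: N = 100 baskets, Freq(X) = 40, Freq(Y) = 50,
# Freq(X and Y) = 30.
m = rule_metrics(freq_x=40, freq_y=50, freq_xy=30, n=100)
# support = 0.30, confidence = 0.75, interest = 0.25, lift = 1.5, conviction = 2.0
```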

#### British Music Dataset

A simple example will serve to illustrate these concepts. Consider a (hypothetical) music dataset containing data for \(N=15,356\) British music lovers and a **candidate rule** RM:

“If an individual is born before 1976 (\(X\)), then they own a copy of the Beatles’ *Sergeant Pepper’s Lonely Hearts Club Band*, in some format (\(Y\))”.

Let’s assume further that

- \(\textrm{Freq}(X)=3888\) individuals were born before 1976;
- \(\textrm{Freq}(Y)=9092\) individuals own a copy of *Sergeant Pepper’s Lonely Hearts Club Band*, and
- \(\textrm{Freq}(X\cap Y)=2720\) individuals were born before 1976 and own a copy of *Sergeant Pepper’s Lonely Hearts Club Band*.

We can easily compute the 5 metrics for RM: \[\begin{aligned} \textrm{Support}(\textrm{RM})&=\frac{2720}{15,356}\approx 18\% \\ \textrm{Confidence}(\textrm{RM})&=\frac{2720}{3888}\approx 70\% \\ \textrm{Interest}(\textrm{RM})&=\frac{2720}{3888}-\frac{9092}{15,356}\approx 0.11 \\
\textrm{Lift}(\textrm{RM}) &=\frac{15,356^2\cdot 0.18}{3888\cdot 9092} \approx 1.2 \\ \textrm{Conviction}(\textrm{RM}) &=\frac{1-9092/15,356}{1-2720/3888} \approx 1.36\end{aligned}\]
These values are easy to interpret: RM occurs in **18%** of the dataset’s instances, and it holds true in **70%** of the instances where the individual was born prior to 1976.

This would seem to make RM a **meaningful rule** about the dataset – being older and owning that album are linked properties. But if being younger and not owning that album are not also linked properties, the statement is actually weaker than it would appear at first glance.

As it happens, RM’s lift is **1.2**, which can be rewritten as \[1.2\approx \frac{0.70}{0.59},\] i.e., the 70% ownership rate among older individuals is only about 1.2 times the 59% ownership rate in the overall population (for comparison, roughly 56% of younger individuals own the album).

The ownership rates between the two age categories are different, but perhaps not as significantly as one would deduce using the confidence and support alone, which is reflected by the rule’s “low” interest, whose value is **0.11**.

Finally, the rule’s conviction is **1.36**, which means that the rule would be incorrect 36% more often if \(X\) and \(Y\) were completely independent.

All this seems to point to the rule RM being not entirely devoid of meaning, but to what extent, exactly? **This is a difficult question to answer**.^{170}

It is nearly impossible to provide **hard and fast** thresholds: it always depends on the context, and on comparing evaluation metric values for a rule with the values obtained for some of the dataset’s other rules. In short, the evaluation of a lone rule is **meaningless**.

In general, it is recommended to conduct a **preliminary exploration** of the space of association rules (using domain expertise when appropriate) in order to determine reasonable threshold ranges for the specific situation; candidate rules would then be discarded or retained depending on these metric thresholds.

This requires the ability to “easily” generate potentially meaningful candidate rules.

### 11.3.2 Generating Rules

Given association rules, it is straightforward to evaluate them using various metrics, as discussed in the previous section.

The real challenge of association rules discovery lies in **generating** a set of candidate rules which are likely to be retained, without wasting time generating rules which are likely to be discarded.

An **itemset** (or instance set) for a dataset is a list of attributes and values. A set of **rules** can be created from the itemset by adding “IF … THEN” blocks to the instances.

As an example, from the instance set

\[\{ \textrm{membership} = \textrm{True}, \textrm{age} = \textrm{Youth}, \textrm{purchasing} = \textrm{Typical} \},\]

we can create the 7 following \(3-\)item rules:

IF \((\textrm{membership} = \textrm{True}\) AND \(\textrm{age} = \textrm{Youth}\)) THEN \(\textrm{purchasing} = \textrm{Typical}\);

IF \((\textrm{age} = \textrm{Youth}\) AND \(\textrm{purchasing} = \textrm{Typical}\)) THEN \(\textrm{membership} = \textrm{True}\);

IF \((\textrm{purchasing} = \textrm{Typical}\) AND \(\textrm{membership} = \textrm{True})\) THEN \(\textrm{age} = \textrm{Youth}\);

IF \(\textrm{membership} = \textrm{True}\) THEN (\(\textrm{age} = \textrm{Youth}\) AND \(\textrm{purchasing} = \textrm{Typical}\));

IF \(\textrm{age} = \textrm{Youth}\) THEN \((\textrm{purchasing} = \textrm{Typical}\) AND \(\textrm{membership} = \textrm{True})\);

IF \(\textrm{purchasing} = \textrm{Typical}\) THEN \((\textrm{membership} = \textrm{True}\) AND \(\textrm{age} = \textrm{Youth})\);

IF \(\varnothing\) THEN (\(\textrm{membership} = \textrm{True}\) AND \(\textrm{age} = \textrm{Youth}\) AND \(\textrm{purchasing} = \textrm{Typical}\));

the 6 following \(2-\)item rules:

IF \(\textrm{membership} = \textrm{True}\) THEN \(\textrm{purchasing} = \textrm{Typical}\);

IF \(\textrm{age} = \textrm{Youth}\) THEN \(\textrm{membership} = \textrm{True}\);

IF \(\textrm{purchasing} = \textrm{Typical}\) THEN \(\textrm{age} = \textrm{Youth}\);

IF \(\varnothing\) THEN (\(\textrm{age} = \textrm{Youth}\) AND \(\textrm{purchasing} = \textrm{Typical}\));

IF \(\varnothing\) THEN \((\textrm{purchasing} = \textrm{Typical}\) AND \(\textrm{membership} = \textrm{True})\);

IF \(\varnothing\) THEN \((\textrm{membership} = \textrm{True}\) AND \(\textrm{age} = \textrm{Youth})\);

and the 3 following \(1-\)item rules:

IF \(\varnothing\) THEN \(\textrm{age} = \textrm{Youth}\);

IF \(\varnothing\) THEN \(\textrm{purchasing} = \textrm{Typical}\);

IF \(\varnothing\) THEN \(\textrm{membership} = \textrm{True}\).

In practice, we usually only consider rules with the same number of items as there are members in the itemset: in the example above, for instance, the \(2-\)item rules could be interpreted as emerging from the 3 separate itemsets

\[\begin{align*}\{\textrm{membership} &= \textrm{True}, \textrm{age} = \textrm{Youth}\} \\ \{\textrm{age} &= \textrm{Youth}, \textrm{purchasing} = \textrm{Typical}\} \\ \{\textrm{purchasing} &= \textrm{Typical}, \textrm{membership} = \textrm{True}\}\end{align*}\]

and the \(1-\)item rules as arising from the 3 separate itemsets

\[\{\textrm{membership} = \textrm{True}\},\{\textrm{age} = \textrm{Youth}\}, \{\textrm{purchasing} = \textrm{Typical}\}.\]

Note that rules of the form \(\varnothing \to X\) (or IF \(\varnothing\) THEN \(X\)) are typically denoted simply by \(X\).

Now, consider an itemset \(\mathcal{C}_n\) with \(n\) members (that is to say, \(n\) attribute/level pairs). In an \(n-\)item rule derived from \(\mathcal{C}_n\), each of the \(n\) members appears either in the premise or in the conclusion; there are thus \(2^n\) such rules, in principle.

The rule where each member is part of the premise (i.e., the rule without a conclusion) is nonsensical and is not allowed; we can derive exactly \(2^n-1\) \(n-\)item rules from \(\mathcal{C}_n\). Thus, the **number of rules increases exponentially** when the **number of features increases linearly**.

This combinatorial explosion is a problem – it instantly disqualifies the **brute force** approach (simply listing all possible itemsets in the data and generating all rules from those itemsets) for any dataset with a realistic number of attributes.
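The brute force enumeration is easy to sketch for a single small itemset, if only to confirm the \(2^n-1\) count (a hypothetical illustration; the function name is ours):

```python
from itertools import combinations

def all_rules(itemset):
    """All premise/conclusion splits of an itemset: each member goes to the
    premise or the conclusion; the split with an empty conclusion is
    nonsensical and is excluded, leaving 2**n - 1 rules."""
    items = sorted(itemset)
    rules = []
    for r in range(len(items) + 1):
        for premise in combinations(items, r):
            conclusion = tuple(i for i in items if i not in premise)
            if conclusion:  # skip the rule with no conclusion
                rules.append((premise, conclusion))
    return rules

rules = all_rules({"membership=True", "age=Youth", "purchasing=Typical"})
print(len(rules))  # 7 = 2**3 - 1, matching the seven 3-item rules above
```

Doubling \(n\) from 10 to 20 attributes takes the rule count from roughly a thousand to roughly a million, which is what rules out brute force in practice.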

How can we then generate a small number of **promising** candidate rules, in general?

### 11.3.3 The A Priori Algorithm

The **a priori** algorithm is an early attempt to overcome that difficulty. Initially, it was developed to work for **transaction data** (i.e., goods as columns, customer purchases as rows), but every reasonable dataset can be transformed into a transaction dataset using dummy variables.

The algorithm attempts to find **frequent itemsets** from which to build candidate rules, instead of building rules from **all** possible itemsets.

It starts by identifying frequent **individual items** in the database and extends those that are retained into larger and larger **item supersets**, which are themselves retained only if they occur **frequently enough** in the data.

The main idea is that “all non-empty subsets of a frequent itemset must also be frequent” [207], or equivalently, that all supersets of an infrequent itemset must also be infrequent (see Figure 11.4).

In the technical jargon of machine learning, we say that a priori uses a **bottom-up approach** and the **downward closure property of support**.

The memory savings arise from the fact that the algorithm prunes candidates with **infrequent sub-patterns** and removes them from consideration for any future itemset: if a \(1-\)itemset is not considered to be frequent enough, any \(2-\)itemset containing it is also infrequent (see Figure 11.5 for another illustration).

A list of the 4 teams making the playoffs each year is shown on the left (\(N=20\)). Frequent itemsets are generated using the *a priori* algorithm, with a support threshold of 10. We see that there are \(5\) frequent \(1-\)itemsets, top row, in yellow (New York made the playoffs \(6<10\) times – no larger frequent itemset can contain New York). 6 frequent \(2-\)itemsets are found in the subsequent list of ten \(2-\)itemsets, top row, in green (note the absence of New York). Only 2 frequent \(3-\)itemsets are found, top row, in orange. Candidate rules are generated from the shaded itemsets; the rules retained by the thresholds \[\textrm{Support}\geq 0.5,\ \textrm{Confidence}\geq 0.7, \text{ and }\textrm{Lift}>1\ \text{(barely)},\] are shown in the table on the bottom row – the main result is that when Boston made the playoffs, it was not surprising to see Detroit also make the playoffs (the presence or absence of Montreal in a rule is a red herring, as Montreal made the playoffs every year in the data). Are these rules meaningful at all?

Of course, this process requires a support threshold **input**, for which there is no guaranteed way to pick a “good” value; it has to be set sufficiently high to minimize the number of frequent itemsets that are being considered, but not so high that it removes too many candidates from the **output list**; as ever, optimal threshold values are **dataset-specific**.

The algorithm terminates when no further itemset extensions are retained, which always occurs given the finite number of levels in categorical datasets.
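The level-wise search described above can be sketched as follows (a simplified illustration on a made-up transaction list, not a production implementation):

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Frequent-itemset generation sketch of the a priori algorithm: grow
    candidate itemsets level by level, keeping only those whose support count
    meets the threshold (downward closure of support)."""
    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    freq = [{frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_support}]
    k = 2
    while freq[-1]:
        prev = freq[-1]
        # Candidates: k-member unions of frequent (k-1)-itemsets...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...pruned so that every (k-1)-subset is itself frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        freq.append({c for c in candidates
                     if sum(c <= t for t in transactions) >= min_support})
        k += 1
    return [s for level in freq for s in level]

# Made-up baskets for illustration.
baskets = [frozenset(b) for b in
           [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "beer"},
            {"bread", "beer"}, {"milk", "bread"}]]
frequent = apriori_frequent(baskets, min_support=3)
print(frequent)
```

With a support threshold of 3, the three individual items survive, but of the three candidate pairs only {milk, bread} (present in 3 of the 5 baskets) is retained; no \(3-\)itemset candidate remains, so the search stops.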

- **Strengths:** easy to implement and to parallelize [208];
- **Limitations:** slow, requires frequent dataset scans, not ideal for finding rules for infrequent and rare itemsets.

More efficient algorithms have since displaced it in practice (although the *a priori* algorithm retains historical value):

- **Max-Miner** tries to identify frequent itemsets without enumerating them – it performs jumps in itemset space instead of using a bottom-up approach;
- **Eclat** is faster and uses depth-first search, but requires extensive memory storage (*a priori* and Eclat are both implemented in the `R` package `arules` [202]).

### 11.3.4 Validation

How **reliable** are association rules? What is the likelihood that they occur entirely **by chance**? How **relevant** are they? Can they be generalised **outside** the dataset, or to **new** data streaming in?

These questions are notoriously difficult to answer for association rules discovery, but **statistically sound association discovery** can help reduce the risk of finding spurious associations to a user-specified significance level [205], [206]. We end this section with a few comments:

- Since frequent rules correspond to instances that occur repeatedly in the dataset, algorithms that generate itemsets often try to **maximize coverage**. When **rare events** are more meaningful (such as detection of a rare disease or a threat), we need algorithms that can generate rare itemsets. **This is not a trivial problem**.
- Continuous data has to be binned into **categorical** data to generate rules. As there are many ways to accomplish that task, the same dataset can give rise to completely different rules. This could create some credibility issues with clients and stakeholders.
- Other popular algorithms include: AIS, SETM, aprioriTid, aprioriHybrid, PCY, Multistage, Multihash, etc. Additional evaluation metrics can be found in the `arules` documentation [202].
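As an illustration of the binning issue raised above, here is a minimal sketch (the cut points and labels are arbitrary choices of ours; different choices can lead to entirely different rules):

```python
def bin_age(age, cuts=(18, 35, 65),
            labels=("minor", "young", "middle", "senior")):
    """Map a continuous age to a categorical level using fixed cut points."""
    for cut, label in zip(cuts, labels):
        if age < cut:
            return label
    return labels[-1]  # at or above the last cut point

ages = [12, 22, 40, 70]
levels = [f"age={bin_age(a)}" for a in ages]  # usable as itemset members
print(levels)
```

Shifting a single cut point (say, 35 to 30) reassigns every observation near the boundary, which is precisely why two analysts binning the same dataset can extract different rules.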

### 11.3.5 Case Study: Danish Medical Data

In *Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients* [126], A.B. Jensen et al. study diagnoses in the Danish population, with the help of association rules mining and clustering methods.

#### Objectives

Estimating **disease progression** (trajectories) from a patient’s current state is a crucial task in medical studies. Such trajectories had (at the time of publication) only been analyzed for a small number of diseases, or using large-scale approaches without consideration for time exceeding a few years. Using data from the *Danish National Patient Registry* (an extensive, long-term data collection effort by Denmark), the authors sought connections between different **diagnoses**: how does the presence of a diagnosis at some point in time allow for the prediction of another diagnosis at a later point in time?

#### Methodology

The authors took the following methodological steps:

- compute the **strength of correlation** for pairs of diagnoses over a 5-year interval (on a representative subset of the data);
- test diagnosis pairs for **directionality** (one diagnosis repeatedly occurring before the other);
- determine reasonable **diagnosis trajectories** (thoroughfares) by combining smaller (but frequent) trajectories with overlapping diagnoses;
- **validate** the trajectories by comparison with non-Danish data;
- **cluster** the thoroughfares to identify a small number of **central medical conditions** (key diagnoses) around which disease progression is organized.

#### Data

The Danish National Patient Registry is an electronic health registry containing administrative information and diagnoses, covering the whole population of Denmark, including private and public hospital visits of all types: inpatient (overnight stay), outpatient (no overnight stay) and emergency. The data set covers 15 years, from January ’96 to November ’10 and consists of 68 million records for 6.2 million patients.

#### Challenges and Pitfalls

- Access to the **Patient Registry** is protected and could only be granted after approval by the *Danish Data Registration Agency* and the *National Board of Health*.
- Gender-specific differences in diagnostic trends are clearly identifiable (pregnancy and testicular cancer do not have much cross-appeal), but many diagnoses were found to be made exclusively (or at least, predominantly) in different sites (inpatient, outpatient, emergency ward), which suggests the importance of stratifying by **site** as well as by **gender**.
- In the process of forming small diagnosis chains, it became necessary to compute the correlations using **large groups** for each pair of diagnoses. For close to 1 million diagnosis pairs, more than 80 million samples would have been required to obtain significant \(p-\)values while compensating for **multiple testing**, which would have translated to a few thousand years’ worth of computer running time. A pre-filtering step was included to avoid this pitfall.^{171}

#### Project Summary and Results

The dataset was reduced to **1,171 significant trajectories**. These thoroughfares were clustered into patterns centred on 5 key diagnoses central to disease progression:

- **diabetes**;
- **chronic obstructive pulmonary disease** (COPD);
- **cancer**;
- **arthritis**, and
- **cerebrovascular disease**.

Early diagnoses for these central factors can help reduce the risk of adverse outcome linked to future diagnoses of other conditions.

Two author quotes illustrate the importance of these results:

“The sooner a health risk pattern is identified, the better we can prevent and treat critical diseases.” [S. Brunak]

“Instead of looking at each disease in isolation, you can talk about a complex system with many different interacting factors. By looking at the order in which different diseases appear, you can start to draw patterns and see complex correlations outlining the direction for each individual person.” [L.J. Jensen]

Among the specific results, the following “surprising” insights were found:

- a diagnosis of anemia is typically followed months later by the discovery of colon cancer;
- gout was identified as a step on the path toward cardiovascular disease, and
- COPD is under-diagnosed and under-treated.

The disease trajectories cluster for COPD, for instance, is shown in Figure 11.6.

### 11.3.6 Toy Example: Titanic Dataset

Compiled by Robert Dawson in 1995, the *Titanic* dataset consists of 4 categorical attributes for each of the 2201 people aboard the Titanic when it sank in 1912 (some issues with the dataset have been documented, but we will ignore them for now):

- **class** (1st class, 2nd class, 3rd class, crewmember);
- **age** (adult, child);
- **sex** (male, female);
- **survival** (yes, no).

The natural question of interest for this dataset is:

“How does survival relate to the other attributes?”

This is not, strictly speaking, an unsupervised task (as the interesting rules’ structure is fixed to conclusions of the form \(\textrm{survival} = \textrm{Yes}\) or \(\textrm{survival} = \textrm{No}\)).

For the purpose of this example, we elect not to treat the problem as a **predictive task**, since the situation on the Titanic has little bearing on survival for new data – as such, we use fixed-structure association rules to **describe** and **explore** survival conditions on the *Titanic* (compare with [209]).
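A fixed-conclusion rule can be scored directly, without a full mining pass. The sketch below uses a handful of made-up records (not the actual Titanic counts) purely to illustrate the mechanics:

```python
# Hypothetical (class, age, sex, survival) records -- NOT the real Titanic
# data -- just to show fixed-conclusion rule scoring.
records = [
    ("1st", "adult", "female", "yes"), ("1st", "adult", "male", "no"),
    ("3rd", "adult", "male", "no"),    ("3rd", "adult", "male", "no"),
    ("3rd", "child", "female", "yes"), ("crew", "adult", "male", "no"),
    ("crew", "adult", "male", "yes"),  ("2nd", "child", "male", "yes"),
]
n = len(records)

def score(premise, conclusion="yes"):
    """Support and confidence of 'IF premise THEN survival = conclusion';
    premise is a list of (attribute index, value) conditions."""
    match = [r for r in records if all(r[i] == v for i, v in premise)]
    hits = [r for r in match if r[3] == conclusion]
    return len(hits) / n, (len(hits) / len(match)) if match else 0.0

# e.g. the rule "IF sex = female THEN survival = yes" (index 2 is sex):
support, confidence = score(premise=[(2, "female")])
```

Fixing the conclusion to \(\textrm{survival} = \textrm{Yes}\) or \(\textrm{survival} = \textrm{No}\) shrinks the rule space dramatically, which is what makes the exploration tractable.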

We use the `arules` implementation of the *a priori* algorithm in `R` to generate and prune candidate rules, eventually leading to **8 rules** (the results are visualized in Figure 11.7). Who survived? Who didn’t?^{172}

We show how to obtain these rules *via* `R` in Association Rules Mining: Titanic Dataset.

### References

A.B. Jensen *et al.*, “Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients,”

*Nature Communications*, vol. 5, 2014, doi: 10.1038/ncomms5022.

*Journal of the American Medical Informatics Association*, vol. 5, no. 4, pp. 373–381, Jul. 1998, doi: 10.1136/jamia.1998.0050373.

*Predictive analytics: The power to predict who will click, buy, lie or die*. Predictive Analytics World, 2016.

*IEEE Transactions on Knowledge and Data Engineering*, vol. 15, no. 1, pp. 57–69, 2003, doi: 10.1109/TKDE.2003.1161582.

*Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems*, 1998, pp. 18–24. doi: 10.1145/275487.275490.

*Inf. Syst.*, vol. 29, no. 4, pp. 293–313, Jun. 2004, doi: 10.1016/S0306-4379(03)00072-3.

*CoRR*, vol. abs/0803.0966, 2008.

*Towards Data Science*, Oct. 2020.

*Mining of Massive Datasets*. Cambridge Press, 2014.

*Kaggle.com*, 2016.