7.3 Ethics in the Data Science Context

A lapse in ethics can be a conscious choice… but it can also be negligence. [103]

In most empirical disciplines, ethics are brought up fairly early in the educational process and may end up playing a crucial role in researchers’ activities. At Memorial University of Newfoundland, for instance, “proposals for research in the social sciences, humanities, sciences, and engineering, including some health-related research in these areas,” must receive approval from specific Ethics Review Boards.

This could, among other cases, apply to research and analysis involving [104]:

  • living human subjects;

  • human remains, cadavers, tissues, biological fluids, embryos or foetuses;

  • a living individual in the public arena if they are to be interviewed and/or private papers accessed;

  • secondary use of data – health records, employee records, student records, computer listings, banked tissue – if any form of identifier is involved and/or if private information pertaining to individuals is involved, and

  • quality assurance studies and program evaluations which address a research question.

In our experience, however, data scientists and data analysts who come to the field by way of mathematics, statistics, computer science, economics, or engineering are not as likely to have encountered research ethics boards or to have had formal ethics training. Furthermore, discussions on ethical matters are often set aside, perhaps understandably, in favour of pressing technical or administrative considerations (such as algorithm selection, data cleaning strategies, contractual issues, etc.) when faced with hard deadlines.

The problem, of course, is that the current deadline is eventually replaced by another deadline, and then by a new deadline, with the end result being that the conversation may never take place. It is to address this all-too-common scenario that we take the time to discuss ethics in the data science context; more information is available in [105], [106].

7.3.1 The Need for Ethics

When large-scale data collection first became possible, there was to some extent a ‘Wild West’ mentality to data collection and use. To borrow from the old English law principle, whatever was not prohibited (from a technological perspective) was allowed.

Now, however, professional codes of conduct are being devised for data scientists [107]–[109], outlining responsible ways to practice data science – ways that are legitimate rather than fraudulent, and ethical rather than unethical. Although this shifts some added responsibility onto data scientists, it also provides them with protection from clients or employers who would hire them to carry out data science in questionable ways – they can refuse on the grounds that it is against their professional code of conduct.

7.3.2 What Is/Are Ethics?

Broadly speaking, ethics refers to the study and definition of right and wrong conduct. Ethics may consider what is right or wrong when it comes to actions in general, or consider how broad ethical principles are appropriately applied in more specific circumstances.

And, as noted by R.W. Paul and L. Elder, ethics is not (necessarily) the same as social convention, religious beliefs, or laws [113]; that distinction is not always fully understood. The following influential ethical theories are often used to frame the debate around ethical issues in the data science context:

  • Golden rule: do unto others as you would have them do unto you;

  • Consequentialism: the end justifies the means;

  • Utilitarianism: act in order to maximize positive effect;

  • Moral Rights: act to maintain and protect the fundamental rights and privileges of the people affected by actions;

  • Justice: distribute benefits and harm among stakeholders in a fair, equitable, or impartial way.

In general, it is important to remember that our planet’s inhabitants subscribe to a wide variety of ethical codes, including:

Confucianism, Taoism, Buddhism, Shinto, Ubuntu, Te Ara Tika (Maori), First Nations Principles of OCAP, various aspects of Islamic ethics, etc.

It is not too difficult to imagine contexts in which any of these (or other ethical codes, or combinations thereof) would be better-suited to the task at hand – the challenge is to remember to inquire and to heed the answers.

7.3.3 Ethics and Data Science

How might these ethical theories apply to data analysis? The (former) University of Virginia’s Centre for Big Data Ethics, Law and Policy suggested some specific examples of data science ethics questions [114]:

  • who, if anyone, owns data?

  • are there limits to how data can be used?

  • are there value-biases built into certain analytics?

  • are there categories that should never be used in analyzing personal data?

  • should data be publicly available to all researchers?

The answers may depend on a number of factors, not the least of which is the matter of who is actually providing them to you. To give you an idea of some of the complexities, let us consider as an example the first of those questions: who, if anyone, owns data?

In some sense, the data analysts who transform the data’s potential into usable insights are only one of the links in the entire chain. Processing and analyzing the data would be impossible without raw data on which to work, so the data collectors also have a strong ownership claim to the data.

But collecting the data can be a costly endeavour, and it is easy to imagine how the sponsors or employers (who made the process economically viable in the first place) might feel that the data and its insights are rightfully theirs to dispose of as they wish.

In some instances, the law may chime in as well. Indeed, one can easily list other players, but let it suffice to say that this simple question turns out to be far from easily answered, and may even change from case to case. Incidentally, this also highlights a hidden truth regarding the data analysis process: there is more to data analysis than just data analysis.

A similar challenge arises with regard to open data, where the “pro” and “anti” factions both have strong arguments (see [115]–[117], and [118] for a science-fictional treatment of the transparency-vs.-secrecy/security debate).

The answers to the above ethical questions aside, a general principle of data analysis is to eschew the anecdotal in favour of the general – from a purely analytical perspective, too narrow a focus on specific observations can end up obscuring the full picture (a vivid illustration can be found in [119]).

But data points are not solely marks on paper or electromagnetic bytes in the cloud. Decisions made on the basis of data science (in all manner of contexts, from security to finance, marketing, and policy) may affect living beings in negative ways. And it cannot be ignored that outlying/marginal individuals and minority groups often suffer disproportionately at the hands of so-called evidence-based decisions [120]–[122].

7.3.4 Guiding Principles

Under the assumption that one is convinced of the importance of proceeding ethically, it could prove helpful to have a set of guiding principles to aid in these efforts.

In his seminal science fiction series about positronic robots, Isaac Asimov introduced the now-famous Laws of Robotics, which he believed would have to be built in so that robots (and by extension, any tool used by human beings) could overcome humanity’s Frankenstein complex (the fear of mechanical beings) and help rather than hinder human social, scientific, cultural, and economic activities [123]:

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.

2. A robot must obey the orders given to it by human beings, except where such orders would conflict with the 1st Law.

3. A robot must protect its own existence as long as such protection does not conflict with the 1st and 2nd Law.

Had they been uniformly well-implemented and respected, the potential for story-telling would have been somewhat reduced; thankfully, Asimov found entertaining ways to break the Laws (and to resolve the resulting conflicts) which made the stories both enjoyable and insightful.

Interestingly enough, he realized over time that a Zeroth Law had to supersede the First in order for the increasingly complex and intelligent robots to succeed in their goals. Later on, other thinkers contributed a few more, filling in some of the holes.

Asimov’s (expanded) Laws of Robotics:

00. A robot may not harm sentience or, through inaction, allow sentience to come to harm.

0. A robot may not harm humanity, or, through inaction, allow humanity to come to harm, as long as this action/inaction does not conflict with the 00th Law.

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm, as long as this does not conflict with the 00th or the 0th Law.

2. A robot must obey the orders given to it by human beings, except where such orders would conflict with the 00th, the 0th or the 1st Law.

3. A robot must protect its own existence as long as such protection does not conflict with the 00th, the 0th, the 1st or the 2nd Law.

4. A robot must reproduce, as long as such reproduction does not interfere with the 00th, the 0th, the 1st, the 2nd or the 3rd Law.

5. A robot must know it is a robot, unless such knowledge would contradict the 00th, the 0th, the 1st, the 2nd, the 3rd or the 4th Law.

We cannot speak to the validity of these laws for robotics (a term coined by Asimov, by the way), but we do find the entire set satisfyingly complete.

What does this have to do with data science? Various thinkers have discussed the existence and potential merits of different sets of Laws ([124]) – wouldn’t it be useful if there were Laws of Analytics, moral principles that could help us conduct data science ethically?

Best Practices

Such universal principles are unlikely to exist, but best practices have nonetheless been suggested over the years.

“Do No Harm”

Data collected from an individual should not be used to harm the individual. This may be difficult to track in practice, as data scientists and analysts do not always participate in the ultimate decision process.

Informed Consent

Informed consent covers a wide variety of ethical issues, chief among them that individuals must agree to the collection and use of their data, and that they must have a real understanding of what they are consenting to and of the possible consequences for them and for others. Informed consent also requires the ability to not consent, i.e., to opt out; non-active consent is not really consent.

The Respect of “Privacy”

This principle is dearly held in theory, but it is hard to adhere to it religiously with robots and spiders constantly trolling the net for personal data. In The Transparent Society, D. Brin (somewhat) controversially suggests that privacy and total transparency are closely linked [116]:

“And yes, transparency is also the trick to protecting privacy, if we empower citizens to notice when neighbors [sic] infringe upon it. Isn’t that how you enforce your own privacy in restaurants, where people leave each other alone, because those who stare or listen risk getting caught?”

Keeping Data Public

Another aspect of data privacy, and a thornier issue: should some data be kept private? Most? All? It is fairly straightforward to imagine scenarios where adherence to the principle of public data could cause harm to individuals (for instance, revealing the source of a leak in a country where the government routinely jails members of the opposition), thereby contradicting the first principle against causing harm. But it is just as easy to imagine scenarios where keeping data private would have a similar effect.

Anonymize Data

Identifying fields should be removed from the dataset prior to processing and analysis. Let any temptation to use personal information in an inappropriate manner be removed from the get-go, but be aware that this is easier said than done, from a technical perspective.
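
As a minimal sketch of what removing identifying fields might look like in practice, the Python snippet below drops a set of fields from each record before any analysis takes place. The field names and the `strip_identifiers` helper are hypothetical, chosen only for illustration; which fields count as identifying depends on the dataset at hand.

```python
from typing import Iterable

# Hypothetical field names, chosen for illustration only.
IDENTIFYING_FIELDS = {"name", "email", "phone", "health_card_number"}


def strip_identifiers(records: Iterable[dict], identifying_fields: set) -> list:
    """Return copies of the records with the identifying fields removed."""
    return [
        {key: value for key, value in record.items() if key not in identifying_fields}
        for record in records
    ]


# Made-up records; the originals are left untouched by the helper.
raw_records = [
    {"name": "A. Person", "email": "a@example.com", "age": 34, "postal_code": "A1B 2C3", "visits": 5},
    {"name": "B. Person", "email": "b@example.com", "age": 51, "postal_code": "K1A 0B1", "visits": 2},
]

clean_records = strip_identifiers(raw_records, IDENTIFYING_FIELDS)
print(clean_records)
```

This is part of why the task is easier said than done: the fields that remain (here, age and postal code) can sometimes be combined to re-identify an individual, so dropping the obvious identifiers is a starting point rather than a guarantee of anonymity.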

Let the Data Speak

It is crucial to absolutely restrain oneself from cherry-picking the data. Use all of it in some way or another; validate your analysis and make sure your results are repeatable.
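
One possible way to make a result repeatable is to fix the seed of any random procedure used in the analysis. The sketch below, which is only an illustration under assumed data, computes an approximate percentile bootstrap interval for a mean using every observation (including the inconvenient outlier), and checks that two runs with the same seed agree. The function name, the observations, and the choice of the bootstrap are assumptions made for this example, not a prescribed method.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=1000, seed=0):
    """Approximate 95% percentile bootstrap interval for the mean of all values."""
    rng = random.Random(seed)  # fixed seed: the same resamples are drawn on every run
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


# Illustrative values only; the outlier (10.7) is kept rather than cherry-picked away.
observations = [2.1, 3.4, 2.9, 10.7, 3.1, 2.8, 3.0, 3.3]

first_run = bootstrap_mean_interval(observations, seed=42)
second_run = bootstrap_mean_interval(observations, seed=42)
print(first_run == second_run)
```

Recording the seed alongside the results lets someone else rerun the analysis and obtain the same interval, which is one small, concrete piece of repeatability.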

7.3.5 The Good, the Bad, and the Ugly

Data projects could whimsically be classified as good, bad or ugly, either from a technical or from an ethical standpoint (or both). We have identified instances in each of these classes (of course, our own biases are showing):

  • good projects increase knowledge, can help uncover hidden links, and so on: [77]–[79], [83], [86], [87], [93], [125]–[132]

  • bad projects can lead to bad decisions, which can in turn decrease the public’s confidence and potentially harm some individuals: [80], [84], [91], [92], [119]

  • ugly projects are, flat out, unsavoury applications; they are poorly executed from a technical perspective, or put a lot of people at risk; these (and similar approaches/studies) should be avoided: [89], [90], [120]–[122], [133]


N. Bien, P. Rajpurkar, et al., “Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet,” PLOS Medicine, vol. 15, no. 11, pp. 1–19, 2018, doi: 10.1371/journal.pmed.1002699.
Columbia University Irving Medical Center, “Data scientists find connections between birth month and health,” Newswire.com, Jun. 2015.
Indiana University, “Scientists use Instagram data to forecast top models at New York Fashion Week,” Science Daily, Sep. 2015.
S. Ramachandran and J. Flint, “At Netflix, who wins when it’s Hollywood vs. the algorithm?” Wall Street Journal, Nov. 2018.
R. Schutt and C. O’Neil, Doing Data Science: Straight Talk from the Frontline. O’Reilly, 2013.
“Research integrity & ethics.” Memorial University of Newfoundland.
J. Schellinck and P. Boily, “Data, automation, and ethics,” Data Science Report Series, 2020.
“Code of ethics/conducts.” Certified Analytics Professional.
“ACM code of ethics and professional conduct.” Association for Computing Machinery.
R. W. Paul and L. Elder, Understanding the Foundations of Ethical Reasoning, 2nd ed. Foundation for Critical Thinking, 2006.
“Centre for big data ethics, law, and policy.” Data Science Institute, University of Virginia.
“Open data.” Wikipedia.
J. S. A. Corey, The Expanse. Orbit Books, 2011–2021.
A. Gumbus and F. Grodzinsky, “Era of Big Data: Danger of discrimination,” ACM SIGCAS Computers and Society, vol. 45, no. 3, pp. 118–125, 2015.
I. Asimov, Foundation series. Gnome Press, Spectra, Doubleday, 1942–1993.
I. Stewart, “The fourth law of humanics,” Nature, vol. 535, 2016.
J. Cranshaw, R. Schwartz, J. I. Hong, and N. M. Sadeh, “The Livehoods Project: Utilizing Social Media to Understand the Dynamics of a City,” in ICWSM.
S. E. Brossette, A. P. Sprague, J. M. Hardin, K. B. Waites, W. T. Jones, and S. A. Moser, “Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance,” Journal of the American Medical Informatics Association, vol. 5, no. 4, pp. 373–381, Jul. 1998, doi: 10.1136/jamia.1998.0050373.
M. Kosinski and Y. Wang, “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images,” Journal of Personality and Social Psychology, vol. 114, no. 2, pp. 246–257, Feb. 2018.