7.6 Exercises

  1. Write a paper discussing some of the ethical issues surrounding the use of artificial intelligence (A.I.), data science (D.S.), and/or machine learning (M.L.) algorithms in the public sector, the private sector, or academia.

    • Establish a list of the 3 most important ethical principles that the use of such algorithms should abide by. Explain why you have selected each of these principles.

    • Describe (at least) 2 real-life instances of the use of A.I./D.S./M.L. in the public sector, the private sector, or academia in which the ethical principles you have chosen were violated. Discuss how the failure to abide by your selected ethical principles has caused (or could cause) harm to individuals, organizations, countries, etc.

    • Suggest how the projects discussed above could have been modified so that their use of A.I./D.S./M.L. algorithms would abide by your selected ethical principles.

  2. Select a data project of interest to you (either personally or professionally) and provide a first planning draft for it, touching on the topics discussed in this module and in Non-Technical Aspects of Data Work. The following questions can help guide your proposal:

    • What are some questions associated with the project?

    • What is the conceptual model of the underlying situation?

    • What kind of dataset(s) exist that could help you answer these questions?

    • Are there data or analytical limitations?

    • Do you need to collect new data to handle such questions?

    • How is the data stored/accessed? What are the infrastructure requirements?

    • What do deliverables look like?

    • How would successes be quantified/qualified?

    • What are your timelines and availability?

    • What skillsets are required to work on this project?

    • Would you work on this alone or as part of a team?

    • How costly would it be to initiate and complete this project?

    • What does the data analysis pipeline look like?

    • What software and analytical methods will be used?

    • etc.

  3. The file cities.txt contains population information about a country’s cities. A city is classified as “small” if its population is below 75K, as “medium” if it falls between 75K and 1M, and as “large” otherwise.

    • Locate and load the file into the workspace of your choice. How many cities are there? How many in each group?

    • Display summary population statistics for the cities, both overall and by group (one possible starting point is sketched below).
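
    A minimal sketch of one possible approach, in Python with pandas. It assumes that cities.txt lists one population value per line with no header; the file's actual layout may differ, in which case the loading step would need to be adjusted.

    ```python
    import pandas as pd

    # Assumption: cities.txt contains one population value per line, no header.
    # Adjust the read call if the file also includes city names or other columns.
    cities = pd.read_csv("cities.txt", header=None, names=["population"])

    # Classify each city using the thresholds from the exercise statement
    # (below 75K: small; 75K up to 1M: medium; 1M and above: large).
    cities["group"] = pd.cut(
        cities["population"],
        bins=[0, 75_000, 1_000_000, float("inf")],
        labels=["small", "medium", "large"],
        right=False,
    )

    print(len(cities))                       # how many cities are there?
    print(cities["group"].value_counts())    # how many in each group?
    print(cities["population"].describe())   # overall summary statistics
    print(cities.groupby("group", observed=True)["population"].describe())  # by group
    ```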

  4. The remaining exercises use the Gapminder Tools (there is also an offline version).

    1. Take some time to explore the tool. In the online version, the default starting point is a bubble chart of 2020 life expectancy vs. income, per country (with bubble size associated with total population). In the offline version, select the “Bubbles” option.

    2. Can you identify the available variable categories and some of the variables? [You may need to dig around a bit.]

    3. Why do you think that Gapminder has selected Life Expectancy and Income as the default plotting variables?

    4. Replace Life Expectancy with Babies per woman. Observe and discuss the changes from the default plot.

    5. Formulate a few questions that could be answered with the default data.

    6. Formulate a few questions that could be answered using some of the other variables.

    7. At what point in the data science workflow do you think that visualizations of this nature could be useful?

    8. Do these visualizations provide a sound understanding of the system under investigation (the geopolitical Earth)?

    9. What do you think the data sources are for the underlying dataset? [You may need to dig around the internet to answer this question.]

    10. Are all variables and measurements equally trustworthy? How could you figure this out?

    11. Is the underlying dataset structured or unstructured?

    12. Provide a potential data model for the dataset.

    13. What are the types of the 4 default variables (Life Expectancy, Income, Population, World Regions)?

    14. Play around with the charts for a bit. Can you find pairs of variables that are positively correlated? Negatively correlated? Uncorrelated? (A small numerical check is sketched at the end of this list.)

    15. Among those variables that are correlated, do any pairs seem to exhibit a dependent/independent relationship? How could you identify such pairs?

    16. Can you provide an eyeball estimate of the mean, the median, and the range of various numerical variables?

    17. Can you provide an eyeball estimate of the mode of the categorical variables?

    18. Can you identify epochal moments in the data (special temporal points at which a shift occurs, say)?

    19. Are the tool and its underlying dataset usable? What factors does your answer depend on?
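
    As a complement to the eyeball approach of questions 14 and 15, the correlations could also be checked numerically once the indicators of interest have been downloaded from Gapminder. The sketch below uses Python with pandas; the file name gapminder_merged.csv and the column names are placeholders for whatever you actually obtain.

    ```python
    import pandas as pd

    # Assumption: the Gapminder indicators have been downloaded as CSV files and
    # merged into one table; the file and column names below are placeholders.
    df = pd.read_csv("gapminder_merged.csv")

    numeric_cols = ["life_expectancy", "income", "population"]

    # Pairwise Pearson correlations: values near +1 suggest a positive linear
    # association, near -1 a negative one, and near 0 little linear association.
    print(df[numeric_cols].corr())

    # Spearman (rank) correlations are less sensitive to skewed variables,
    # such as income.
    print(df[numeric_cols].corr(method="spearman"))
    ```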