1.6 Python for Data Science

While Python remains a bona fide programming language, it is as a data science tool that its popularity has soared.

Let us take a look at some of its data functionality.

1.6.1 Pandas and Data Frames

The Pandas module provides Python with an equivalent of R data frames. Essentially, a data frame is a two-dimensional tabular data structure in which different columns can hold values of different types.

In this section, we cover the basics of Pandas data frames (and introduce a dataset found in the Seaborn module, which is used for data visualization).

Comprehensive references for doing data analysis with Python include [16] and [18]. It is also handy to keep a Pandas cheat sheet close at hand.

We start by importing the required modules, with the customary aliases pd and sns:

import pandas as pd  
import seaborn as sns  

Loading Data

There are various ways to obtain data. One way is to use a pre-built sample dataset, such as titanic from seaborn.

titanic = sns.load_dataset("titanic")
type(titanic)
<class 'pandas.core.frame.DataFrame'>
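
To see which sample datasets are available, seaborn provides get_dataset_names() (note that the list is fetched from seaborn's online data repository, so an internet connection is required):

sns.get_dataset_names()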

Another way is to read a csv file using pandas.read_csv(). For instance, if the file calculus.csv is in the data folder, we would call:

calculus = pd.read_csv('data/calculus.csv')

We can look at the first rows using the head() method of a DataFrame object, such as:

titanic.head()
   survived  pclass     sex   age  ...  deck  embark_town  alive  alone
0         0       3    male  22.0  ...   NaN  Southampton     no  False
1         1       1  female  38.0  ...     C    Cherbourg    yes  False
2         1       3  female  26.0  ...   NaN  Southampton    yes   True
3         1       1  female  35.0  ...     C  Southampton    yes  False
4         0       3    male  35.0  ...   NaN  Southampton     no   True

[5 rows x 15 columns]

We can also look at the last rows using the tail() method, such as:

calculus.tail(6)
       ID Sex  Grade    GPA  Year
94  10095   F     69   6.49     1
95  10096   M     99  12.61     1
96  10097   M     40   4.17     2
97  10098   F     66   6.94     1
98  10099   M     83  10.09     1
99  10100   F     52   6.76     2

We can obtain a quick summary of a DataFrame object using the describe() method:

titanic.describe()
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
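
By default, describe() summarizes only the numeric columns; passing include='all' would extend the summary to the categorical columns (such as sex and embark_town) as well:

titanic.describe(include='all')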

We can also obtain a summary of a subset of the columns:

df1 = titanic[['survived', 'age', 'fare']]
df1.describe()
         survived         age        fare
count  891.000000  714.000000  891.000000
mean     0.383838   29.699118   32.204208
std      0.486592   14.526497   49.693429
min      0.000000    0.420000    0.000000
25%      0.000000   20.125000    7.910400
50%      0.000000   28.000000   14.454200
75%      1.000000   38.000000   31.000000
max      1.000000   80.000000  512.329200

Or we can compute specific summary statistics on the full object or on a specific column:

df1.mean()
survived     0.383838
age         29.699118
fare        32.204208
dtype: float64

df1['age'].median()
28.0

Data Frame Operations

We continue with some basic operations on data frames, using another built-in dataset:

crashes = sns.load_dataset('car_crashes')
crashes.head()
   total  speeding  alcohol  ...  ins_premium  ins_losses  abbrev
0   18.8     7.332    5.640  ...       784.55      145.08      AL
1   18.1     7.421    4.525  ...      1053.48      133.93      AK
2   18.6     6.510    5.208  ...       899.47      110.35      AZ
3   22.4     4.032    5.824  ...       827.34      142.39      AR
4   12.0     4.200    3.360  ...       878.41      165.63      CA

[5 rows x 8 columns]

New columns can be added to any data frame. In this example, we generate a new column consisting of strings of the form Cnnn, where nnn is a zero-padded three-digit number, so that rows 1, 2, … of crashes correspond to C001, C002, …

labels = ['C'+"{:03}".format(i+1) for i in range(crashes.shape[0])]
crashes['label'] = labels
crashes.head()
   total  speeding  alcohol  ...  ins_losses  abbrev  label
0   18.8     7.332    5.640  ...      145.08      AL   C001
1   18.1     7.421    4.525  ...      133.93      AK   C002
2   18.6     6.510    5.208  ...      110.35      AZ   C003
3   22.4     4.032    5.824  ...      142.39      AR   C004
4   12.0     4.200    3.360  ...      165.63      CA   C005

[5 rows x 9 columns]

Quite often, a particular column in a csv file serves as the index column. We can set this column to be an index column via the set_index() method:

df = crashes.set_index('label') 
df.head()
       total  speeding  alcohol  ...  ins_premium  ins_losses  abbrev
label                            ...                                 
C001    18.8     7.332    5.640  ...       784.55      145.08      AL
C002    18.1     7.421    4.525  ...      1053.48      133.93      AK
C003    18.6     6.510    5.208  ...       899.47      110.35      AZ
C004    22.4     4.032    5.824  ...       827.34      142.39      AR
C005    12.0     4.200    3.360  ...       878.41      165.63      CA

[5 rows x 8 columns]

Note that crashes is not affected by set_index(). To make the change directly to crashes, we would need to replace

df = crashes.set_index('label')

with

crashes.set_index('label', inplace=True)
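
As an aside, when the data comes from a csv file, the index column can also be set at load time via read_csv()'s index_col parameter; a minimal sketch (the file name here is hypothetical, since crashes was loaded from seaborn):

crashes = pd.read_csv('data/car_crashes.csv', index_col='label')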

We can subset a data frame by row and column labels via loc[], as in the examples below:

df.loc['C010':'C013',['speeding','total']]
       speeding  total
label                 
C010      3.759   17.9
C011      2.964   15.6
C012      9.450   17.5
C013      5.508   15.3
df.loc['C005':'C008',:]
       total  speeding  alcohol  ...  ins_premium  ins_losses  abbrev
label                            ...                                 
C005    12.0     4.200    3.360  ...       878.41      165.63      CA
C006    13.6     5.032    3.808  ...       835.50      139.91      CO
C007    10.8     4.968    3.888  ...      1068.73      167.02      CT
C008    16.2     6.156    4.860  ...      1137.87      151.48      DE

[4 rows x 8 columns]

We can also extract rows and columns by position via iloc[].

df.iloc[1:5,0:4]
       total  speeding  alcohol  not_distracted
label                                          
C002    18.1     7.421    4.525          16.290
C003    18.6     6.510    5.208          15.624
C004    22.4     4.032    5.824          21.056
C005    12.0     4.200    3.360          10.920

We can reset the index in a data frame via the reset_index() method. This has the effect of turning label into a data column like all other columns in the data frame df, for instance:

df.reset_index(inplace=True) 
df.head()
  label  total  speeding  alcohol  ...  no_previous  ins_premium  ins_losses  abbrev
0  C001   18.8     7.332    5.640  ...       15.040       784.55      145.08      AL
1  C002   18.1     7.421    4.525  ...       17.014      1053.48      133.93      AK
2  C003   18.6     6.510    5.208  ...       17.856       899.47      110.35      AZ
3  C004   22.4     4.032    5.824  ...       21.280       827.34      142.39      AR
4  C005   12.0     4.200    3.360  ...       10.680       878.41      165.63      CA

[5 rows x 9 columns]

It is possible to use the iterrows() generator to yield both the index and the row of a data frame. For instance, the next block of code prints the labels corresponding to the first five rows.

for index, row in df[0:5].iterrows():
    print(row['label'])
C001
C002
C003
C004
C005
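
While iterrows() is convenient, it can be slow on large data frames; pandas also provides itertuples(), which yields the rows as named tuples and is typically faster:

for row in df[0:5].itertuples():
    print(row.label)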

Columns and rows can be dropped from a data frame via the drop() method. In the example below, we drop the label column from df and assign the outcome to df2 (but note df itself is not changed):

df2 = df.drop('label', axis=1) 
df2.head()
   total  speeding  alcohol  ...  ins_premium  ins_losses  abbrev
0   18.8     7.332    5.640  ...       784.55      145.08      AL
1   18.1     7.421    4.525  ...      1053.48      133.93      AK
2   18.6     6.510    5.208  ...       899.47      110.35      AZ
3   22.4     4.032    5.824  ...       827.34      142.39      AR
4   12.0     4.200    3.360  ...       878.41      165.63      CA

[5 rows x 8 columns]

In contrast, the total column is dropped from df (and df is modified as a result):

df.drop('total', axis=1, inplace=True) 
df.head()
  label  speeding  alcohol  ...  ins_premium  ins_losses  abbrev
0  C001     7.332    5.640  ...       784.55      145.08      AL
1  C002     7.421    4.525  ...      1053.48      133.93      AK
2  C003     6.510    5.208  ...       899.47      110.35      AZ
3  C004     4.032    5.824  ...       827.34      142.39      AR
4  C005     4.200    3.360  ...       878.41      165.63      CA

[5 rows x 8 columns]

We can rename the columns of a data frame via the rename() method:

df.rename(columns={'label':'case', 'abbrev':'abbr'}, inplace=True)
df.head()
   case  speeding  alcohol  ...  ins_premium  ins_losses  abbr
0  C001     7.332    5.640  ...       784.55      145.08    AL
1  C002     7.421    4.525  ...      1053.48      133.93    AK
2  C003     6.510    5.208  ...       899.47      110.35    AZ
3  C004     4.032    5.824  ...       827.34      142.39    AR
4  C005     4.200    3.360  ...       878.41      165.63    CA

[5 rows x 8 columns]

What would we expect the following chunk of code to do?

newColumnNames = {}
for name in list(df):
    newColumnNames[name] = name.capitalize()
    
df2 = df.rename(columns=newColumnNames)

Rows can be filtered according to a given condition. In the example below, b and d are Pandas series of booleans related to the df data frame:

b = df['ins_losses'] > 160 
d = df['not_distracted'] < 12

If we want to return the rows of df for which ins_losses is greater than 160 AND not_distracted is less than 12, we would simply call:

df[b & d] 
    case  speeding  alcohol  ...  ins_premium  ins_losses  abbr
4   C005     4.200    3.360  ...       878.41      165.63    CA
6   C007     4.968    3.888  ...      1068.73      167.02    CT
20  C021     4.250    4.000  ...      1048.78      192.70    MD

[3 rows x 8 columns]
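
The intermediate boolean series are optional; the same filter can be written inline, as long as each comparison is parenthesized:

df[(df['ins_losses'] > 160) & (df['not_distracted'] < 12)]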

To return the rows of df for which ins_losses is greater than 160 OR abbr is equal to AL, we would call:

df[b | (df['abbr'] == 'AL')]
    case  speeding  alcohol  ...  ins_premium  ins_losses  abbr
0   C001     7.332    5.640  ...       784.55      145.08    AL
4   C005     4.200    3.360  ...       878.41      165.63    CA
6   C007     4.968    3.888  ...      1068.73      167.02    CT
18  C019     7.175    6.765  ...      1281.55      194.78    LA
20  C021     4.250    4.000  ...      1048.78      192.70    MD
36  C037     6.368    5.771  ...       881.51      178.86    OK

[6 rows x 8 columns]
Exercises
  1. Obtain the iris data set through seaborn and generate some summary statistics.

  2. Write code to change the labels in the data frame crashes from Cnnn to Incident nnn and turn that column into an index column. Commit these changes to crashes.

  3. Extract a data frame from df consisting only of the columns speeding and alcohol for which the speeding values are at least 3.0 and the alcohol values are at most 4.5.

  4. There is a powerful way to filter rows involving complex boolean expressions via the query() method. For instance,

df.query("ins_losses > 160 & ins_premium < 900 & abbr == 'CA'")
   case  speeding  alcohol  ...  ins_premium  ins_losses  abbr
4  C005       4.2     3.36  ...       878.41      165.63    CA

[1 rows x 8 columns]

Extract a data frame from df via query() consisting of records for which alcohol is at most 4.0 and abbr is neither CA nor LA.

1.6.2 Data Wrangling

We will now take a look at some ways to combine and clean data frames.

Merging and Joins

Consider a fictitious test score dataset. There are two sections in the class, contained in testA.csv and testB.csv. Each row consists of a student ID, a section, and a test mark. The file gpa.csv contains information on the students’ GPAs and their current year of study.

We start by reading in the two test score files (recall that pd is the alias for the pandas module).

dfA = pd.read_csv('data/testA.csv')
dfB = pd.read_csv('data/testB.csv')

The first entries of each set are shown below:

dfA.head()
      ID Section  Mark
0  10021       A    47
1  10073       A    83
2  10084       A    51
3  10102       A    57
4  10175       A    71
dfB.head()
      ID Section  Mark
0  10011       B    97
1  10063       B    63
2  10094       B    71
3  10110       B    77
4  10133       B    81

We now read in the GPA information.

gpa = pd.read_csv('data/gpa.csv')
gpa.head()
   Student ID   GPA  Year
0       10011  12.0   3.0
1       10021   NaN   3.0
2       10063   5.6   3.0
3       10073   9.8   3.0
4       10084   6.2   3.0

Note that the column title for student ID is different in the test score files and in gpa.csv.

We now concatenate the two data frames of test scores into a single object using the pandas function concat().

df = pd.concat([dfA,dfB])
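
Note that concat() retains the row indices of dfA and dfB, so the combined data frame contains duplicate index values; if a fresh 0, 1, 2, … index is preferred, pandas provides the ignore_index option:

df = pd.concat([dfA, dfB], ignore_index=True)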

We now merge the GPA data frame with this combined test score data frame.

df3 = pd.merge(gpa, df, left_on='Student ID', right_on='ID')
df3
    Student ID   GPA  Year     ID Section  Mark
0        10011  12.0   3.0  10011       B    97
1        10021   NaN   3.0  10021       A    47
2        10063   5.6   3.0  10063       B    63
3        10073   9.8   3.0  10073       A    83
4        10084   6.2   3.0  10084       A    51
5        10094   8.1   NaN  10094       B    71
6        10102   6.9   2.0  10102       A    57
7        10110   8.4   2.0  10110       B    77
8        10133  10.4   2.0  10133       B    81
9        10145   5.1   2.0  10145       B    41
10       10162   7.2   2.0  10162       B    68
11       10175   6.9   1.0  10175       A    71
12       10189   6.1   1.0  10189       B    68
13       10190  11.2   1.0  10190       A    91
14       10199   NaN   1.0  10199       A    56

By default, merge() performs an inner join, but it can also perform outer, left, and right joins via its how parameter.

Let us see what happens when we merge gpa with dfA.

pd.merge(gpa, dfA, left_on='Student ID', right_on='ID', how='outer').drop('Student ID', axis=1)
     GPA  Year       ID Section  Mark
0   12.0   3.0      NaN     NaN   NaN
1    NaN   3.0  10021.0       A  47.0
2    5.6   3.0      NaN     NaN   NaN
3    9.8   3.0  10073.0       A  83.0
4    6.2   3.0  10084.0       A  51.0
5    8.1   NaN      NaN     NaN   NaN
6    6.9   2.0  10102.0       A  57.0
7    8.4   2.0      NaN     NaN   NaN
8   10.4   2.0      NaN     NaN   NaN
9    5.1   2.0      NaN     NaN   NaN
10   7.2   2.0      NaN     NaN   NaN
11   6.9   1.0  10175.0       A  71.0
12   6.1   1.0      NaN     NaN   NaN
13  11.2   1.0  10190.0       A  91.0
14   NaN   1.0  10199.0       A  56.0

We can see that there is a row for every row in gpa and that only those rows for which Student ID is present in dfA have merged data (what happens if the .drop('Student ID', axis=1) is omitted?).
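
For instance, a left join (a variant of the call above) would keep every row of gpa but discard any row of dfA whose ID does not appear in gpa:

pd.merge(gpa, dfA, left_on='Student ID', right_on='ID', how='left')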

Data Cleansing

Note that in the merged data frame df3 (and in gpa), there are rows containing NaN. If we do not want any rows with such values, we can use the dropna() method.

df3.dropna()
    Student ID   GPA  Year     ID Section  Mark
0        10011  12.0   3.0  10011       B    97
2        10063   5.6   3.0  10063       B    63
3        10073   9.8   3.0  10073       A    83
4        10084   6.2   3.0  10084       A    51
6        10102   6.9   2.0  10102       A    57
7        10110   8.4   2.0  10110       B    77
8        10133  10.4   2.0  10133       B    81
9        10145   5.1   2.0  10145       B    41
10       10162   7.2   2.0  10162       B    68
11       10175   6.9   1.0  10175       A    71
12       10189   6.1   1.0  10189       B    68
13       10190  11.2   1.0  10190       A    91

We can also drop only the rows with NaN in specific columns. If we do not want to retain observations with a NaN in the Year column, we would call:

gpa.dropna(subset=['Year']) 
    Student ID   GPA  Year
0        10011  12.0   3.0
1        10021   NaN   3.0
2        10063   5.6   3.0
3        10073   9.8   3.0
4        10084   6.2   3.0
6        10102   6.9   2.0
7        10110   8.4   2.0
8        10133  10.4   2.0
9        10145   5.1   2.0
10       10162   7.2   2.0
11       10175   6.9   1.0
12       10189   6.1   1.0
13       10190  11.2   1.0
14       10199   NaN   1.0

Instead of dropping rows containing NaN, we could replace the missing values with some other chosen value (0, say).

gpa.fillna(0)
    Student ID   GPA  Year
0        10011  12.0   3.0
1        10021   0.0   3.0
2        10063   5.6   3.0
3        10073   9.8   3.0
4        10084   6.2   3.0
5        10094   8.1   0.0
6        10102   6.9   2.0
7        10110   8.4   2.0
8        10133  10.4   2.0
9        10145   5.1   2.0
10       10162   7.2   2.0
11       10175   6.9   1.0
12       10189   6.1   1.0
13       10190  11.2   1.0
14       10199   0.0   1.0

Note that all the NaNs are changed to 0.0. To change only the GPA column, we can pass a dictionary to fillna() (note that, without inplace=True, this returns a new data frame and leaves the original gpa data frame unchanged):

gpa.fillna({'GPA':0.0})
    Student ID   GPA  Year
0        10011  12.0   3.0
1        10021   0.0   3.0
2        10063   5.6   3.0
3        10073   9.8   3.0
4        10084   6.2   3.0
5        10094   8.1   NaN
6        10102   6.9   2.0
7        10110   8.4   2.0
8        10133  10.4   2.0
9        10145   5.1   2.0
10       10162   7.2   2.0
11       10175   6.9   1.0
12       10189   6.1   1.0
13       10190  11.2   1.0
14       10199   0.0   1.0
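
To actually commit the change to gpa, we would use the inplace option, as with set_index() and drop() earlier:

gpa.fillna({'GPA': 0.0}, inplace=True)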

We can apply a function to a data frame column using the map() method. The following will add a Grade column to dfA, containing Pass or Fail based on the Mark column.

def markToGrade(x):
    res = 'Fail'
    if x >= 50:
        res = 'Pass'
    return res

dfA['Grade'] = dfA['Mark'].map(markToGrade)
dfA
      ID Section  Mark Grade
0  10021       A    47  Fail
1  10073       A    83  Pass
2  10084       A    51  Pass
3  10102       A    57  Pass
4  10175       A    71  Pass
5  10190       A    91  Pass
6  10199       A    56  Pass
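
For a simple two-way split like this one, the same column could also be produced with an inline lambda:

dfA['Grade'] = dfA['Mark'].map(lambda x: 'Pass' if x >= 50 else 'Fail')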
Exercises
  1. Obtain a data frame df4 by changing the column name of Student ID in the data frame gpa to ID. Then create df5 by merging df4 and df using pd.merge(df4, df, on='ID') and summarize the resulting data frame.

  2. Perform an outer join with df4 from the previous exercise and dfB.

  3. Drop the observations in the original gpa data frame for which the only NaN values are found in the GPA column.

  4. Replace the NaN in the original gpa’s Year column with the string Unknown.

  5. Modify markToGrade so that a mark between 80 and 100 (inclusive) is converted to an A, a mark at least 70 but less than 80 is converted to a B, a mark at least 60 but less than 70 is converted to a C, a mark at least 50 but less than 60 is converted to a D, and a mark below 50 is converted to an F.

def markToGrade(x):
    res = 'F'
    # Your code here
    return res

Add a Grade column to df3 containing the converted grades.

1.6.3 Data Aggregation

Sometimes, the data in a dataset can be divided into groups, and we might want to obtain summary statistics for each group. Analysis by group and aggregation can help us gain insight into these groups.

Summaries by Groups

We first illustrate obtaining simple statistics on groups using a dataset containing calculus marks (recall that pd is the pandas alias).

calc = pd.read_csv('data/calculus.csv')
calc.head()
      ID Sex  Grade   GPA  Year
0  10001   F     47  5.02     2
1  10002   M     57  3.82     1
2  10003   M     91  7.70     1
3  10004   M     71  4.82     1
4  10005   F     83  7.91     1

Suppose that we want to see separate mean grades and mean GPAs based on the Sex variable. We can use the groupby() method to perform the task:

calc[['Sex','Grade','GPA']].groupby('Sex').mean()
         Grade       GPA
Sex                     
F    67.901961  6.539804
M    64.408163  5.609388

If we want the means, the standard deviations, and the medians for Grade and GPA grouped by Sex, we can use the more general agg() method. Note that we first need to import NumPy (alias np) to access these simple statistics functions.

import numpy as np

calc[['Sex','Grade','GPA']].groupby('Sex').agg([np.mean, np.std, np.median])
         Grade                         GPA                 
          mean        std median      mean       std median
Sex                                                        
F    67.901961  20.162594   66.0  6.539804  3.008527   6.24
M    64.408163  16.237711   62.0  5.609388  2.756965   4.77
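
Equivalently, pandas accepts the names of common statistics as strings, which avoids reaching into NumPy:

calc[['Sex','Grade','GPA']].groupby('Sex').agg(['mean', 'std', 'median'])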

If we are interested in the Grade mean and the GPA median, grouped by Sex, we can use a dictionary to specify which function is applied to which column as follows:

calc[['Sex','Grade','GPA']].groupby('Sex').agg({'Grade': np.mean, 'GPA': np.median})
         Grade   GPA
Sex                 
F    67.901961  6.24
M    64.408163  4.77

We can also build custom aggregate functions. The following chunk of code computes the sum of squares for the Grade and GPA columns.

def sumOfSquares(xs):
    return np.dot(xs,xs)

calc[['Sex','Grade','GPA']].groupby('Sex').agg(sumOfSquares)
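
As a quick sanity check of the custom function (not shown in the original output), np.dot(xs, xs) computes the sum of the squared entries of xs:

sumOfSquares(np.array([1.0, 2.0, 3.0]))
14.0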

Pivot Tables

We could also have obtained the mean Grade and mean GPA for the Sex groups via pivot_table(), as below:

calc[['Sex','Grade','GPA']].pivot_table(index='Sex', aggfunc=np.mean)

To obtain a pivot table displaying the number of students in each Year grouped by Sex, we can run the following code:

calc[['Sex','Year']].pivot_table(index='Sex', columns=['Year'], aggfunc=len, margins=False)

We can also print the margins (totals) by changing the appropriate option:

calc[['Sex','Year']].pivot_table(index='Sex', columns=['Year'], aggfunc=len, margins=True)
Exercises
  1. Obtain the mean for each of the Year groups in the calc data frame.

  2. Obtain the mean, standard deviation, and median for each of the Year groups in the calc data frame, using agg().

  3. Produce a summary of the calc data frame giving the Grade mean and standard deviation, and the GPA median, grouped by Year.

  4. Complete the definition of a function that returns Satisfactory if the average of the array arr is at least 65.0 and Unsatisfactory otherwise.

def groupStatus(arr):
    res = ''
    # Your code here
    
    return res

Determine the group status in the calc dataset by both Sex and Year, for the Grade variable.

  5. Write a function that produces the pivot table displaying the number of students with a passing grade by Sex and Year (hint: if arr is a NumPy array, then arr >= 50.0 gives an array of the same length such that element i is True if and only if arr[i] >= 50.0).

1.6.4 Combining Python and R

Ask most data scientists and they will tell you that they are a Python person or an R person (or, more rarely, a Julia person).

In practice, Python is probably best for data processing (in terms of efficiency, especially with large datasets), while R has a package (or three!) for pretty much any statistical and data visualization task under the sun.

It is now possible to use Python within R through the reticulate package.

The reticulate vignette contains detailed information on the process; for the time being, we will only give a small example detailing how this could be achieved, based on [19].

library(reticulate)

We start by creating a variable x in the Python session:

x = list(range(8))

Once that is done, we can access the Python variable x from R through the py object, which exposes the Python main module:

str(py)
Module(__main__)

py$x
[1] 0 1 2 3 4 5 6 7

We can also create a new variable y in the Python session from R, by assigning data to py$y:

py$y <- head(AirPassengers) # a built-in R dataset

This variable can now be displayed in the Python session, and operated on, as needed:

print(y)
[112.0, 118.0, 132.0, 129.0, 121.0, 135.0]
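
The conversion also handles richer objects. For instance, a pandas data frame created in the Python session (a small sketch, assuming pandas is installed in the environment reticulate is using)

import pandas as pd
d = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

would be available in R as py$d, automatically converted to a standard R data.frame.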

It is not difficult to imagine how to expand this back and forth to more complex data analysis situations, leaving us the option of picking whatever language is best suited to a specific task.

References

[16]
J. VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data. Sebastopol, CA: O'Reilly Media, 2016.
[18]
J. Kazil and K. Jarmul, Data Wrangling with Python: Tips and Tools to Make Your Life Easier. O'Reilly Media, 2016.
[19]
Y. Xie, C. Dervieux, and E. Riederer, R Markdown Cookbook. Boca Raton, FL: Chapman & Hall/CRC, 2020.