17.4 Examples

In this section, we provide web scraping examples (in R and Python) that highlight some of the notions discussed in this chapter. Keep in mind that, by the time you get to this part of the book, the websites being scraped may well have changed their structure.

17.4.1 Wikipedia

This example is inspired by a task found in [352]. We analyze the list of largest cities on the planet, found on Wikipedia.

Take a moment to familiarize yourself with the page. What information does it contain? In particular, do you understand what the table’s columns are recording?

Preamble

We will be using the following R libraries:

  • stringr, stringi, and strex, for string manipulation;

  • XML, for reading and creating XML documents;

  • maps, to display maps, and

  • rvest, which simplifies the downloading and parsing of web pages in R.

Loading and Parsing the Data

We read the material from the Wikipedia website using rvest’s read_html() command, and we store it in the object html.

html <- rvest::read_html("https://en.wikipedia.org/wiki/List_of_largest_cities")

A call to the object shows the entire structure of the page under the hood (so to speak).

html
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

Now that we have the information from the webpage, we parse it into a document tree (an object that can be queried for specific elements).

cities_parsed <- XML::htmlParse(html, encoding="UTF-8")

Note that this new object contains the same information as the original object html, but if it were displayed, it would appear in a format closer to what a human programmer would expect to see (at least, to some extent). We opt not to display it here because of its excessive length.

Now that the information from the webpage is parsed, we extract the tables it contains, using XML’s readHTMLTable().

tables <- XML::readHTMLTable(cities_parsed, stringsAsFactors = FALSE)

Essentially, readHTMLTable() hunts for <table>...</table> tag pairs in the file; it finds 4 here. We get some structural information by calling str on the resulting object tables.

str(tables)
List of 4
 $ NULL:'data.frame':   5 obs. of  1 variable:
  ..$ V1: chr [1:5] "Ekistics" "" "List of largest cities\nList of cities proper by population density\nConurbation\nMegacity\nMegalopolis\nSettlement hierarchy" " Cities portal" ...
 $ NULL:'data.frame':   84 obs. of  13 variables:
  ..$ V1 : chr [1:84] "City[a]" "Definition" " " "Tokyo" ...
  ..$ V2 : chr [1:84] "Country" "Population" "" " Japan" ...
  ..$ V3 : chr [1:84] "UN 2018 population estimates[b]" "Area.mw-parser-output .nobold{font-weight:normal}(km2)" "" "37,468,000" ...
  ..$ V4 : chr [1:84] "City proper[c]" "Density(/km2)" "" "Metropolis prefecture" ...
  ..$ V5 : chr [1:84] "Urban area[8]" "Population" "" "13,515,271" ...
  ..$ V6 : chr [1:84] "Metropolitan area[d]" "Area(km2)" "" "2,191" ...
  ..$ V7 : chr [1:84] NA "Density(/km2)" "" "6,169[13]" ...
  ..$ V8 : chr [1:84] NA "Population" "" "39,105,000" ...
  ..$ V9 : chr [1:84] NA "Area(km2)" "" "8,231" ...
  ..$ V10: chr [1:84] NA "Density(/km2)" "" "4,751[e]" ...
  ..$ V11: chr [1:84] NA NA "" "37,274,000" ...
  ..$ V12: chr [1:84] NA NA "" "13,452" ...
  ..$ V13: chr [1:84] NA NA "" "2,771[14]" ...
 $ NULL:'data.frame':   7 obs. of  2 variables:
  ..$ V1: chr [1:7] "v\nt\ne\n\nWorld's largest cities" "City proper" "Metropolitan area" "Urban area/agglomeration" ...
  ..$ V2: chr [1:7] NA "Capitals\nAfrica\n\nAmericas (North\n\nLatin\nCentral\n\nSouth)\n\nAsia (Arab world\n\nASEAN\nArabia\n\nEast, S"| __truncated__ "Americas (North\n\nSouth\n\nWest Indies)\n\nEurope (European Union)\nAsia\nAfrica\nMiddle East\nOceania" "Asia\nAfrica\n\nEurope\nEuropean Union\nNordic\n\nNorth America\nOceania\nTransborder" ...
 $ NULL:'data.frame':   9 obs. of  2 variables:
  ..$ V1: chr [1:9] "v\nt\ne\n\nCities" "Urban geography" "Urban government" "Urban economics" ...
  ..$ V2: chr [1:9] NA "Urban area\n\nCity centre\nDowntown\nSuburb\nExurb\nCore city\nTwin cities\nSatellite city\nEdge city\nCommuter"| __truncated__ "City status\n\nMunicipality\ndirect-controlled\nIndependent city\nCity-state\nAutonomous city\nCapital city" "World's cities by GDP\nGlobal city\nCentral business district\nMost expensive cities\nCheapest cities\nMost liv"| __truncated__ ...

We can display the information of the second table (tables[[2]]) in a format that is easier to read:

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
City[a] Country UN 2018 population estimates[b] City proper[c] Urban area[8] Metropolitan area[d] NA NA NA NA NA NA NA
Definition Population Area.mw-parser-output .nobold{font-weight:normal}(km2) Density(/km2) Population Area(km2) Density(/km2) Population Area(km2) Density(/km2) NA NA NA
 
Tokyo  Japan 37,468,000 Metropolis prefecture 13,515,271 2,191 6,169[13] 39,105,000 8,231 4,751[e] 37,274,000 13,452 2,771[14]
Delhi  India 28,514,000 Capital City 16,753,235 1,484 11,289[15] 31,870,000 2,233 14,272[f] 29,000,000 3,483 8,326[16]
Shanghai  China 25,582,000 Municipality 24,870,895 6,341 3,922[17][18] 22,118,000 4,069 5,436[g]

Data Processing and Data Cleaning

Let us extract the table containing the information of interest, the second one:

cities_table <- tables[[2]]

The column headers are not as we might want them:

colnames(cities_table)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
[13] "V13"

Compare with the second row of cities_table:

cities_table[2,]
          V1         V2                                                     V3
2 Definition Population Area.mw-parser-output .nobold{font-weight:normal}(km2)
             V4         V5        V6            V7         V8        V9
2 Density(/km2) Population Area(km2) Density(/km2) Population Area(km2)
            V10  V11  V12  V13
2 Density(/km2) <NA> <NA> <NA>

This is still not ideal: the first two rows of the table contain header information, the third row is essentially empty, and the data itself only starts at row 4. We need to input the column names manually, and delete the non-data rows.

colnames(cities_table) <- c("city", "country", "un.2018.pop","city.def",
                            "city.pop","city.area","city.den",
                            "metro.pop","metro.area","metro.den",
                            "urban.pop","urban.area","urban.den")
cities_table <- data.frame(cities_table[4:nrow(cities_table),])

We select the columns of the table that are of interest to us:

  • city;

  • country;

  • urban.pop;

  • urban.area, and

  • urban.den,

that is to say, columns 1, 2, 11, 12, and 13.

cities_table <- cities_table[,c(1,2,11,12,13)]

It is never a bad idea to validate our work as we build the scraper: are we getting what we would expect along the way? Let us take a look at the structure of the data and compare the first 6 entries of the table to the information we can see on the Wikipedia page.

str(cities_table)
'data.frame':   81 obs. of  5 variables:
 $ city      : chr  "Tokyo" "Delhi" "Shanghai" "São Paulo" ...
 $ country   : chr  " Japan" " India" " China" " Brazil" ...
 $ urban.pop : chr  "37,274,000" "29,000,000" "—" "21,734,682" ...
 $ urban.area: chr  "13,452" "3,483" "—" "7,947" ...
 $ urban.den : chr  "2,771[14]" "8,326[16]" "—" "2,735[20]" ...
head(cities_table)
city country urban.pop urban.area urban.den
4 Tokyo  Japan 37,274,000 13,452 2,771[14]
5 Delhi  India 29,000,000 3,483 8,326[16]
6 Shanghai  China
7 São Paulo  Brazil 21,734,682 7,947 2,735[20]
8 Mexico City  Mexico 21,804,515 7,866 2,772[22]
9 Cairo  Egypt

We see that all variables appear as character strings, and that there are oddities with some of the numerical values.

We can then extract the numerical values using stringr’s str_extract() and regex(), or strex’s str_extract_numbers() and str_first_number().

The urban populations are all above 5M, and they are all displayed with comma separators, so their values look like ddd,ddd,ddd (where d \(\in \{0,1,\ldots,9\}\), and the leading group may contain fewer than three digits). We extract only the portion of each string that follows this pattern from the population column using str_extract() (which does not retain the footnote markers), remove the commas after the fact (using gsub()), and coerce the outcome to a numerical format (using as.numeric()).

cities_table$urban.pop <- as.numeric(
  gsub(",", "", 
       stringr::str_extract(
         cities_table$urban.pop,
         stringr::regex("\\d+,\\d+,\\d+")
         )
       )
  )

The area column contains no footnote, so we can directly extract the comma-separated values (using str_extract_numbers()) and coerce to a vector using as.numeric() (otherwise, the output would be a list).

cities_table$urban.area <- as.numeric(
  strex::str_extract_numbers(
    cities_table$urban.area,
    commas=TRUE
    )
  )

Finally, we extract the first number that appears in each density value, removing the footnotes (both characters and numeric), using str_first_number(); the result is then coerced to a numeric vector using as.numeric().

cities_table$urban.den <- 
  as.numeric(strex::str_first_number(
    cities_table$urban.den,
    commas=TRUE
    )
  )  

The first entries are shown below.

city country urban.pop urban.area urban.den
Tokyo  Japan 37274000 13452 2771
Delhi  India 29000000 3483 8326
Shanghai  China NA NA NA
São Paulo  Brazil 21734682 7947 2735
Mexico City  Mexico 21804515 7866 2772
Cairo  Egypt NA NA NA

We can also obtain latitude and longitude details for roughly 41K cities; here, we only read in the first 200 entries.

world_cities = read.csv("data/worldcities.csv", stringsAsFactors = TRUE, nrows=200)
str(world_cities)
'data.frame':   200 obs. of  3 variables:
 $ city_ascii: Factor w/ 198 levels "Abidjan","Ahmedabad",..: 162 70 37 110 101 141 139 140 105 47 ...
 $ lat       : num  35.69 -6.21 28.66 18.97 14.6 ...
 $ lng       : num  139.7 106.8 77.2 72.8 121 ...

We extract a 5-character code for each city, in the hope of being able to match the cities across the two datasets. We remove accents using stringi’s stri_trans_general(), which converts every character to its nearest equivalent in the Latin-ASCII character set.

world_cities$code = stringi::stri_trans_general(
  tolower(
    substr(
      world_cities$city_ascii,1,5)
    ), 
  "Latin-ASCII"
  )

cities_table$code = stringi::stri_trans_general(
  tolower(
    substr(cities_table$city,1,5)
    ), 
  "Latin-ASCII"
  )

Let us merge the data frames:

complete = merge(cities_table, world_cities, all.x=TRUE)
knitr::kable(complete)
code city country urban.pop urban.area urban.den city_ascii lat lng
ahmed Ahmedabad  India 6300000 NA NA Ahmedabad 23.0300 72.5800
alexa Alexandria  Egypt NA NA NA Alexandria 31.2000 29.9167
atlan Atlanta  United States 5949951 21690 274 Atlanta 33.7627 -84.4224
baghd Baghdad  Iraq NA NA NA Baghdad 33.3500 44.4167
banga Bangalore  India NA NA NA Bangalore 12.9699 77.5980
bangk Bangkok  Thailand 16255900 7762 2094 Bangkok 13.7500 100.5167
barce Barcelona  Spain 5474482 NA NA Barcelona 41.3825 2.1769
beiji Beijing  China NA NA NA Beijing 39.9050 116.3914
belo Belo Horizonte  Brazil 5156217 9459 545 Belo Horizonte -19.9281 -43.9419
bogot Bogotá  Colombia 12545272 5934 2114 Bogota 4.6126 -74.0705
bueno Buenos Aires  Argentina 12806866 NA NA Buenos Aires -34.5997 -58.3819
cairo Cairo  Egypt NA NA NA Cairo 30.0561 31.2394
cheng Chengdu  China NA NA NA Chengdu 30.6600 104.0633
chenn Chennai  India NA NA NA Chennai 13.0825 80.2750
chica Chicago  United States 9618502 18640 516 Chicago 41.8373 -87.6862
chong Chongqing  China NA NA NA Chongqing 29.5628 106.5528
dalia Dalian  China NA NA NA Dalian 38.9000 121.6000
dalla Dallas  United States 7470158 22463 333 Dallas 32.7936 -96.7662
dar e Dar es Salaam  Tanzania NA NA NA Dar es Salaam -6.8000 39.2833
delhi Delhi  India 29000000 3483 8326 Delhi 28.6600 77.2300
dhaka Dhaka  Bangladesh 14543124 NA NA Dhaka 23.7289 90.3944
dongg Dongguan  China NA NA NA Dongguan 23.0475 113.7493
fosha Foshan  China NA NA NA Foshan 23.0292 113.1056
fukuo Fukuoka  Japan NA NA NA NA NA NA
guada Guadalajara  Mexico 5286642 3560 1485 Guadalajara 20.6767 -103.3475
guang Guangzhou  China NA NA NA Guangzhou 23.1288 113.2590
hangz Hangzhou  China NA NA NA Hangzhou 30.2500 120.1675
harbi Harbin  China NA NA NA Harbin 45.7500 126.6333
ho ch Ho Chi Minh City  Vietnam NA NA NA Ho Chi Minh City 10.8167 106.6333
hong Hong Kong  China NA NA NA Hong Kong 22.3050 114.1850
houst Houston  United States 6997384 21395 327 Houston 29.7863 -95.3889
hyder Hyderabad  India NA NA NA Hyderabad 17.3667 78.4667
istan Istanbul  Turkey NA NA NA Istanbul 41.0100 28.9603
jakar Jakarta  Indonesia 33430285 7063 4733 Jakarta -6.2146 106.8451
jinan Jinan  China NA NA NA Jinan 36.6667 116.9833
johan Johannesburg  South Africa NA NA NA Johannesburg -26.2044 28.0416
karac Karachi  Pakistan 16051521 3780 4246 Karachi 24.8600 67.0100
khart Khartoum  Sudan 5274321 NA NA Khartoum 15.6031 32.5265
kinsh Kinshasa  DR Congo NA NA NA Kinshasa -4.3233 15.3081
kolka Kolkata  India 14035959 1851 7583 Kolkata 22.5411 88.3378
kuala Kuala Lumpur  Malaysia 7200000 2793 2578 Kuala Lumpur 3.1478 101.6953
lagos Lagos  Nigeria 21000000 1171 17933 Lagos 6.4500 3.4000
lahor Lahore  Pakistan NA NA NA Lahore 31.5497 74.3436
lima Lima  Peru 9569468 2819 3395 Lima -12.0500 -77.0333
londo London  United Kingdom 14372596 8382 1715 London 51.5072 -0.1275
los a Los Angeles  United States 13291486 12559 1058 Los Angeles 34.1139 -118.4068
luand Luanda  Angola NA NA NA Luanda -8.8383 13.2344
madri Madrid  Spain 6641649 NA NA Madrid 40.4167 -3.7167
manil Manila  Philippines 12877253 620 20770 Manila 14.6000 120.9833
mexic Mexico City  Mexico 21804515 7866 2772 Mexico City 19.4333 -99.1333
miami Miami  United States 6158824 15890 388 Miami 25.7839 -80.2102
mosco Moscow  Russia 20004462 NA NA Moscow 55.7558 37.6178
mumba Mumbai  India 24400000 4355 5603 Mumbai 18.9667 72.8333
nagoy Nagoya  Japan 9363000 7271 1288 Nagoya 35.1167 136.9333
nanji Nanjing  China NA NA NA Nanjing 32.0500 118.7667
new y New York  United States 20140470 12093 1665 New York 40.6943 -73.9249
osaka Osaka  Japan 19303000 13228 1459 Osaka 34.7500 135.4601
paris Paris  France 12244807 NA NA Paris 48.8566 2.3522
phila Philadelphia  United States 6096120 NA NA Philadelphia 40.0077 -75.1339
pune Pune  India 7276000 7256 1003 Pune 18.5196 73.8553
qingd Qingdao  China NA NA NA Qingdao 36.1167 120.4000
rio d Rio de Janeiro  Brazil 12644321 5327 2374 Rio de Janeiro -22.9083 -43.1964
riyad Riyadh  Saudi Arabia NA NA NA Riyadh 24.6500 46.7100
saint Saint Petersburg  Russia NA NA NA Saint Petersburg 59.9500 30.3167
santi Santiago  Chile 7112808 15403 462 Santiago -33.4500 -70.6667
sao p São Paulo  Brazil 21734682 7947 2735 Sao Paulo -23.5504 -46.6339
seoul Seoul  South Korea 25514000 11704 2180 Seoul 37.5600 126.9900
shang Shanghai  China NA NA NA Shangrao 28.4419 117.9633
shang Shanghai  China NA NA NA Shanghai 31.1667 121.4667
shang Shanghai  China NA NA NA Shangqiu 34.4259 115.6467
sheny Shenyang  China NA NA NA Shenyang 41.8039 123.4258
shenz Shenzhen  China NA NA NA Shenzhen 22.5350 114.0540
singa Singapore  Singapore NA NA NA Singapore 1.3000 103.8000
surat Surat  India NA NA NA Surat 21.1700 72.8300
suzho Suzhou  China NA NA NA Suzhou 31.3040 120.6164
suzho Suzhou  China NA NA NA Suzhou 33.6333 116.9683
tehra Tehran  Iran NA NA NA Tehran 35.7000 51.4167
tianj Tianjin  China NA NA NA Tianjin 39.1467 117.2056
tokyo Tokyo  Japan 37274000 13452 2771 Tokyo 35.6897 139.6922
toron Toronto  Canada 5928040 5906 1004 Toronto 43.7417 -79.3733
washi Washington  United States 6263245 17009 368 Washington 38.9047 -77.0163
wuhan Wuhan  China NA NA NA Wuhan 30.5872 114.2881
xi’an Xi’an  China NA NA NA Xi’an 34.2667 108.9000
yango Yangon  Myanmar NA NA NA NA NA NA

There are some issues with the outcome:

  • Suzhou shows up twice, with two different sets of coordinates, but the appropriate coordinates are found online to be (31.299999, 120.599998);

  • Neither Yangon nor Fukuoka appear in the world_cities dataset, but their coordinates are found online to be (16.871311,96.199379) and (33.583332,130.399994), respectively;

  • Shanghai has been associated to three cities: Shanghai, Shangrao, and Shangqiu, each with its own coordinates. As neither Shangrao nor Shangqiu appears in the original list, they may be removed with impunity, and

  • there is no population data for Foshan, but the Wikipedia page informs us that Foshan is included in the Guangzhou urban area, so we will remove the former from the dataset.
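
Incidentally, the unmatched cities can be flagged programmatically rather than by scanning the merged table; a quick check, using the code column built above:

# Wikipedia cities whose 5-character code has no counterpart in world_cities
cities_table$city[!(cities_table$code %in% world_cities$code)]  # should flag Fukuoka and Yangon, as noted above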

complete = complete[-c(23,68,70,76),c(7,3,8,9,4,5,6)] # remove problematic rows; keep/reorder columns of interest
rownames(complete) = NULL

# add Fukuoka coordinates
complete[23,3] = 33.5833
complete[23,4] = 130.3999 

# add Yangon coordinates
complete[80,3] = 16.8713
complete[80,4] = 96.1994 

# add new factor levels for missing city names
complete$city_ascii = factor(complete$city_ascii,
                             levels=c(levels(complete$city_ascii),
                                      "Yangon","Fukuoka"))
complete[23,1] = "Fukuoka"
complete[80,1] = "Yangon"

# rename city_ascii to city
colnames(complete)[1] <- "city"
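
As an aside, hard-coded row indices such as -c(23,68,70,76) are brittle: they stop being valid as soon as either data source changes. A name-based version of the row removal (a sketch, applied to the merged data frame as it stood immediately after the merge) would be more robust:

complete <- subset(complete,
                   !(city_ascii %in% c("Shangrao", "Shangqiu")) &   # spurious matches for Shanghai
                   city != "Foshan" &                               # included in the Guangzhou urban area
                   !(city == "Suzhou" & lat > 33))                  # keep the Jiangsu Suzhou (lat ~ 31.3)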

The final dataset is thus:

city country lat lng urban.pop urban.area urban.den
Ahmedabad  India 23.0300 72.5800 6300000 NA NA
Alexandria  Egypt 31.2000 29.9167 NA NA NA
Atlanta  United States 33.7627 -84.4224 5949951 21690 274
Baghdad  Iraq 33.3500 44.4167 NA NA NA
Bangalore  India 12.9699 77.5980 NA NA NA
Bangkok  Thailand 13.7500 100.5167 16255900 7762 2094
Barcelona  Spain 41.3825 2.1769 5474482 NA NA
Beijing  China 39.9050 116.3914 NA NA NA
Belo Horizonte  Brazil -19.9281 -43.9419 5156217 9459 545
Bogota  Colombia 4.6126 -74.0705 12545272 5934 2114
Buenos Aires  Argentina -34.5997 -58.3819 12806866 NA NA
Cairo  Egypt 30.0561 31.2394 NA NA NA
Chengdu  China 30.6600 104.0633 NA NA NA
Chennai  India 13.0825 80.2750 NA NA NA
Chicago  United States 41.8373 -87.6862 9618502 18640 516
Chongqing  China 29.5628 106.5528 NA NA NA
Dalian  China 38.9000 121.6000 NA NA NA
Dallas  United States 32.7936 -96.7662 7470158 22463 333
Dar es Salaam  Tanzania -6.8000 39.2833 NA NA NA
Delhi  India 28.6600 77.2300 29000000 3483 8326
Dhaka  Bangladesh 23.7289 90.3944 14543124 NA NA
Dongguan  China 23.0475 113.7493 NA NA NA
Fukuoka  Japan 33.5833 130.3999 NA NA NA
Guadalajara  Mexico 20.6767 -103.3475 5286642 3560 1485
Guangzhou  China 23.1288 113.2590 NA NA NA
Hangzhou  China 30.2500 120.1675 NA NA NA
Harbin  China 45.7500 126.6333 NA NA NA
Ho Chi Minh City  Vietnam 10.8167 106.6333 NA NA NA
Hong Kong  China 22.3050 114.1850 NA NA NA
Houston  United States 29.7863 -95.3889 6997384 21395 327
Hyderabad  India 17.3667 78.4667 NA NA NA
Istanbul  Turkey 41.0100 28.9603 NA NA NA
Jakarta  Indonesia -6.2146 106.8451 33430285 7063 4733
Jinan  China 36.6667 116.9833 NA NA NA
Johannesburg  South Africa -26.2044 28.0416 NA NA NA
Karachi  Pakistan 24.8600 67.0100 16051521 3780 4246
Khartoum  Sudan 15.6031 32.5265 5274321 NA NA
Kinshasa  DR Congo -4.3233 15.3081 NA NA NA
Kolkata  India 22.5411 88.3378 14035959 1851 7583
Kuala Lumpur  Malaysia 3.1478 101.6953 7200000 2793 2578
Lagos  Nigeria 6.4500 3.4000 21000000 1171 17933
Lahore  Pakistan 31.5497 74.3436 NA NA NA
Lima  Peru -12.0500 -77.0333 9569468 2819 3395
London  United Kingdom 51.5072 -0.1275 14372596 8382 1715
Los Angeles  United States 34.1139 -118.4068 13291486 12559 1058
Luanda  Angola -8.8383 13.2344 NA NA NA
Madrid  Spain 40.4167 -3.7167 6641649 NA NA
Manila  Philippines 14.6000 120.9833 12877253 620 20770
Mexico City  Mexico 19.4333 -99.1333 21804515 7866 2772
Miami  United States 25.7839 -80.2102 6158824 15890 388
Moscow  Russia 55.7558 37.6178 20004462 NA NA
Mumbai  India 18.9667 72.8333 24400000 4355 5603
Nagoya  Japan 35.1167 136.9333 9363000 7271 1288
Nanjing  China 32.0500 118.7667 NA NA NA
New York  United States 40.6943 -73.9249 20140470 12093 1665
Osaka  Japan 34.7500 135.4601 19303000 13228 1459
Paris  France 48.8566 2.3522 12244807 NA NA
Philadelphia  United States 40.0077 -75.1339 6096120 NA NA
Pune  India 18.5196 73.8553 7276000 7256 1003
Qingdao  China 36.1167 120.4000 NA NA NA
Rio de Janeiro  Brazil -22.9083 -43.1964 12644321 5327 2374
Riyadh  Saudi Arabia 24.6500 46.7100 NA NA NA
Saint Petersburg  Russia 59.9500 30.3167 NA NA NA
Santiago  Chile -33.4500 -70.6667 7112808 15403 462
Sao Paulo  Brazil -23.5504 -46.6339 21734682 7947 2735
Seoul  South Korea 37.5600 126.9900 25514000 11704 2180
Shanghai  China 31.1667 121.4667 NA NA NA
Shenyang  China 41.8039 123.4258 NA NA NA
Shenzhen  China 22.5350 114.0540 NA NA NA
Singapore  Singapore 1.3000 103.8000 NA NA NA
Surat  India 21.1700 72.8300 NA NA NA
Suzhou  China 31.3040 120.6164 NA NA NA
Tehran  Iran 35.7000 51.4167 NA NA NA
Tianjin  China 39.1467 117.2056 NA NA NA
Tokyo  Japan 35.6897 139.6922 37274000 13452 2771
Toronto  Canada 43.7417 -79.3733 5928040 5906 1004
Washington  United States 38.9047 -77.0163 6263245 17009 368
Wuhan  China 30.5872 114.2881 NA NA NA
Xi’an  China 34.2667 108.9000 NA NA NA
Yangon  Myanmar 16.8713 96.1994 NA NA NA

17.4.1.1 Visualization

All the work we have done has brought the data into a format that is amenable to analysis.

As an illustration, let us plot the cities on a map of the world, using maps’s map() functionality.

par(oma=c(0,0,0,0))
par(mar=c(0,0,0,0))
maps::map("world", col = "darkgrey", lwd = .5, mar = c(0.1,0.1,0.1,0.1))
points(complete$lng, complete$lat, col = "black", cex = .8)
title("Locations of the 80 most populuous urban areas", line=1)
box()

We could also provide a bubble chart of the urban areas, labeling the most populous ones:

ggplot2::ggplot(complete,ggplot2::aes(
            x=urban.pop,
            y=urban.area,
            size=urban.den,
            colour=urban.den,
            label=city)) +
  ggplot2::geom_point() + ggplot2::theme_bw() +
  ggplot2::xlab("urban population") + 
  ggplot2::ylab("urban area (log scaled)") + 
  ggplot2::guides(colour = ggplot2::guide_legend(reverse = TRUE)) +
  ggplot2::scale_y_log10() + 
  ggplot2::geom_text(ggplot2::aes(
            label=ifelse(urban.pop>23000000,
                         as.character(city),'')), 
            size=3, hjust=1,vjust=2) + 
  ggplot2::labs(colour = "urban density", size = "urban density") +
  ggplot2::ggtitle("Relationship between population, area, and density", 
                   subtitle="80 most populous urban areas")

17.4.2 Weather Data

Simple web requests require some networking code to fetch a page and return the HTML contents.

When a user is extracting data from HTML, they typically search for certain HTML elements and read the data they contain (data could be text, tables, form field values, images, videos, etc.).

In this example, we build functions that extract the 7-day forecast for the main Canadian cities.

Preamble

The first step is to load the various Python modules that will be required.

from bs4 import BeautifulSoup       # to parse and process the data 
from urllib.request import urlopen  # to open URLs
import numpy as np                  # to do some numerics eventually
import pandas                       # to do some analysis eventually

Loading and Parsing the Data

We get a handle on the website structure by studying the page for a single location, say Ottawa, Ontario (this version of the code requires that it be run before 3PM EST; the various webpages have a different format in the evenings, unfortunately).

ottawaURL = "https://weather.gc.ca/city/pages/on-118_metric_e.html"

The page looks something like the image below.

Figure 17.12: 7-day forecast for Ottawa, ON, on a particular date. [weather.gc.ca]

Let us download the HTML and load it into Beautiful Soup, using html.parser (other parsers can also be used, depending on the kind of files with which we work).

ottawaHTML = urlopen(ottawaURL)
ottawaBS = BeautifulSoup(ottawaHTML, 'html.parser')

The soup (parsed content) is now available in ottawaBS. The data of interest is in there; we just need to pick it out of the document.

If we open the developer tools pane in our browser, we can examine the specific HTML elements that contain the numbers we want. The table with the 7-day forecast appears to correspond to a div element with class=div-table; the weather information is contained in 7 columns, each of which is a div element with class=div-column.

Figure 17.13: 7-day forecast for Ottawa, ON, on a particular date; the div element with class=div-table is highlighted in the Firefox Inspector. [weather.gc.ca]

Figure 17.14: 7-day forecast for Ottawa, ON, on a particular date; the div element with class=div-column is highlighted in the Firefox Inspector. [weather.gc.ca]

We can find it in the soup ottawaBS as follows:

sevenDaysBS = ottawaBS.find_all('div', attrs={"class" : "div-column"})

We display the HTML for the first of those columns below.

print(sevenDaysBS[0].prettify())
<div class="div-column">
 <div class="div-row div-row1 div-row-head">
  <a href="/forecast/hourly/on-118_metric_e.html">
   <strong title="Friday">
    Fri
   </strong>
   <br/>
   7
   <abbr title="October">
    Oct
   </abbr>
  </a>
 </div>
 <a class="linkdate" href="/forecast/hourly/on-118_metric_e.html">
  <div class="div-row div-row2 div-row-data">
   <img alt="Chance of showers" class="center-block" height="51" src="/weathericons/12.gif" width="60"/>
   <p class="mrgn-bttm-0">
    <span class="high wxo-metric-hide" title="max">
     10°
     <abbr title="Celsius">
      C
     </abbr>
     <span class="abnTrend">
      *
     </span>
    </span>
    <span class="high wxo-imperial-hide wxo-city-hidden" title="max">
     50°
     <abbr title="Fahrenheit">
      F
     </abbr>
     <span class="abnTrend">
      *
     </span>
    </span>
   </p>
   <p class="mrgn-bttm-0 pop text-center" title="Chance of Precipitation">
    <small>
     60%
    </small>
   </p>
   <p class="mrgn-bttm-0">
    Chance of showers
   </p>
  </div>
 </a>
 <div class="div-row div-row3 div-row-head">
  Tonight
 </div>
 <div class="div-row div-row4 div-row-data">
  <img alt="Mainly cloudy" class="center-block" height="51" src="/weathericons/33.gif" width="60"/>
  <p class="mrgn-bttm-0">
   <span class="low wxo-metric-hide" title="min">
    0°
    <abbr title="Celsius">
     C
    </abbr>
   </span>
   <span class="low wxo-imperial-hide wxo-city-hidden" title="min">
    32°
    <abbr title="Fahrenheit">
     F
    </abbr>
   </span>
  </p>
  <p class="mrgn-bttm-0 pop text-center">
  </p>
  <p class="mrgn-bttm-0">
   Mainly cloudy
  </p>
 </div>
</div>

In each of the columns, the first row contains the date, and the second the maximum forecast temperature during the day (before 3PM).

Figure 17.15: 7-day forecast for Ottawa, ON, on a particular date; a div element with class=div-row1 is highlighted in the Firefox Inspector. [weather.gc.ca]

Figure 17.16: 7-day forecast for Ottawa, ON, on a particular date; a div element with class=div-row2 is highlighted in the Firefox Inspector. [weather.gc.ca]

We can extract the strings in each of the first two cells of the first column using the .strings attribute, as below:

list(sevenDaysBS[0].find(class_="div-row div-row1 div-row-head").strings) # date
list(sevenDaysBS[0].find(class_="high wxo-metric-hide").strings) # temp
['Fri', '7\xa0', 'Oct']
['10°', 'C', '*']

The lists contain the information of interest, together with additional characters; for both variables, we join the list elements into a single string and remove the unwanted characters (°C, \xa0, *), using the .replace() method:

' '.join(list(sevenDaysBS[0].find(class_="div-row div-row1 div-row-head").strings)).replace("\xa0","").replace("*","")
''.join(list(sevenDaysBS[0].find(class_="high wxo-metric-hide").strings)).replace("°C","").replace("*","")
'Fri 7 Oct'
'10'

Based on this work, we now write functions that extract a 7-day forecast, the corresponding dates, the city name, and the province code given a URL of the right format. Additional cleaning is required (see the various .replace() calls).

def sevenDayForecast(url):
    html = urlopen(url)
    htmlBS = BeautifulSoup(html, 'html.parser')
    sevenDaysBS = htmlBS.find_all('div', attrs={"class" : "div-column"}) 
    temp_degree = []
    for day in sevenDaysBS:
        temp_de = int(
          ''.join(list(day.find(class_="high wxo-metric-hide").strings)).replace("°C","").replace("*","")
                  )
        temp_degree.append(temp_de)
    return temp_degree

def sevenDayForecastDates(url):
    html = urlopen(url)
    htmlBS = BeautifulSoup(html, 'html.parser')
    sevenDaysBS = htmlBS.find_all('div', attrs={"class" : "div-column"}) 
    temp_date = []
    for day in sevenDaysBS:
        temp_da = ' '.join(list(day.find(class_="div-row div-row1 div-row-head").strings)).replace("\xa0","").replace("\n ","").replace(" \n","").replace("*","")
        temp_date.append(temp_da)
    return temp_date

def cityName(url):
    html = urlopen(url)
    htmlBS = BeautifulSoup(html, 'html.parser')
    nameBS = htmlBS.find('h1', attrs={"property" : "name"}) 
    city_name = list(nameBS.strings)[0].replace(" \n","").replace(", ","")
    return city_name

def provinceCode(url):
    html = urlopen(url)
    htmlBS = BeautifulSoup(html, 'html.parser')
    nameBS = htmlBS.find('h1', attrs={"property" : "name"}) 
    province_code = list(nameBS.strings)[1]
    return province_code

Take the time to verify that you understand what the functions are meant to do. We can validate them on the Ottawa URL (this will only work if the code is run before 3PM EST, as the format of the webpage changes after that time).

sevenDayForecast(ottawaURL)
sevenDayForecastDates(ottawaURL)
cityName(ottawaURL)
provinceCode(ottawaURL)
[10, 11, 13, 12, 14, 17, 15]
['Fri 7 Oct', 'Sat 8 Oct', 'Sun 9 Oct', 'Mon 10 Oct', 'Tue 11 Oct', 'Wed 12 Oct', 'Thu 13 Oct']
'Ottawa (Kanata - Orléans)'
'ON'
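
Note that each of these helper functions downloads and parses the page anew, so gathering all four pieces of information for a single city triggers several HTTP requests. Since we are about to query 20 cities, it may be preferable to fetch and parse each page only once; a sketch that combines the same extraction steps (it is not used in what follows):

def citySummary(url):
    '''fetch and parse the page once, then extract the 7-day forecast,
       the corresponding dates, the city name, and the province code'''
    htmlBS = BeautifulSoup(urlopen(url), 'html.parser')
    days = htmlBS.find_all('div', attrs={"class" : "div-column"})
    temps = [int(''.join(list(day.find(class_="high wxo-metric-hide").strings)).replace("°C","").replace("*",""))
             for day in days]
    dates = [' '.join(list(day.find(class_="div-row div-row1 div-row-head").strings)).replace("\xa0","").replace("\n ","").replace(" \n","").replace("*","")
             for day in days]
    nameBS = htmlBS.find('h1', attrs={"property" : "name"})
    city = list(nameBS.strings)[0].replace(" \n","").replace(", ","")
    province = list(nameBS.strings)[1]
    return temps, dates, city, province

# e.g.: temps, dates, city, province = citySummary(ottawaURL)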

Data Processing

We now prepare the data for analysis.

We will select the 20 Canadian cities that appear on the website’s main page.

For each of these cities, we will extract the 7-day forecast, and display today’s “prediction”, tomorrow’s prediction, the change between today’s prediction and the prediction one week from today, and the mean prediction over the 7-day forecast.

This could be done manually by feeding each city’s URL to the 4 functions defined above (as we did for Ottawa), but we will use Beautiful Soup to scrape the information automatically (and cleanly).

We will be including only minimal comments for this section. Can you make out what is happening at every step? Don’t hesitate to visit the corresponding web pages if the context is not clear from the code and the output.

We start by finding the URL for each of the cities on the main page.

wURL = "https://weather.gc.ca/canada_e.html"
wHTML = urlopen(wURL) 
wBS = BeautifulSoup(wHTML, 'html.parser')
tableBS = wBS.find('table', attrs={"class" : "table table-hover table-striped table-condensed"}) 
citiesBS = tableBS.find_all('a', href=True)

citiesFURLs = []
for a in citiesBS:
    temp = a['href']
    citiesFURLs.append(temp)

citiesURLs = ["https://weather.gc.ca" + citiesFURLs[index] for index in range(len(citiesFURLs))]

Next, we build a list of dictionaries containing the desired data, one entry per city:

today_date = sevenDayForecastDates(citiesURLs[0])[0]

row_dict = []
for row in range(len(citiesURLs)):
    d = dict()
    tmp = sevenDayForecast(citiesURLs[row])
    d['city'] = cityName(citiesURLs[row])
    d['province'] = provinceCode(citiesURLs[row])
    d['date'] = today_date
    d['today'] = tmp[0]
    d['tomorrow'] = tmp[1]
    d['1 week change'] = np.subtract(tmp[6],tmp[0])
    d['weekly mean'] = np.mean(tmp)
    row_dict.append(d)

Finally, we convert the list of dictionaries into a pandas data frame:

wDF = pandas.DataFrame(row_dict)
wDF
                         city province  ... 1 week change  weekly mean
0                     Calgary       AB  ...             6    20.000000
1               Charlottetown       PE  ...            -3    14.142857
2                    Edmonton       AB  ...             0    20.285714
3                 Fredericton       NB  ...            -4    15.571429
4                     Halifax       NS  ...            -2    15.285714
5                     Iqaluit       NU  ...             2    -1.142857
6                    Montréal       QC  ...             0    14.571429
7   Ottawa (Kanata - Orléans)       ON  ...             5    13.142857
8               Prince George       BC  ...            -2    16.142857
9                      Québec       QC  ...            -4    12.714286
10                     Regina       SK  ...             3    18.428571
11                  Saskatoon       SK  ...             0    19.000000
12                 St. John's       NL  ...            -2    12.714286
13                Thunder Bay       ON  ...             2    11.857143
14                    Toronto       ON  ...             4    14.857143
15                  Vancouver       BC  ...            -2    18.571429
16                   Victoria       BC  ...            -2    20.285714
17                 Whitehorse       YT  ...           -13    10.571429
18                   Winnipeg       MB  ...             3    14.285714
19                Yellowknife       NT  ...            -9     7.571429

[20 rows x 7 columns]

Visualization

As a last exercise, we will provide a basic visualization for the dataset.

import matplotlib.pyplot as plt   # to display the plot
import seaborn as sns
sns.set(style='whitegrid')

We use seaborn’s pairplot() to produce the scatterplot matrix of the data, and matplotlib to display it.

cols = ['today', 'tomorrow', '1 week change', 'weekly mean']
sns.pairplot(data=wDF[cols], height=2.5)   # `height` replaces the older `size` argument
plt.show()

Figure 17.17: Scatterplot matrix of today’s prediction, tomorrow’s prediction, the 1-week change, and the weekly mean for the 20 cities.

Perhaps unsurprisingly, there does not seem to be much insight available in this dataset. If there is an association between tomorrow’s prediction and the prediction one week from now, we would need more information to explore it, perhaps data collected on a daily basis (in which case, it would be useful to save the data).
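
If we did decide to collect the forecasts daily, the scraped table could simply be appended to a file at the end of each run and analyzed once enough days have accumulated; a minimal sketch (the file name is an arbitrary choice):

import os

outfile = 'Data/weather_forecasts.csv'
wDF.to_csv(outfile, mode='a', index=False,
           header=not os.path.exists(outfile))   # write the header only on the first run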

That is an important point to keep in mind: the process is sometimes long and complicated, but that does not always translate into insight at the end of the day (unless the absence of an apparent link is insight… which it could very well be).

17.4.3 CFL Play-by-Play

The goal in this example is to obtain structured play-by-play data for past CFL (Canadian Football League) games. We could use this information to ask questions such as:

  • how often does a team convert on 3rd and X?

  • how often has a team come back from a 7+ pt deficit in the 4th quarter?

  • etc.

Preamble

Before you start, make sure that the Beautiful Soup, Selenium, and pandas modules are installed in your Python environment, and that Firefox and geckodriver are installed on your system.

You can use the commands below to install the Python modules.

  • pip3 install beautifulsoup4

  • pip3 install pandas

  • pip3 install selenium

  • etc.

Download information for Firefox and geckodriver is available on their respective project pages; other browsers have their own installation instructions and drivers.

We will use the following Python modules for pulling data out of HTML and XML files (BeautifulSoup), dealing with potentially dynamic websites (Selenium), opening URLs (urllib.request), and other routine tasks.

from bs4 import BeautifulSoup          
from pyvirtualdisplay import Display
from selenium import webdriver          
from urllib.request import urlopen     
import csv
import pandas
import time
import warnings; warnings.filterwarnings('ignore')

17.4.3.1 Game Schedule

Let us start by getting a list of all games in a season; we will switch to processing data on a game-by-game basis at a later stage. All games in a season (2016, say) are listed at a single URL in the following format.

year = 2016
scheduleURL = 'https://www.cfl.ca/schedule/?season={}'.format(year)

This produces the following URL:

scheduleURL
'https://www.cfl.ca/schedule/?season=2016'

Now we open the schedule page and parse it with BeautifulSoup:

scheduleHTML = urlopen(scheduleURL)
scheduleBS = BeautifulSoup(scheduleHTML, 'html.parser')

We could display the HTML code with:

scheduleBS

Warning: the HTML file contains a lot of information, so the display has been suppressed. For comparison’s sake, this is what the page looks like:

We could sift through the HTML to try to find what each piece of code corresponds to on the page, but that’s not the most efficient approach to use.

Instead, we use the Developer Tools to get a better idea. In the example below, let’s say we are interested in the second pre-season game (a 25-16 victory by the Argonauts over the Tiger-Cats).

Right-click on the box containing the game information, and select “Inspect Element (Q)” from the menu that appears. In the Developer Tools, you will be taken to the section of HTML code corresponding to the element you selected (you might need to try right-clicking over a few locations, as there are sub-elements in the game box).

Each game is represented by a row. According to developer tools, these rows are div elements with the class heading collapsible-header.

Scrolling down on the schedule page, it appears as though every game is presented in the same format, so it is worth a shot to ask Beautiful Soup to find all rows that contain the class heading collapsible-header.

scheduleRows = scheduleBS.findAll(class_='heading collapsible-header') 
# note the "_" after class, and the single quotes

Here’s a better view of a single row, with some parts omitted:

We want the URL that the “GAMETRACKER” button links to – this is the game page that contains the play-by-play info. The link is found in the data-url attribute (rather than in the href attribute).

We can get the link for the 2nd pre-season game by querying scheduleRows[1] (recall that list indexing starts with 0 in Python).

row = scheduleRows[1]
button = row.find(class_='gametracker')
button['data-url']
'https://www.cfl.ca/games/2268/hamilton-tiger-cats-vs-toronto-argonauts/'

These are all the steps we need to get the list of game page URLs for an entire season. We might also want to store these URLs in a Python list and save them to a file. This can be done as follows.

urls = []

for row in scheduleRows:
    button = row.find(class_='gametracker')
    url = button['data-url']
    urls.append(url)

# print(urls) # uncomment to display the URLs   
df = pandas.DataFrame(urls)
df.to_csv(path_or_buf='Data/CFL_Schedule_2016.csv', header=False)

Incidentally, how many games were played in total in 2016, including the pre-season and the playoffs? That’s easy to answer:

len(urls)
95

Scraping Game Data

Here is a URL for one particular game.

gameURL = 'https://www.cfl.ca/games/2391/ottawa-redblacks-vs-toronto-argonauts'

The screenshot below shows the page after clicking the “PLAY BY PLAY” button.

This page actually only loads the play-by-play data when the “PLAY BY PLAY” button is pressed. If we download the HTML before pressing the button, the data just isn’t there.

gameBS = BeautifulSoup(urlopen(gameURL))
gameBS.text.count('Kickoff')
0

The page does contain JavaScript that tells the browser to fetch more data when the button is clicked and add it to the page. The most straightforward way to get this data is to run a browser but control it automatically. All we need is a way to identify the button to press.

Luckily the button has a (unique) id (see above), so we can use that. We define an XPATH string for that id.

pbp_btn_xpath = '//*[@id="playbyplay-tab"]'

For browser automation we’ll use Firefox with Selenium (make sure that geckodriver is installed, or whatever the appropriate driver is for your browser).

In the next block, we create a driver object (Selenium, controlling Firefox) and tell it to load the page, click the button, and then return the HTML. The value of executable_path will vary depending on your system.

display = Display(visible=0, size=(1440, 1080))
display.start()
driver = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver')

# Open the page
driver.get(gameURL)

# Wait for loading
time.sleep(5) 
# not a robots.txt courtesy delay, but a pause to make sure the content has actually loaded

# Click button to get play-by-play data
playbyplay_btn = driver.find_element_by_xpath(pbp_btn_xpath)
playbyplay_btn.click()

# Wait again for loading
time.sleep(5)

# Take HTML and save in BS object
soup = BeautifulSoup(driver.page_source)
driver.close()

Now that we have the HTML of the loaded page, we can extract data as usual with Beautiful Soup, such as finding the home and away teams, and so on.

# away, home
[soup.find(class_='js-data-team_2_location').text,
soup.find(class_='js-data-team_1_location').text]
['Ottawa', 'Toronto']
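
The same click-and-wait recipe can be looped over the game URLs collected earlier, saving each rendered page to disk for later parsing; a sketch that reuses urls, pbp_btn_xpath, and the driver setup from above (the output folder and the delays are arbitrary choices, and error handling for games without play-by-play data is omitted):

import os

os.makedirs('Data/CFL_games_2016', exist_ok=True)

driver = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver')
for i, url in enumerate(urls):
    driver.get(url)                                       # load the game page
    time.sleep(5)                                         # wait for the page to render
    driver.find_element_by_xpath(pbp_btn_xpath).click()   # load the play-by-play data
    time.sleep(5)
    with open('Data/CFL_games_2016/game_{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(driver.page_source)                       # keep the rendered HTML
driver.close()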

17.4.4 Bad HTML

If you write an R/Python program with incorrect syntax, you’ll get an error and your program won’t work. If you write an HTML page with incorrect syntax, there’s a good chance that browsers will be able to make sense of it anyway. Browsers do this by guessing ways to correct each error.

You can check whether a webpage on the internet uses correct syntax or not by entering the URL at https://validator.w3.org.

If what we see in our browsers is a fixed-up version of the HTML, then when we parse HTML with Python we’d like to be able to get a similarly fixed-up version. Let’s look at some simple examples of how Beautiful Soup handles bad HTML.

from bs4 import BeautifulSoup

First we pass Beautiful Soup a proper (albeit incomplete) HTML document:

goodBS = BeautifulSoup(
    '<html><head><title>blah</title></head><body></body></html>',
    'html.parser')

As expected, we can operate with the parsed document, such as finding elements and getting their data.

goodBS.find('title').text
'blah'

Now what if we omit the closing </title> tag? We print the corrected version that Beautiful Soup builds.

badBS = BeautifulSoup(
    '<html><head><title>blah</head><body></body></html>',
    'html.parser')
print(badBS)
<html><head><title>blah</title></head><body></body></html>

You can see that the missing closing tag has been restored. Similar behaviour is seen in the following examples, where tags are misplaced or omitted. Note that although <li> (list item) tags are supposed to be placed inside a list tag such as <ul> (unordered list) or <ol> (ordered list), Beautiful Soup doesn’t add those tags.

badBS = BeautifulSoup(
    '<html><head></head><body><li><em>hi</body></em></html>',
    'html.parser')
print(badBS.prettify())
<html>
 <head>
 </head>
 <body>
  <li>
   <em>
    hi
   </em>
  </li>
 </body>
</html>
badBS = BeautifulSoup(
    '<html><head></head><body><li><em>hi<li></body></em></html>',
    'html.parser')
print(badBS.prettify())
<html>
 <head>
 </head>
 <body>
  <li>
   <em>
    hi
    <li>
    </li>
   </em>
  </li>
 </body>
</html>

In general, if the browser can do a good enough job to render an HTML page as intended, we can trust Beautiful Soup to fix things up in a logical manner. But when we automate the data collection process, we do not usually visit each page before it is scraped; there might be surprises in store!

17.4.5 Extracting Text from a PDF File

Apache Tika can be used to convert PDF files to TXT files, but a few R libraries can also do so (and are potentially easier to use, depending on the document’s structure).

We use the pdftools library to extract text from the DAL Data Visualization Learning Map.

library(pdftools)
Using poppler version 22.02.0
DAL <- pdf_text("https://www.data-action-lab.com/wp-content/uploads/2020/01/Learning-Map-Data-Visualization-ACFO.pdf")
length(DAL)
[1] 7
N <- 1:length(DAL)

What does the first page look like, say?

DAL[1]
[1] "                                                         DATA\n     DATA EXPLORATION AND DATA VISUALIZATION\n                                                       ACTION\n                                                          LAB\n\n\n\n\n                                          Data Analytics\n                                        Conference – ACFO\n                                        Data Exploration and Data\n                                              Visualization\n\n                                                 Learning Map\n\n\nab\n\n\n\n\n                                          Provided by the Data Action Lab\n     Canada School of Public Service   Digital Academy\n"

The text contains a fair amount of “noise” (we will revisit text cleaning in detail in Text Analysis and Text Mining).

library(stringr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
removeDiacritics <- function(string) {
  chartr(
    "SZszYÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ",
    "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy",
    string
  )
}

DAL.clean=c()

for(i in 1:length(DAL)){
  DAL.clean[i] <- DAL[i] %>%
    { gsub("\n", " ", .) } %>%
    { gsub("\\d", " ", .) } %>%
    { gsub("\\s+", " ", .) } %>%  
    trimws() 
}
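
Note that removeDiacritics() is defined above but is not used in the cleaning loop; since chartr() is vectorized, it can simply be applied to the cleaned pages if accent-free text is preferred:

DAL.clean <- removeDiacritics(DAL.clean)   # e.g., "Montréal" becomes "Montreal"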

We might choose to only keep those pages with at least 1000 characters, say:

index <- nchar(DAL.clean)>1000
DAL.clean <- DAL.clean[index]
length(DAL.clean)
[1] 6

Let us take a look at one of the pages:

DAL.clean[6]
[1] "DAL instructors have consulted for DATA EXPLORATION (and taught to participants from) a AND DATA VISUALIZATION variety of groups, a selection of which is shown below: § Canada Revenue Agency § Canada School of Public Service’s Digital Academy § Canadian Air Transport Security Authority § Canadian Coast Guard § Canadian Food Inspection Agency § Canadian Institute for Health Information § The Children’s Hospital of Eastern Ontario § Communications Research Centre Canada § Department of National Defence § Environment and Climate Change Canada § Fisheries and Ocean Canada § Health Canada § Immigration, Refugees and Citizenship Canada § Indigenous and Northern Affairs Canada § Natural Resources Canada § Nuclear Waste Management Organization § Office of the Privacy Commissioner of Canada § Privy Council Office § Public Services and Procurement Canada § Royal Canadian Mounted Police § Transport Canada § Treasury Board Secretariat Consult our Data Training Catalogues for a list of practical data analysis and data leadership courses. Visit data-action-lab.com or contact info@data-action-lab.com for more information. DATA ACTION LAB | info@data-action-lab.com"

Not too shabby.

17.4.6 YouTube Titles

In this example, we will see how to use the YouTube API to scrape the titles of YouTube videos.

from apiclient.discovery import build
from apiclient.errors import HttpError
from oauth2client.tools import argparser

# some of these modules will only be useful for the exercises
import pandas
from functional import seq
import codecs
import glob
import html
import os
import re
import unidecode
import urllib
import urllib.request
import warnings; warnings.filterwarnings('ignore')

Authentication

The task is to build or add to a corpus of text by fetching video transcripts from YouTube. To use the YouTube API, we need to authenticate ourselves. Create a config.json file in the main directory that looks like this:

{ "DEVELOPER_KEY": "your_key_here" }

Instructions on how to obtain a key are provided in the YouTube Data API documentation.

Once that is done, we create an object youtube.

config = seq.json('config.json').dict()

DEVELOPER_KEY = config['DEVELOPER_KEY']
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

youtube = build(YOUTUBE_API_SERVICE_NAME,
                YOUTUBE_API_VERSION,
                developerKey=DEVELOPER_KEY)

YouTube API

The object youtube gives us access to the YouTube API methods (more information is available at https://en.wikipedia.org/wiki/YouTube_API, https://www.youtube.com/yt/dev/, and https://www.youtube.com/yt/dev/api-resources.html).

We could hand-pick videos to read, but we’ll take a shortcut by processing a whole playlist of videos at once.

The first task is taking a playlist ID and using the API to get the video IDs of each entry in the playlist. The API for getting entries in a playlist is paginated. This means that we have to make one request for the first chunk of entries, then make another request to get some more entries, and so on until we’ve got the entire playlist.

It’s designed this way so that we don’t download more than we need; for example if we were building an infinite scrolling menu, we wouldn’t want to load everything up front.

After we’ve got the first chunk (in this example, 10 videos at a time), we need to tell the API where to start the next chunk. This is done using a page token.

We simply take the nextPageToken of the response we get, and pass it to the API for the next request, until the API returns no nextPageToken value.

def fetch_playlist_videos(playlistId):
    '''
    get all videos in a playlist.
    Returns: list of dictionaries representing playlistItem resources,
    see https://developers.google.com/youtube/v3/docs/playlistItems#resource-representation
    for the structure of this resource
    '''
    
    # API method: https://developers.google.com/youtube/v3/docs/playlistItems/list
    res = youtube.playlistItems().list(
        part="snippet",
        playlistId=playlistId,
        maxResults="10").execute()
    
    nextPageToken = res.get('nextPageToken')
    while ('nextPageToken' in res):
        nextPage = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlistId,
            maxResults="10",
            pageToken=nextPageToken).execute()
        res['items'] = res['items'] + nextPage['items']
        
        if 'nextPageToken' not in nextPage:
            res.pop('nextPageToken', None)
        else:
            nextPageToken = nextPage['nextPageToken']
        
    return res['items']

Playlist Extraction

The playlist entries come in the form of playlistItem resource dictionaries. The playlistItem resource is a data format defined in the API documentation that contains fields for all the information associated with an item in a playlist. In the Python API, the object is a nested dictionary. We want to get to the video ID, and we’ll do that for all playlist items.

First, take some time to explore the following YouTube playlist: https://www.youtube.com/playlist?list=PLbVTnkp2K536WxfoqSvoY08aJ3sLBg9mI

# some playlists with English transcripts available
IQCPlaylist = ['PLbVTnkp2K536WxfoqSvoY08aJ3sLBg9mI']

videos = []
for playlistID in IQCPlaylist:
    videos += fetch_playlist_videos(playlistID)

Let’s explore that a bit by looking at the 3rd video in the playlist, say.

print(videos[2])
{'kind': 'youtube#playlistItem', 'etag': 'LYF0Osi6rdPvtzVAqAY-TobYZnE', 'id': 'UExiVlRua3AySzUzNld4Zm9xU3ZvWTA4YUozc0xCZzltSS4xMkVGQjNCMUM1N0RFNEUx', 'snippet': {'publishedAt': '2020-06-20T21:18:23Z', 'channelId': 'UCIi6fq-A7sTT4iBDQUKekyg', 'title': 'IQC - 1.3 - Guiding Principles (10:46)', 'description': '1.3.1 Best Practices\n1.3.2 The Good, the Bad, and the Ugly', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/eodNQzJFJpg/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/eodNQzJFJpg/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/eodNQzJFJpg/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {'url': 'https://i.ytimg.com/vi/eodNQzJFJpg/sddefault.jpg', 'width': 640, 'height': 480}, 'maxres': {'url': 'https://i.ytimg.com/vi/eodNQzJFJpg/maxresdefault.jpg', 'width': 1280, 'height': 720}}, 'channelTitle': 'Patrick Boily', 'playlistId': 'PLbVTnkp2K536WxfoqSvoY08aJ3sLBg9mI', 'position': 2, 'resourceId': {'kind': 'youtube#video', 'videoId': 'eodNQzJFJpg'}, 'videoOwnerChannelTitle': 'Patrick Boily', 'videoOwnerChannelId': 'UCIi6fq-A7sTT4iBDQUKekyg'}}

We get the list of all video IDs and their titles as follows:

videoIDs = [ video['snippet']['resourceId']['videoId'] for video in videos ]
videotitles = [ video['snippet']['title'] for video in videos ]
print(videoIDs)
['-dZImvCSPKI', '0vBXkgiJIP8', 'eodNQzJFJpg', 'IiQJ1G4QJWg', 'ycBovk3EtfQ', 'RErsLHdKFSM', '5eu_FoJu7uo', 'HUzosM19QCs', 'n4Z3SgEJ4bg', 'P-jkx_XdJlw', 'LFS6RbpzLSw', 'OhrtH6sGbtA', 'bIO4JmGVf_k', 'LUU_UKk2YyQ', 'dgtapT4n484', '1cRmNcT1pvo', 'Ga6VEPk_HfY', 'Q2o8bIV6328', '-ZLuiE0j8Ts', 'WqTH3OvPKxQ', '_9eUuc_-z9s', 'ITGBjuOwY4w', 'cQvCq1_Eoms', 'erP8Xc0hO0U', 'mb7p4B2spP0', 'KG4SBzXccEk', 'A9Wh4L7ZJr0', 'CL_cVCZ5l7Q', 'yxP4Nz09rSE']

We put this information into a list of dictionaries:

yt_dict = []
for row in range(len(videoIDs)):
    d = dict()
    d['youtubeURL'] =  'https://youtu.be/{}'.format(videoIDs[row])
    d['title'] = videotitles[row]
    yt_dict.append(d)

and convert it into a pandas data frame:

ytDF = pandas.DataFrame(yt_dict)
ytDF
                      youtubeURL                                              title
0   https://youtu.be/-dZImvCSPKI       IQC - 1.1 - The Consulting Framework (12:08)
1   https://youtu.be/0vBXkgiJIP8         IQC - 1.2 - Ethical Considerations (16:01)
2   https://youtu.be/eodNQzJFJpg             IQC - 1.3 - Guiding Principles (10:46)
3   https://youtu.be/IiQJ1G4QJWg     IQC - 1.4 - Asking the Right Questions (04:45)
4   https://youtu.be/ycBovk3EtfQ          IQC - 1.5 - The Structure of Data (19:07)
5   https://youtu.be/RErsLHdKFSM  IQC - 1.6 - Quantitative Consulting Workflows ...
6   https://youtu.be/5eu_FoJu7uo     IQC - 1.7 - Roles and Responsibilities (14:24)
7   https://youtu.be/HUzosM19QCs         IQC - 1.8 - Consulting Cheat Sheet (09:05)
8   https://youtu.be/n4Z3SgEJ4bg      IQC - 2.0 - The Consulting Life Cycle (05:27)
9   https://youtu.be/P-jkx_XdJlw                      IQC - 2.1 - Marketing (23:37)
10  https://youtu.be/LFS6RbpzLSw                IQC - 2.2 - Initial Contact (02:34)
11  https://youtu.be/OhrtH6sGbtA                       IQC - 2.3 - Meetings (07:40)
12  https://youtu.be/bIO4JmGVf_k            IQC - 2.4 - Assembling the Team (02:20)
13  https://youtu.be/LUU_UKk2YyQ             IQC - 2.3 (Reprise) - Meetings (01:57)
14  https://youtu.be/dgtapT4n484  IQC - 2.5 - Proposal and Project Planning (16:24)
15  https://youtu.be/1cRmNcT1pvo             IQC - 2.6 - Contracting and IP (05:17)
16  https://youtu.be/Ga6VEPk_HfY  IQC - 2.5 (Reprise) - Proposal and Project Pla...
17  https://youtu.be/Q2o8bIV6328          IQC - 2.7 - Information Gathering (09:40)
18  https://youtu.be/-ZLuiE0j8Ts                       IQC - 2.8 - Analysis (03:16)
19  https://youtu.be/WqTH3OvPKxQ       IQC - 2.9 - Interpreting the Results (02:50)
20  https://youtu.be/_9eUuc_-z9s    IQC - 2.10 - Reporting and Deliverables (11:54)
21  https://youtu.be/ITGBjuOwY4w                     IQC - 2.11 - Invoicing (07:25)
22  https://youtu.be/cQvCq1_Eoms              IQC - 2.12 - Closing the File (03:43)
23  https://youtu.be/erP8Xc0hO0U  IQC - 3.1 - Lessons Learned: About Clients (19...
24  https://youtu.be/mb7p4B2spP0  IQC - 3.2 - Lessons Learned: About Consultants...
25  https://youtu.be/KG4SBzXccEk  IQC - 4.1 - The Basics of Business Development...
26  https://youtu.be/A9Wh4L7ZJr0            IQC - 4.2 - Clients and Choices (04:00)
27  https://youtu.be/CL_cVCZ5l7Q                 IQC - 4.3 - Building Trust (10:33)
28  https://youtu.be/yxP4Nz09rSE                IQC - 4.4 - Improving Trust (09:04)
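
The table can also be saved for later use, for instance as the starting point of the transcript-fetching task mentioned at the beginning of this example (the file path is an arbitrary choice):

ytDF.to_csv('Data/IQC_playlist_titles.csv', index=False)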

References

[352]
S. Munzert, C. Rubba, P. Meißner, and D. Nyhuis, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, 2nd ed. Wiley Publishing, 2015.