17.5 Exercises

In these exercises, use R’s rvest, Python’s Beautiful Soup, or any other tool that allows you to complete the task (whether or not it was discussed in the main text). You may need to look up tutorials and examples, and to consult documentation, Stack Overflow, and so on.

  1. Web data is available from a variety of sources, in a variety of formats and languages. Your job is to build a collection of five text corpora, each consisting of documents written in a different language (English, French, Spanish, Italian, and Other). The text documents will be collected from the Canadian government’s press releases, from Wikipedia, from Twitter, from a PDF document, and from other sources. Your final dataset will consist of all of the observations (text) placed in rows, each row associated with a specific language code (“Eng”, “Fra”, “Esp”, “Ita”, “Oth”).

    1. English: the text of all Canadian government press releases published in 2020.

    2. French: the text from the (French) Wikipedia entries of all French actresses whose last name starts with “L”.

    3. Spanish: 700 tweets (total) from @realmadrid, @PaulinaRubio, @Armada_esp, and two other Twitter accounts of your choice.

    4. Italian: the text from Giovannino Guareschi’s Tutto don Camillo (I racconti del Mondo piccolo) – Volume 1 di 5 (PDF), 1 page per row.

    5. Other: 500 text documents in other languages that use a Latin-based alphabet.
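
For the PDF portion, note that pdftools::pdf_text() returns one character string per page, which maps directly onto the one-page-per-row requirement. A minimal sketch (the file name is a placeholder for a local copy of the PDF):

# one row per PDF page, tagged with the Italian language code
# (the file name below is a placeholder for a local copy of the document)
pages <- pdftools::pdf_text("tutto_don_camillo_vol1.pdf")
italian_corpus <- data.frame(text = pages, language = "Ita")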

  2. Build a scraper that automatically collects a multiple-day forecast for every Canadian city in the database (not only the cities listed on the landing page), regardless of the time at which the scraping takes place.
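
One possible starting point, using rvest, is to harvest the city links from each province’s index page and then visit every city page in turn. The URL patterns and the CSS selector below are assumptions to check against the live Environment Canada site:

library(rvest)

# Province index pages list every city in the database; the URL pattern
# is an assumption to verify against the live site.
provinces <- c("AB", "BC", "MB", "NB", "NL", "NS", "NT",
               "NU", "ON", "PE", "QC", "SK", "YT")
city_links <- unlist(lapply(provinces, function(pr) {
  idx <- read_html(paste0("https://weather.gc.ca/forecast/canada/index_e.html?id=", pr))
  hrefs <- html_attr(html_elements(idx, "a"), "href")
  hrefs[grepl("^/city/pages/", hrefs)]
}))

# visit each city page and pull out the forecast text
forecasts <- lapply(unique(city_links), function(path) {
  page <- read_html(paste0("https://weather.gc.ca", path))
  html_text2(html_elements(page, "div.div-table"))  # forecast selector: a guess
})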

  3. Consider the parsed_doc object from the XPath section. What do you think the following blocks of code do?

lowerCaseFun <- function(x) {
  x <- tolower(XML::xmlValue(x))
  return(x)
}

XML::xpathSApply(parsed_doc, "//div//i", fun = lowerCaseFun)

dateFun <- function(x) {
  require(stringr)
  date <- XML::xmlGetAttr(node = x, name = "date")
  year <- str_extract(date, "[0-9]{4}")
  return(year)
}

XML::xpathSApply(parsed_doc, "//div", dateFun)

  4. In the CFL example, the play-by-play data is stored in a separate table for each quarter. Write a routine that grabs the information and produces a pandas DataFrame for each quarter, with the following headers: ID, away, details, down, home, quarter, time, type, and yard.
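
The exercise asks for pandas, where pandas.read_html() returns one DataFrame per <table> element on the page; the equivalent first step in R is a one-liner with rvest. A sketch, with game_url standing in for the play-by-play page from the example:

library(rvest)

# game_url is a placeholder for the CFL play-by-play page from the example
game <- read_html(game_url)
tables <- html_table(game)  # one data frame per <table>, i.e., one per quarter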

  5. Use Zomato to find which Canadian city has the best sushi restaurants.

  6. Modify the YouTube example in order to extract the videos’ captions. Clean them using Beautiful Soup.

  7. Use twitteR (or other packages) to build a data frame of tweets related to the Marvel Cinematic Universe. Do your tweets mostly originate from Android or from iPhone devices? Plot the frequency of tweets over time. Do the same for retweets. Do any patterns emerge?
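
A sketch of the collection step with twitteR (the package targets Twitter’s older REST API, so valid developer credentials are required and access may have changed; the four key variables are placeholders):

library(twitteR)

# consumer_key, consumer_secret, access_token, access_secret are
# placeholders for your own Twitter developer credentials
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

mcu <- searchTwitter("Marvel Cinematic Universe", n = 1000)
mcu_df <- twListToDF(mcu)

# the statusSource field records the client, e.g. "Twitter for iPhone"
table(ifelse(grepl("iPhone", mcu_df$statusSource), "iPhone",
      ifelse(grepl("Android", mcu_df$statusSource), "Android", "other")))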

  8. Collect all Canadian government press releases for the 2021 calendar year. Identify the date, the issuing department(s), and the number of characters in each release. Do some departments release news more frequently than others? Are some departments’ releases typically longer than average? What other insights can you draw from your data frame? Repeat the process with the French-language press releases.
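
Once the releases are in a data frame, the per-department questions reduce to a grouped summary. A sketch with dplyr (the object and column names releases, department, and n_char are assumptions):

library(dplyr)

# releases: one row per press release; the column names are assumptions
releases %>%
  group_by(department) %>%
  summarise(n_releases  = n(),
            mean_length = mean(n_char)) %>%
  arrange(desc(n_releases))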

  9. Produce a data frame listing all new products available at David’s Tea, the page number on which each product is listed, and its price.
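
A sketch of the pagination loop with rvest; the URL pattern, the CSS selectors, and the upper bound on the page count are placeholders to adapt to the live site:

library(rvest)

results <- list()
for (p in 1:20) {  # upper bound on the number of pages: a guess
  url  <- paste0("https://www.davidstea.com/new-products?page=", p)  # URL pattern: a placeholder
  page <- read_html(url)
  name  <- html_text2(html_elements(page, ".product-name"))   # selector: a placeholder
  price <- html_text2(html_elements(page, ".product-price"))  # selector: a placeholder
  if (length(name) == 0) break  # stop once a page returns no products
  results[[p]] <- data.frame(product = name, page = p, price = price)
}
products <- do.call(rbind, results)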