Module 17 Web Scraping and Automated Data Collection

by Patrick Boily, with contributions from Andrew Macfie and Lani Haque

Data analysis tools and techniques work in conjunction with collected data. The type of data required to carry out an analysis, together with the priority placed on collecting quality data relative to other demands, dictates the choice of data collection strategy. The manner in which the outputs of the analysis are used for decision support will, in turn, influence the appropriate data presentation strategies and system functionality.

We have already discussed how data can be processed and transformed to make it more suitable for analysis (see Section 8), and how questionnaire design and probabilistic sampling can be used to obtain representative datasets (see Section 5); in this module we explore the technical aspects of automated data collection and web scraping, as well as the many ways in which this activity can go awry.

Some of the material in this module is adapted from [352], [353].


17.1 Data Analysis and Web Scraping
     17.1.1 What and Why of Web Scraping
     17.1.2 Web Data Quality
     17.1.3 Ethical Considerations
     17.1.4 Decision Process

17.2 Web Technologies Basics
     17.2.1 Content Dissemination
     17.2.2 Hyper Text Transfer Protocol
     17.2.3 Web Content
     17.2.4 HTML/XML
     17.2.5 Cookies and Other Headers

17.3 Scraping Toolbox
     17.3.1 Developer Tools
     17.3.2 XPath
     17.3.3 Regular Expressions
     17.3.4 Beautiful Soup
     17.3.5 Selenium
     17.3.6 APIs
     17.3.7 Specialized Uses and Applications

17.4 Examples
     17.4.1 Wikipedia
     17.4.2 Weather Data
     17.4.3 CFL Play-by-Play
     17.4.4 Bad HTML
     17.4.5 Extracting Text from a PDF File
     17.4.6 YouTube Titles

17.5 Exercises


S. Munzert, C. Rubba, P. Meißner, and D. Nyhuis, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, 2nd ed. Wiley Publishing, 2015.
R. Mitchell, Web Scraping with Python: Collecting Data From the Modern Web, 2nd ed. O’Reilly Media, 2018.