17.1 Data Analysis and Web Scraping

Although analysts should always endeavour to work with representative and unbiased data, there will be times when the available data is flawed and not easily repaired.

We have a professional responsibility to explore the data, looking for potentially fatal flaws before the analysis begins, and to inform clients and stakeholders of any findings that could halt, skew, or simply hinder the analytical process or its applicability to the situation at hand.

We might also be called upon to provide suggestions to evaluate or fix the data collection system. The following items could help with that task.

Data Validity

the system must collect the data in such a way that data validity is ensured during initial collection. In particular, data must be collected in a way that ensures sufficient accuracy and precision of the data, relative to its intended use (a minimal sketch of such checks appears after this list);

Data Granularity, Scale of Data

the system must collect the data at a level of granularity appropriate for future analysis;

Data Coverage

the system must collect data that comprehensively, rather than only partially or unevenly, represents the objects of interest; the system must collect and store the required data over a sufficient amount of time, and at the required intervals, to support data analyses that require data spanning a certain duration;

Data Storage

the system must have the functionality to store the types and amount of data required for a particular analysis;

Data Accessibility

the system must provide access to the data relevant for a particular analysis, in a format that is appropriate for this analysis;

Computational/Analytic Functionality

the system must have the ability to carry out the computations required by relevant data analysis techniques;

Reporting, Dashboard, Visualization

the system must be able to present the results of the data analysis in a meaningful, usable and responsive fashion.
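
As an illustration of the first few items (validity, granularity, coverage), here is a minimal sketch of the kind of checks an analyst might run on an incoming extract with pandas; the file name, column names, and acceptable ranges are all hypothetical.

```python
import pandas as pd

# Hypothetical daily sales extract; column names and ranges are assumptions.
df = pd.read_csv("daily_sales.csv", parse_dates=["date"])

checks = {
    # validity: prices should be positive and within a plausible range
    "price_in_range": df["price"].between(0.01, 10_000).all(),
    # validity: no duplicated transaction identifiers
    "unique_ids": df["transaction_id"].is_unique,
    # coverage: no gaps larger than one day in the collection period
    "no_date_gaps": df["date"].sort_values().diff().max() <= pd.Timedelta(days=1),
    # granularity (rough heuristic): more than one record per day suggests
    # transaction-level rather than pre-aggregated data
    "transaction_level": df.groupby("date").size().min() > 1,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    print("Potential data quality issues:", ", ".join(failed))
```
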

 
A number of different overarching strategies for data collection can be employed. Each of these different strategies will be more or less appropriate under certain data collection circumstances, and will result in different system functional requirements; this is partly why analysts must take the time to understand their systems before embarking on data analysis (see Data Science Basics for details).

World Wide Web

It has been said that the streets of the Web are paved with data that cannot wait to be collected, but you might be surprised to discover how much of that is “trash”.

The way we share, collect, and publish data has changed over the past few years due to the ubiquity of the World Wide Web. Private businesses, governments, and individual users are posting and sharing all kinds of data and information. At every moment, new channels generate vast amounts of data.

There was a time in the recent past when both the scarcity and the inaccessibility of data were a problem for researchers and decision-makers. That is emphatically no longer the case. Data abundance carries its own set of problems, however, in the form of

  • tangled masses of data, and

  • traditional data collection methods and classical data analysis techniques that are no longer up to the task (not that the results they would give would be incorrect; it is rather their lack of efficiency that comes into play).

 
The growth in popularity and power of open source software such as R and Python, whose source code can be inspected, modified, and enhanced by anyone, makes program-based automated data collection quite appealing.

One note of warning, however: time marches on and packages become obsolete in the blink of an eye. If the analyst is unable (or unwilling) to maintain their extraction/analysis code and to monitor the sites from which the data is extracted, the choice of software will not make much of a difference.

17.1.1 What and Why of Web Scraping

So why bother with automated data collection? Common considerations include:

  • the scarcity of financial resources;

  • the lack of time or desire to collect data manually;

  • the desire to work with up-to-date, high-quality data-rich sources, and

  • the need to document the analytical process from beginning (data collection) to end (publication).

 
Manual collection, on the other hand, tends to be cumbersome and prone to error; non-reproducible processes are also subject to heightened risks of “death by boredom”, whereas program-based solutions are typically more reliable, reproducible, time-efficient, and produce datasets of higher quality (this assumes, of course, that coherently presented data exists in the first place).

Automated Data Checklist

That being said, web scraping is not always recommended. As a starting point, it is possible that no online and freely available source of data meets the analysis’ needs, in which case an approach based on survey sampling is preferable, in all likelihood.

If most of the answers to the following questions are positive, however, then an automated approach may be the right choice:

  • is there a need to repeat the task from time to time (e.g. to update a database)?

  • is there a need for others to be able to replicate the data collection process?

  • are online sources of data frequently used?

  • is the task non-trivial in terms of scope and complexity?

  • if the task can be done manually, are the financial resources required to have others do the work lacking?

  • is there a willingness to automate the process by means of programming?

 
The objective is simple: automated data collection should turn unstructured or unsorted online sources into usable datasets, at a reasonable cost.

17.1.2 Web Data Quality

Data quality issues are inescapable. It is not rare for stakeholders or clients to have spent thousands of dollars on data collection (automatic or manual) and to respond to the news that the data is flawed or otherwise unusable with: “well, it’s the best data we have, so find a way to use it.”

These issues can be side-stepped to some extent if consultants get involved in the project during or prior to the data collection stage, asking questions such as:

  • what type of data is best-suited to answer the client’s question(s)?

  • is the available data of sufficiently high quality to answer the client’s question(s)?

  • is the available information systematically flawed?

 
Web data can be first-hand information (a tweet or a news article), or second-hand (copied from an offline source or scraped from some online location, which may make it difficult to retrace).

Cross-referencing is a standard practice when dealing with secondary data. Data quality also depends on its use(s) and purpose(s). For example, a sample of tweets collected on a random day could be used to analyse the use of hashtags or the gender-specific use of words, but a sample collected on the day of the 2016 U.S. Presidential Election might not prove as useful for predicting the election's outcome (due to collection bias).

An example might help to illustrate some of the pitfalls and challenges. Let’s say that a client is interested in using a standard telephone survey to find out what people think of a new potato peeler.

Such an approach has a number of pitfalls:

  • unrepresentative sample – the selected sample might not represent the intended population;

  • systematic non-response – people who do not like phone surveys might be less (or more) likely to dislike the new potato peeler;

  • coverage error – people without a landline cannot be reached, say, and

  • measurement error – are the survey questions providing suitable info for the problem at hand?

 
Traditional solutions to these problems require the use of survey sampling, questionnaire design, omnibus surveys, reward systems, audits, etc.

These solutions can be costly, time-consuming, and ineffective. Proxies (indicators that are strongly related to the product’s popularity without measuring it directly) could be used instead.

If popularity is defined as large groups of people preferring one potato peeler over another, then sales statistics on a commercial website may provide a proxy for popularity. Rankings on Amazon.ca (or a similar website) could, in fact, paint a more comprehensive portrait of the potato peeler market than would a traditional survey.

It could suffice, then, to build a scraper that is compatible with Amazon’s application programming interface (API) to gather the appropriate data.
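
What such a collection step might look like is sketched below with requests and BeautifulSoup; the URL, user agent, and CSS selectors are hypothetical placeholders, and a real implementation would go through the retailer’s documented API and its terms of service rather than parsing pages blindly.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical listing page and CSS selectors -- any real retailer (Amazon
# included) has its own API, terms of service, and page structure, all of
# which must be checked before scraping.
URL = "https://www.example-retailer.ca/search?q=potato+peeler"
HEADERS = {"User-Agent": "research-scraper (analyst@example.org)"}

response = requests.get(URL, headers=HEADERS, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
products = []
for item in soup.select("div.product-card"):       # hypothetical selector
    name = item.select_one("span.product-title")   # hypothetical selector
    rating = item.select_one("span.avg-rating")    # hypothetical selector
    if name and rating:
        products.append({"name": name.get_text(strip=True),
                         "rating": float(rating.get_text(strip=True))})

# Rank peelers by average rating as a rough popularity proxy.
products.sort(key=lambda p: p["rating"], reverse=True)
```
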

Of course, there are potential issues with this approach as well:

  • representativeness of the listed products – are all potato peelers listed? If not, is it because that website does not sell them or is there some other reason?

  • representativeness of the customers – are there specific groups buying/not-buying online products? Are there specific groups buying from specific sites? Are there specific groups leaving/not-leaving reviews?

  • truthfulness of customers and reliability of reviews – how can we distinguish between paid (fake) reviews and real reviews?

 
Web scraping is usually well-suited for collecting data on products (such as the aforementioned potato-peeler), but there are numerous questions for which it is substantially more difficult to imagine where data could be found online: what data could be collected online to measure the popularity of a government policy, say?

17.1.3 Ethical Considerations

We now turn our attention to a burning question for consultants and analysts alike: is all the freely available data on the Internet ACTUALLY freely available?

A spider is a program that grazes or crawls the web rapidly, looking for information. It jumps from one page to another, grabbing the entire page content. Scraping, on the other hand, is defined as taking specific information from specific websites: how are these different?
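
The distinction can be made concrete with a short sketch, assuming the requests and BeautifulSoup libraries and a hypothetical seed URL: the loop behaves like a spider (it follows links and keeps entire pages), while the last step behaves like a scraper (it pulls out specific elements only).

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# A minimal spider: starting from a (hypothetical) seed page, follow links
# within the same domain and keep each page's full content.
seed = "https://www.example.org/"
domain = urlparse(seed).netloc
to_visit, visited, pages = [seed], set(), {}

while to_visit and len(visited) < 25:          # small cap keeps the crawl polite
    url = to_visit.pop()
    if url in visited:
        continue
    visited.add(url)
    html = requests.get(url, timeout=30).text
    pages[url] = html                          # the spider grabs the entire page
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == domain:    # stay on the same site
            to_visit.append(link)

# A scraper, by contrast, extracts only specific pieces of information,
# e.g. the text of every <h1> tag on the pages just collected.
titles = {url: [h.get_text(strip=True)
                for h in BeautifulSoup(html, "html.parser").find_all("h1")]
          for url, html in pages.items()}
```
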

“Scraping inherently involves copying, and therefore one of the most obvious claims against scrapers is copyright infringement.” [352]

What can be done to minimize the risk? Analysts should:

  • work as transparently as possible;

  • document data sources at all times;

  • give credit to those who originally collected and published the data;

  • keep in mind that if someone else collected the data, permission is probably required to reproduce it, and, more importantly,

  • not do anything illegal.

 
A number of cases have shown that the courts have not yet found their footing in this matter (see eBay vs. Bidder’s Edge, Associated Press vs. Meltwater, Facebook vs. Pete Warden, United States vs. Aaron Swartz, for instance [353]).

There are legal issues that we are not qualified to discuss, but in general, it seems as though larger companies/organisations usually emerge victorious from such battles.

Part of the difficulty is that it is not clear which scraping actions are illegal and which are legal, but there are rough guidelines: re-publishing content for commercial purposes is considered more problematic than downloading pages for research/analysis, say.

A site’s robots.txt (Robots Exclusion Protocol) file tells scrapers what information on the site may be harvested with the publisher’s consent – analysts must heed that file (see Figure 17.1 for examples of such files).


Figure 17.1: The Robots Exclusion Protocol files for [cqads.carleton.ca](https://cqads.carleton.ca/robots.txt), [theweathernetwork.com](https://www.theweathernetwork.com/robots.txt), and [cfl.ca](https://www.cfl.ca/robots.txt).
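
A scraper can heed such a file programmatically; here is a minimal sketch using Python’s standard-library urllib.robotparser, where the user agent string and target page are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Parse the site's Robots Exclusion Protocol file before scraping anything.
rp = RobotFileParser()
rp.set_url("https://www.cfl.ca/robots.txt")
rp.read()

# Hypothetical user agent string and target page.
user_agent = "research-scraper"
target = "https://www.cfl.ca/standings/"

if rp.can_fetch(user_agent, target):
    print("robots.txt allows fetching", target)
else:
    print("robots.txt disallows fetching", target, "- leave it alone")
```
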

Perhaps more importantly, be friendly! Not everything that can be scraped needs to be scraped. Scraping programs should

  1. behave “nicely”;

  2. provide useful data, and

  3. be efficient, in that order.

 
Any data accessed through HTML forms is stored in some sort of database. When in doubt, contact the data provider to see if they will grant direct access to the underlying databases or files. The larger the amount of data you want, the more important it is for both parties to communicate before harvesting begins (for small amounts of data this may matter less, but an amount that is small for one party is not necessarily small for the other).
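
For instance, data sitting behind a search form can often be requested directly once the form’s action URL and field names are known (they can be read off the page’s HTML); the following sketch uses requests, and the endpoint and field names are hypothetical.

```python
import requests

# Hypothetical search form endpoint and field names; inspect the actual form's
# HTML (its action attribute and input names) before adapting this sketch.
form_url = "https://www.example.org/catalogue/search"
payload = {"keywords": "potato peeler", "category": "kitchen", "page": 1}

response = requests.post(form_url, data=payload, timeout=30)
response.raise_for_status()

print(response.text[:500])  # first 500 characters of the page built from the database
```
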

Finally, note the importance of following the Scraping Do’s and Don’ts:

  1. stay identifiable;

  2. reduce traffic – accept compressed files, check that a file has been changed before accessing it again, retrieve only parts of a file (see the sketch after this list);

  3. do not bother the server with too many requests – many requests per second can bring smaller servers down, and webmasters may block a scraper that is too greedy (a few requests per second is usually fine), and

  4. write efficient and polite scrapers – there is no reason to scrape pages daily or to repeat the same task over and over, select specific resources and leave the rest untouched.
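
A minimal sketch of items 1 to 3, assuming the requests library: the scraper identifies itself, accepts compressed responses, re-downloads a file only if it has changed since the last visit, and pauses between requests. The URLs and contact address are placeholders.

```python
import time

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "research-scraper/0.1 (analyst@example.org)",  # 1. stay identifiable
    "Accept-Encoding": "gzip, deflate",                          # 2. accept compressed files
})

# Hypothetical resources to monitor, and a cache of their Last-Modified headers
# (in practice this cache would be persisted between runs).
urls = ["https://www.example.org/reports/2017.html",
        "https://www.example.org/reports/2018.html"]
last_modified = {}

for url in urls:
    headers = {}
    if url in last_modified:
        headers["If-Modified-Since"] = last_modified[url]  # 2. fetch only if changed

    response = session.get(url, headers=headers, timeout=30)
    if response.status_code == 304:   # not modified since the last visit
        continue
    response.raise_for_status()

    if "Last-Modified" in response.headers:
        last_modified[url] = response.headers["Last-Modified"]

    # ... parse response.text here ...

    time.sleep(2)  # 3. a short pause between requests keeps the server load low
```
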

 
Webpage designs tend to change quickly and often; a broken scraper will still consume bandwidth, but without any payoff.

This is all put together in an etiquette flow diagram (or perhaps “ethiquette”?) provided by [352] (see Figure 17.2 below).


Figure 17.2: Etiquette flow diagram for web scraping. [352]

17.1.4 Automated Data Collection Decision Process

Let us end this section by providing a short summary of the automated data collection decision process [352], [353], from the point of view of analysts or quantitative consultants:

  1. Know exactly what kind of information the client needs, either specific (e.g., the GDP of all OECD countries for the last 10 years, sales of the top 10 tea brands in 2017, etc.) or vague (people’s opinion on tea brand \(X\), etc.)

  2. Find out if there are any web data sources that could provide direct or indirect information on the client’s problem. That is easier to achieve for specific facts (a tea store’s webpage will provide information about teas that are currently in demand) than it is for vague facts (where would one find opinions on a collection of tea brands?). Tweets and social media platforms may contain opinion trends; commercial platforms can provide information on product satisfaction.

  3. Develop a theory of the data generation process when looking into potential data sources. When was the data generated? When was it uploaded to the Web? Who uploaded the data? Are there any areas that are not covered, or that are covered inconsistently or inaccurately? How often is the data updated?

  4. Balance the advantages and disadvantages of potential data sources. Validate the quality of the data used – are there other independent sources that provide similar information against which to cross-check? Can the original source of secondary data be identified? (A minimal cross-checking sketch follows this list.)

  5. Make a data collection decision. Choose the data sources that seem most suitable, and document reasons for this decision. Collect data from several sources to validate the final choice.
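
As a small illustration of steps 4 and 5, the following sketch cross-checks the same indicator (say, the GDP figures of step 1) obtained from two hypothetical, independent sources; the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical extracts of the same indicator from two independent sources.
source_a = pd.read_csv("gdp_source_a.csv")  # columns: country, gdp
source_b = pd.read_csv("gdp_source_b.csv")  # columns: country, gdp

merged = source_a.merge(source_b, on="country", suffixes=("_a", "_b"))

# Flag countries where the two sources disagree by more than 5%.
merged["rel_diff"] = (merged["gdp_a"] - merged["gdp_b"]).abs() / merged["gdp_b"]
discrepancies = merged[merged["rel_diff"] > 0.05]

print(f"{len(discrepancies)} countries with discrepancies above 5%:")
print(discrepancies[["country", "gdp_a", "gdp_b", "rel_diff"]])
```
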

References

[352]
S. Munzert, C. Rubba, P. Meißner, and D. Nyhuis, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, 2nd ed. Wiley Publishing, 2015.
[353]
R. Mitchell, Web Scraping with Python: Collecting Data From the Modern Web, 2nd ed. O’Reilly Media, 2018.