17.2 Web Technologies Basics

Online data can be found in text, tables, lists, links, and other structures, but the way data is presented in browsers is not necessarily how it is stored in HTML/XML.

For instance, consider the NHL’s Atlantic Division standings on 20-Mar-2018 below.

Figure 17.3: NHL’s Atlantic Division standings on 20-Mar-2018 [nhl.com]

This table is human-readable: most people familiar with professional competitions can recognize what it “means”, even if they know very little about hockey or the National Hockey League.

Under the hood, however, this is not how the information is stored: the browser builds the table from HTML markup (see Figure 17.4).

Figure 17.4: NHL’s Atlantic Division standings on 20-Mar-2018, under the hood [nhl.com]

Furthermore, when web pages are dynamic (that is, when their content is loaded or updated by scripts after the initial request), automated collection carries an additional “cost”.

Consequently, a basic knowledge of the web and of web-related technologies and documents is crucial. Information is readily available online and in [352], [353].

There are three areas of importance for data collection on the web:

  • technologies for content dissemination (HTTP, HTML/XML, JSON, plain text, etc.);

  • technologies for information extraction (R, Python, XPath, JSON parsers, Beautiful Soup, Selenium, regexps, etc.), and

  • technologies for data storage (R, Python, SQL, binary formats, plain text formats, etc.).

17.2.1 Content Dissemination

The information that web scrapers look for on webpages appears in one of the following formats:

HTML – Hypertext Markup Language

is used to display information on the web; it is not a dedicated data storage format, but it typically contains the information of interest; HTML is interpreted and transformed into “pretty” output by browsers (using CSS);

XML – Extensible Markup Language

is a popular format for exchanging data over the web; its main purpose is to store data; XML is data wrapped in user-defined tags and as such is more flexible for storing data than HTML is;

JSON – JavaScript Object Notation

is another data storage and exchange format; it is compatible with many programming languages and software; it is easier to parse than HTML or XML, and there is no need to use a specific query language (high-level R is usually sufficient);

AJAX – Asynchronous JavaScript and XML

is a group of technologies that enables websites to request data in the background of the browser session and to update their visual appearance dynamically, while allowing navigation to proceed while waiting for the server's reply (this can be a nuisance for web scrapers).

Figure 17.5: Comparison between HTML and XML code (left) and between JSON and XML code (right). [e-cartouche.ch, activeVOS.com]
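To make the difference between the storage formats concrete, the sketch below parses the same small record set twice, once from XML and once from JSON, using only Python's standard library. The team names and point totals are made-up placeholders (they are not the standings shown in Figure 17.3), and the tag and key names are our own assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records (placeholder teams and points, not real standings).
xml_text = """
<standings division="Atlantic">
  <team><name>Team A</name><points>100</points></team>
  <team><name>Team B</name><points>95</points></team>
</standings>
"""

json_text = """
{"division": "Atlantic",
 "teams": [{"name": "Team A", "points": 100},
           {"name": "Team B", "points": 95}]}
"""

# XML: we navigate the tree of user-defined tags and convert types ourselves.
root = ET.fromstring(xml_text.strip())
from_xml = [
    {"name": team.findtext("name"), "points": int(team.findtext("points"))}
    for team in root.findall("team")
]

# JSON: the parser directly returns nested dictionaries and lists.
from_json = json.loads(json_text)["teams"]

print(from_xml == from_json)
```

Note that the XML values arrive as text and must be converted explicitly, whereas the JSON parser already distinguishes numbers from strings.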

17.2.2 Hypertext Transfer Protocol

Hypertext Transfer Protocol (HTTP) is the message protocol used between web browsers and web servers; Hypertext Transfer Protocol Secure (HTTPS) runs HTTP over Transport Layer Security (TLS), the successor to the Secure Sockets Layer (SSL), which provides both encryption and server authentication.

In a nutshell, when we type in a URL in a browser to access a web page, the browser sends an HTTP request to the underlying server.

A request is made up of a verb, a path, a list of headers, and possibly some parameters. Common verbs include GET (e.g., clicking on a link) and POST (e.g., filling out a form and submitting it).

For instance, if we type http://www.yahoo.com/search into the browser, a GET request is sent by the browser to the yahoo.com server, together with the path /search.

The web server then sends a response to the browser, containing a code (404, 200, etc.), as well as headers and content.

For instance, the 200 response means that the request was successful: the browser reads the content and uses it (and CSS files) to display the web page.
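The request/response cycle described above can be sketched without touching the real web. In the example below, a toy server (our own assumption, running on the local machine only) answers GET requests for the path /search with a 200 code and anything else with a 404 code; a small client then sends requests and reads back the code, a header, and the content. The handler class, the `fetch` helper, and the user agent string are hypothetical names introduced for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: 200 for the path /search, 404 for everything else."""

    def do_GET(self):
        if self.path == "/search":
            body = b"<html><body><p>results</p></body></html>"
            self.send_response(200)
        else:
            body = b"<html><body><p>not found</p></body></html>"
            self.send_response(404)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default request logging


def fetch(port, path):
    """Send a GET request (verb + path + headers) and return the response parts."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", path, headers={"User-Agent": "demo-client/0.1"})
    response = conn.getresponse()
    parts = (response.status,
             response.getheader("Content-Type"),
             response.read().decode("utf-8"))
    conn.close()
    return parts


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

ok_status, ok_type, ok_body = fetch(port, "/search")
missing_status, _, _ = fetch(port, "/nope")

server.shutdown()
server.server_close()

print(ok_status, ok_type)
print(missing_status)
```

A scraper typically checks the response code first and only parses the content when the request was successful.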

Figure 17.6: Schematics of HTTP (top) and AJAX (bottom) requests; a new HTTP request refreshes the entire page, a new AJAX request only refreshes the data [javabelazy.blogspot.ca]

17.2.3 Web Content

Webpage content itself comes in three main types:

  • Hypertext Markup Language and variants (HTML/XML) are used for web content and code;

  • Cascading Style Sheets (CSS) is used to define the webpage style, and

  • JavaScript (JS) is used to provide webpage interactivity.

 
HTML is, in some sense, the most fundamental (the other two are optional); HTML is a document language, like \(\LaTeX\) or markdown (on which this book is based). A fresh HTTP/HTTPS request for a page usually returns an HTML file, which may contain references to additional server files (CSS, JavaScript, images, etc.) – the browser makes additional requests for these when the webpage is rendered.
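As a rough illustration of the additional requests mentioned above, the following sketch scans an HTML document for references to style sheets, scripts, and images, and resolves them against the page's address. The page content, the file names, and the base URL are hypothetical; a real browser considers many more kinds of references than the three handled here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect the CSS, JavaScript, and image files referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.resources.append(("css", attrs["href"]))
        elif tag == "script" and "src" in attrs:
            self.resources.append(("js", attrs["src"]))
        elif tag == "img" and "src" in attrs:
            self.resources.append(("image", attrs["src"]))


# Hypothetical page and address, for illustration only.
base_url = "http://www.example.com/index.html"
page = """
<html>
  <head>
    <title>Demo</title>
    <link rel="stylesheet" href="style.css">
    <script src="app.js"></script>
  </head>
  <body><img src="logo.png" alt="logo"><p>Hello</p></body>
</html>
"""

collector = ResourceCollector()
collector.feed(page)
collector.close()

# Each relative reference becomes the URL of a follow-up request.
follow_ups = [(kind, urljoin(base_url, ref)) for kind, ref in collector.resources]
for kind, url in follow_ups:
    print(kind, url)
```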

Understanding the tree structure of HTML documents goes a long way towards helping analysts make full use of the scraping toolbox.

CSS defines the colour schemes, the fonts, spacing, and so on. It operates basically as a PowerPoint template would. In the absence of a CSS file, the browser uses a default style to render the webpage.

JS, on the other hand, is a programming language. After the browser parses and displays the HTML file, it executes any JS files referenced in the HTML. JS can be used to manipulate most things on the page (delete/add/change content, change CSS, fetch more files from the server, go to a new page, etc.), and it can set up actions that run as a result of page events (clicking a button, typing in a text box, etc.).

17.2.4 HTML/XML

HTML syntax is fairly straightforward. HTML is a document language based on tags. Tags either come in pairs:

  • <title>...</title> (self-explanatory),

  • <b>...</b> (bold face text),

  • etc.,

or as standalone singletons:

  • <br> (linebreak),

  • <hr> (horizontal rule),

  • etc.

Paired tags are nested:

  • <em><strong>...</strong></em> is acceptable, whereas

  • <em><strong>...</em></strong> is not.

 
An HTML file is a tree of tags, also known as elements or nodes.
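The nesting rule is what makes an HTML file a tree, and it can be checked with a stack: every opening tag is pushed, and every closing tag must match the tag on top of the stack. The sketch below is a simplified checker built on Python's standard `html.parser` module; the `VOID_TAGS` set is a partial, hand-picked list of singleton tags, and the class and function names are our own.

```python
from html.parser import HTMLParser

# Partial list of singleton (void) tags, which never have a closing partner.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. <br/> is reported as a start tag followed by an end tag
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"unexpected </{tag}>")


def is_well_nested(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    # Well nested: no mismatched closing tag and nothing left open.
    return not checker.errors and not checker.stack


print(is_well_nested("<em><strong>text</strong></em>"))
print(is_well_nested("<em><strong>text</em></strong>"))
```

Browsers are far more forgiving than this checker: they silently repair many nesting mistakes, which is one reason scraped HTML can differ from what the page source suggests.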

Tags consist of a name/type (mandatory) and attributes (optional): the tag <p lang="en">...</p>, for instance, is of type p (paragraph), and it has a single attribute: lang="en".

Plain text is allowed inside tags: <span>Hello World!</span>.
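The three ingredients just described (tag names, attributes, and plain text) can be made visible by printing a document as an indented outline. This is a minimal sketch using Python's standard `html.parser`; the `OutlinePrinter` class, the partial `VOID_TAGS` set, and the input string are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Partial list of singleton tags that do not open a new level in the tree.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlinePrinter(HTMLParser):
    """Record one line per tag or text node, indented by depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


printer = OutlinePrinter()
printer.feed('<p lang="en"><span>Hello World!</span><br>Bye</p>')
printer.close()
print("\n".join(printer.lines))
```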

Beyond this, the only other thing left to learn is the set of possible tags, and the set of possible attributes. The list is extensive; information can be found at [354].

Two attributes are particularly important for web scraping:

  • id uniquely identifies an element: <a id="product">...</a>, <a id="saleInfo">...</a>, etc.;

  • class can contain multiple values, separated by spaces; a class value is not unique to one element, but instead identifies a set of elements: <h1 class="lightBackground oddPage">...</h1>, etc.
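The following sketch shows why these two attributes matter in practice: it indexes the elements of a small document by id (one element per value) and by class (possibly many elements per value). It relies only on Python's standard `html.parser`; the `AttributeIndex` class and the sample document are hypothetical, although the attribute values are the ones used in the examples above.

```python
from html.parser import HTMLParser


class AttributeIndex(HTMLParser):
    """Index tag names by their id (unique) and class (shared) attributes."""

    def __init__(self):
        super().__init__()
        self.by_id = {}
        self.by_class = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.by_id[attrs["id"]] = tag
        # A class attribute may hold several space-separated values.
        for name in (attrs.get("class") or "").split():
            self.by_class.setdefault(name, []).append(tag)


document = (
    '<a id="product">Widget</a>'
    '<a id="saleInfo">50% off</a>'
    '<h1 class="lightBackground oddPage">Title</h1>'
    '<p class="oddPage">Some text</p>'
)

index = AttributeIndex()
index.feed(document)
index.close()

print(index.by_id)
print(index.by_class)
```

Dedicated scraping libraries offer the same kind of lookup through CSS selectors or XPath expressions, so this index is only meant to show the idea.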

17.2.5 Cookies and Other Headers

We briefly discuss three common headers:

  • a cookie is a string that is sent and received with HTTP/HTTPS; it allows servers to keep track of user sessions. Upon logging on to a website, users receive a cookie. If the cookie is included in future requests, the user (and their preferences and choices) is recognized by the server; otherwise, the website acts as though the user has logged out.

  • user agent contains the name and the version of the user’s browser.

  • referrer (spelled Referer in the HTTP header itself) sends the page URL from which the request was initiated; if the user is on Page A and clicks a link to Page B, the server for Page B will see that the user came from Page A.
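As a sketch of how these three headers travel with a request, the example below parses a cookie received from a server and attaches it, together with a user agent and a referrer, to a follow-up request. Nothing is sent over the network: the request object is only built and inspected. The cookie value, the URLs, and the user agent string are hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.request import Request

# Hypothetical Set-Cookie value, as a server might send it after a login.
received = SimpleCookie()
received.load("sessionid=abc123; Path=/")

# Only the name=value pairs are sent back to the server in later requests.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in received.items()
)

# Build (but do not send) the follow-up request for Page B, coming from Page A.
request = Request(
    "https://example.com/page-b",
    headers={
        "Cookie": cookie_header,
        "User-Agent": "demo-scraper/0.1",
        "Referer": "https://example.com/page-a",  # the header's official spelling
    },
)

print(request.get_method())
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Leaving the cookie out of the follow-up request would make the server treat it as coming from a user who is not logged in.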

References

[352]
S. Munzert, C. Rubba, P. Meißner, and D. Nyhuis, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, 2nd ed. Wiley Publishing, 2015.
[353]
R. Mitchell, Web Scraping with Python: Collecting Data From the Modern Web, 2nd ed. O’Reilly Media, 2018.
[354]