17.2 Web Technologies Basics
Online data can be found in text, tables, lists, links, and other structures, but the way data is presented in browsers is not necessarily how it is stored in HTML/XML.
For instance, consider the NHL’s Atlantic Division standings on 20-Mar-2018 below.
This table is human-readable: most people familiar with professional competitions can recognize what it “means”, even if they know very little about hockey or the National Hockey League.
In the browser, however, this is not how the information is found (see Figure 17.4).
Furthermore, when web pages are dynamic, there is a “cost” associated with automated collection.
There are three areas of importance for data collection on the web:
- technologies for content dissemination (HTTP, HTML/XML, JSON, plain text, etc.);
- technologies for information extraction (Python, XPath, JSON parsers, Beautiful Soup, Selenium, regexps, etc.), and
- technologies for data storage (Python, SQL, binary formats, plain text formats, etc.).
17.2.1 Content Dissemination
The information that web scrapers look for on webpages appears in one of the following formats:
- HTML – Hypertext Markup Language
is used to display information on the web; it is not a dedicated data storage format, but it typically contains the information of interest; HTML is interpreted and transformed into “pretty” output by browsers (using CSS);
- XML – Extensible Markup Language
is a popular format for exchanging data over the web; its main purpose is to store data; XML is data wrapped in user-defined tags and as such is more flexible for storing data than HTML is;
- JSON – JavaScript Object Notation
is another data storage and exchange format; it is compatible with many programming languages and software; it is easier to parse than HTML or XML, and there is no need to use a specific query language (high-level R is usually sufficient);
- AJAX – Asynchronous JavaScript and XML
is a group of technologies that enables websites to request data in the background of the browser session and update their visual appearance dynamically, while allowing navigation to proceed while waiting for the server's reply (this can be a nuisance for web scrapers).
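To illustrate why JSON is the easiest of these formats to work with, the sketch below parses a single standings row with Python's standard library (the field names and values are invented for illustration, not taken from the actual 2018 standings):

```python
import json

# One standings row, stored as a JSON string (fields invented for illustration).
record = '{"team": "Tampa Bay Lightning", "GP": 72, "W": 47, "L": 20, "OT": 5, "PTS": 99}'

# json.loads() turns the string into a regular Python dictionary:
# no query language or tag navigation is needed.
row = json.loads(record)
print(row["team"], row["PTS"])
```

Extracting the same row from an HTML table would require locating the right tags first; with JSON, the data structure is available immediately after parsing.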
17.2.2 Hypertext Transfer Protocol
Hypertext Transfer Protocol (HTTP) is a message language used between web browsers and web servers; Hypertext Transfer Protocol Secure (HTTPS) combines HTTP with the SSL/TLS protocols, which provide encryption and authentication.
In a nutshell, when we type in a URL in a browser to access a web page, the browser sends an HTTP request to the underlying server.
A request is made up of a verb, a path, a list of headers, and possibly some parameters. Common verbs include: GET (following a link) and POST (filling out and submitting a form).
For instance, if we type
http://www.yahoo.com/search into the browser, a GET request is sent by the browser to the
yahoo.com server, together with the path /search.
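The split of a typed URL into server and path can be seen directly with Python's standard library:

```python
from urllib.parse import urlparse

# Break the typed URL into the pieces that make up the HTTP request.
parts = urlparse("http://www.yahoo.com/search")
print(parts.scheme)  # http
print(parts.netloc)  # www.yahoo.com
print(parts.path)    # /search
```

The netloc component names the server that receives the GET request; the path component is sent along with it.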
The web server then sends a response to the browser, containing a status code (200, etc.), as well as headers and content.
For instance, the 200 response means that the request was successful: the browser reads the content and uses it (and CSS files) to display the web page.
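The request and response are plain text under the hood. The sketch below writes out a simplified exchange by hand (the header values are invented) and pulls the status code out of the response, which is how a scraper would check that a request succeeded:

```python
# A minimal HTTP exchange, written out as raw text (simplified for illustration).
request = (
    "GET /search HTTP/1.1\r\n"
    "Host: www.yahoo.com\r\n"
    "User-Agent: example-scraper/0.1\r\n"
    "\r\n"
)

response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
    "<html><body>...</body></html>"
)

# The status code is the second field of the response's first line.
status_line = response.split("\r\n", 1)[0]
code = int(status_line.split()[1])
print(code)  # 200
```

In practice, libraries handle this exchange and expose the status code, headers, and content as attributes of a response object.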
17.2.3 Web Content
Webpage content itself comes in three main types:
- Hypertext Markup Language and variants (HTML/XML) is used for web content and code;
- Cascading Style Sheets (CSS) is used to define the webpage style, and
- JavaScript (JS) is used to add dynamic behaviour to the page.
Understanding the tree structure of HTML documents goes a long way towards helping analysts make full use of the scraping toolbox.
CSS defines the colour schemes, the fonts, spacing, and so on. It operates basically as a PowerPoint template would. In the absence of a CSS file, the browser uses a default style to render the webpage.
JS, on the other hand, is a programming language. After the browser parses and displays the HTML file, it executes any JS files referenced in the HTML. JS can be used to manipulate most things on the page (delete/add/change content, change CSS, fetch more files from the server, go to a new page, etc.), and it can set up actions that run as a result of page events (clicking a button, typing in a text box, etc.).
17.2.4 HTML Syntax
HTML syntax is fairly straightforward. HTML is a document language based on tags. Tags either come in pairs:
<b>...</b> (bold face text),
or as stand-alone singletons:
<br/> (line break), for instance.
Paired tags are properly nested:
<em><strong>...</strong></em> is acceptable, whereas
<em><strong>...</em></strong> is not.
An HTML file is a tree of tags, also known as elements or nodes.
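For well-formed markup, this tree structure can be made visible with Python's standard XML parser (used here only as an illustration; real-world HTML is often not well-formed enough for a strict XML parser):

```python
import xml.etree.ElementTree as ET

# Properly nested tags parse into a tree: <em> is the root,
# <strong> is its child, and the plain text sits inside <strong>.
tree = ET.fromstring("<em><strong>nested text</strong></em>")
print(tree.tag)      # em
print(tree[0].tag)   # strong
print(tree[0].text)  # nested text
```

Scraping tools such as XPath and Beautiful Soup work by navigating exactly this kind of tree.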
Tags consist of a name/type (mandatory) and attributes (optional): the tag
<p lang="en">...</p>, for instance, is of type
p (paragraph), and it has a single attribute, lang, whose value is "en".
Plain text is allowed inside tags: <p>Some paragraph text.</p>, for example.
Beyond this, the only other thing left to learn is the set of possible tags and the set of possible attributes; the list is extensive, and comprehensive references are available online.
Two attributes are particularly important for web scraping:
- id uniquely identifies an element:
<a id="saleInfo">...</a>, etc.;
- class can contain multiple values, separated by spaces, and is not unique, but it identifies a set of elements:
<h1 class="lightBackground oddPage">...</h1>, etc.
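As a sketch of how a scraper exploits these attributes, the following uses only Python's standard library to find the element carrying a given id in a tiny invented page (real scrapers would typically use Beautiful Soup or XPath for this):

```python
from html.parser import HTMLParser

# A tiny page (invented for illustration) using the id and class attributes.
PAGE = """
<html><body>
  <h1 class="lightBackground oddPage">Standings</h1>
  <a id="saleInfo" href="/sale">Sale details</a>
</body></html>
"""

class AttrFinder(HTMLParser):
    """Records the tag name and attributes of the element with a given id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == self.target_id:
            self.found = (tag, attrs)

finder = AttrFinder("saleInfo")
finder.feed(PAGE)
print(finder.found)  # ('a', {'id': 'saleInfo', 'href': '/sale'})
```

Because an id is unique within a page, matching on it pins down exactly one element; matching on a class value would instead collect a set of elements.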
17.2.5 Cookies and Other Headers
We discuss briefly three common headers:
- a cookie is a string that is sent and received with HTTP/HTTPS; it allows servers to keep track of user sessions. Upon logging on to a website, users receive a cookie. If the cookie is included in future requests, the user (and their preferences and choices) is recognized by the server; otherwise, the website acts as though the user has logged out;
- the user agent contains the name and version of the user's browser;
- the referrer sends the URL of the page from which the request was initiated (in HTTP, the header name is spelled "Referer"); if the user is on Page A and clicks a link to Page B, the server for Page B will see that the user came from Page A.
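All three headers can be set explicitly when scraping; the sketch below attaches them to a request with Python's standard library (the URLs, cookie value, and agent string are invented, and the request is only constructed, not sent):

```python
from urllib.request import Request

# A request carrying the three headers discussed above; all values are invented.
req = Request(
    "http://www.example.com/pageB",
    headers={
        "User-Agent": "example-scraper/0.1",
        "Referer": "http://www.example.com/pageA",  # the HTTP spelling
        "Cookie": "session=abc123",
    },
)

# urllib normalizes header names to capitalized form for lookup.
print(req.get_header("User-agent"))  # example-scraper/0.1
```

Setting a sensible user agent and sending cookies back to the server is often necessary for a scraper to be treated like a logged-in browser session.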