17.3 Scraping Toolbox

From experience, we know that a number of tools can facilitate the automated data extraction process, including:

  • Developer Tools,

  • XPath,

  • regular expressions,

  • Beautiful Soup, and

  • Selenium.

17.3.1 Developer Tools

Developer Tools allow us to see the correspondence between the HTML code for a page and the rendered version seen in the browser, as illustrated in Figure 17.7.


Figure 17.7: Inspecting Nice Peter’s website’s elements using Chrome’s Developer Tools.

Unlike “View Source”, Developer Tools show the dynamic version of the HTML content (i.e. the HTML is shown with any changes made by JavaScript since the page was first received). Inspecting a page’s various elements and discovering where they reside in the HTML file is crucial to efficient web scraping:

  • Firefox – right click page \(\to\) Inspect Element

  • Safari – Safari \(\to\) Preferences \(\to\) Advanced \(\to\) Show Develop Menu in Menu Bar, then Develop \(\to\) Show Web Inspector

  • Chrome – right click page \(\to\) Inspect

17.3.2 XPath

XPath is a domain-specific query language used to select specific pieces of information from marked-up documents such as HTML, XML, or variants like SVG and RSS. Before this can be done, the information stored in a marked-up document must be converted (or parsed) into a format suitable for processing and statistical analysis; this is implemented in the R package XML, for instance.

The process is simple; it involves

  1. specifying the data of interest;

  2. locating it in a specific document, and

  3. tailoring a query to the document to extract the desired information.

 
HTML/XML tags have attributes and values. HTML files must be parsed before they can be queried by XPath. XPath queries require both a path and a document to search; a path is a hierarchical addressing mechanism (a succession of nodes separated by forward slashes "/"), while a query takes the form xpathSApply(doc, path).

xpathSApply(parsed_doc, "/html/body/div/p/i"), for instance, would find all <i> tags found under a <p> tag, itself found under a <div> tag in the body of the HTML file of parsed_doc. Consult [352] for a substantially heftier introduction.
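For readers working in Python, the same kind of query can be sketched with the standard library's xml.etree.ElementTree, which supports a subset of XPath (the toy document below is an assumption for illustration, not the book's file):

```python
import xml.etree.ElementTree as ET

# A tiny well-formed document, standing in for a parsed HTML file
doc = """<html><body>
  <div><p><i>Communication usually fails, except by accident.</i></p></div>
  <div><p><i>90% of everything is crap.</i></p></div>
</body></html>"""

root = ET.fromstring(doc)
# Relative path: every <i> under a <p> under a <div>, anywhere in the tree
hits = [i.text for i in root.findall(".//div/p/i")]
print(hits)
```
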

We will illustrate XPath’s functionality with the help of the following webpage:


Figure 17.8: A simple HTML document, rendered in a browser, based on [355].

The underlying HTML code is in the file laws.html; we parse the document using XML’s htmlParse().

# library(XML)
parsed_doc <- XML::htmlParse(file = "Data/laws.html")
print(parsed_doc)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head><title>Laws of the Internet</title></head>
<!-- From M. Jones' 15 Fundamental Laws of the Internet --><body>
<h1>Laws of the <i>Internet</i>
</h1>
<div id="wiio" lang="english" date="1978">
  <h2>Osmo Antero Wiio</h2>
  <p><i>Communication usually fails, except by accident.</i></p>
  <p><b>Source: </b>Wiion lait - ja vähän muidenkin</p>
</div>

<div lang="english" date="1986">
  <h2>Melvin Kranzberg</h2>
  <p><i>Technology is neither good nor bad; nor is it neutral.</i> <br><emph>(Kranzberg's 1st Law)</emph></p>
  <p><b>Source: </b><a href="https://www.jstor.org/stable/3105385">Technology and Culture. 27 (3): 544–560.</a></p>
</div>

<div lang="english" date="1958">
  <h2>Theodore Sturgeon</h2>
  <p><i>90% of everything is crap.</i> <br><emph>(Sturgeon's Revelation)</emph></p>
  <p><b>Source: </b>"Books: On Hand". Venture Science Fiction. Vol. 2, no. 2. p. 66.</p>
</div>

<div id="other">
<h2>Others:</h2>
<ul>
<li>The 1% Rule: "Only 1% of the users of a website actively create new content, while the other 99% of the participants only lurk."</li>
<li>D!@kwad Theory: "Normal Person + Anonymity + Audience = Total D!@kwad"</li>
<li>Godwin's Law: "As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches one."</li>
<li>Poe's Law: "Without a clear indicator of the author's intent, parodies of extreme views will be mistaken by some readers or viewers as sincere expressions of the parodied views."</li>
<li>Skitt's Law: "Any post correcting an error in another post will contain at least one error itself."</li>
<li>Law of Exclamation: "The more exclamation points used in an email (or other posting), the more likely it is a complete lie."</li>
<li>Cunningham's Law: "The best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer."</li>
<li>The Wiki Rule: "There's a wiki for that."</li>
<li>Danth's Law: "If you have to insist that you've won an Internet argument, you've probably lost badly."</li>
<li>Law of the Echo Chamber: "If you feel comfortable enough to post an opinion of any importance on any given Internet site, you are most likely delivering that opinion to people who already agree with you."</li>
<li>Munroe's Law: "You will never change anyone's opinion on anything by making a post on the Internet. This will not stop you from trying."</li>
</ul>
</div>

<address>
<a href="https://exceptionnotfound.net/15-fundamental-laws-of-the-internet/"><i>15 Fundamental Laws of the Internet</i></a>, by Matthew Jones<a></a>
</address>

</body>
</html>
 

Basic Structural Queries

XPath queries are called using xpathSApply(), which requires a parsed document doc and a query path path.

It is much easier to determine the required query paths if we have some idea of the structure of the underlying HTML document tree (see Figure 17.9 for an example).


Figure 17.9: The HTML document tree for the built-in fortunes.html file [352].

 
Absolute paths are represented by single forward slashes [/].

XML::xpathSApply(doc = parsed_doc, path = "/html/body/div/p/i")
[[1]]
<i>Communication usually fails, except by accident.</i> 

[[2]]
<i>Technology is neither good nor bad; nor is it neutral.</i> 

[[3]]
<i>90% of everything is crap.</i> 

Relative paths are represented by double forward slashes [//].

XML::xpathSApply(parsed_doc, "//body//p/i")
[[1]]
<i>Communication usually fails, except by accident.</i> 

[[2]]
<i>Technology is neither good nor bad; nor is it neutral.</i> 

[[3]]
<i>90% of everything is crap.</i> 
XML::xpathSApply(parsed_doc, "//p/i")
[[1]]
<i>Communication usually fails, except by accident.</i> 

[[2]]
<i>Technology is neither good nor bad; nor is it neutral.</i> 

[[3]]
<i>90% of everything is crap.</i> 

Wildcards are represented by an asterisk [*].

XML::xpathSApply(parsed_doc, "/html/body/div/*/i")
[[1]]
<i>Communication usually fails, except by accident.</i> 

[[2]]
<i>Technology is neither good nor bad; nor is it neutral.</i> 

[[3]]
<i>90% of everything is crap.</i> 

Going up one level in the parsed tree is represented by a double dot [..].

XML::xpathSApply(parsed_doc, "//title/..")
[[1]]
<head>
  <title>Laws of the Internet</title>
</head> 

The disjunction (OR) of two paths is represented by the operator [|].

XML::xpathSApply(parsed_doc, "//address | //title")
[[1]]
<title>Laws of the Internet</title> 

[[2]]
<address>
<a href="https://exceptionnotfound.net/15-fundamental-laws-of-the-internet/"><i>15 Fundamental Laws of the Internet</i></a>, by Matthew Jones<a/>
</address> 

We can also concatenate multiple queries.

twoQueries <- c(address = "//address", title = "//title")
XML::xpathSApply(parsed_doc, twoQueries)
[[1]]
<title>Laws of the Internet</title> 

[[2]]
<address>
<a href="https://exceptionnotfound.net/15-fundamental-laws-of-the-internet/"><i>15 Fundamental Laws of the Internet</i></a>, by Matthew Jones<a/>
</address> 

Note, however, that absolute (or even relative) paths cannot always succinctly select nodes in large or complicated files.

Node Relations

A query’s path can also exploit a node’s relation to other nodes. By analogy with a family tree, a node’s placement in the parsed tree often mimics the relations in extended families.

Relations are denoted according to node1/relation::node2. For instance:

  • "//a/ancestor::div" returns all <div> nodes that are an ancestor to an <a> node;

  • "//a/ancestor::div//i" returns all <i> nodes contained in a <div> node that is an ancestor to an <a> node etc.


Figure 17.10: Generic node relations [352].

 
The following XPath query looks for <a> tags in the document, and returns their ancestor <div> nodes (there is only one such <div> in this example).

XML::xpathSApply(parsed_doc, "//a/ancestor::div")
[[1]]
<div lang="english" date="1986">
  <h2>Melvin Kranzberg</h2>
  <p><i>Technology is neither good nor bad; nor is it neutral.</i> <br/><emph>(Kranzberg's 1st Law)</emph></p>
  <p><b>Source: </b><a href="https://www.jstor.org/stable/3105385">Technology and Culture. 27 (3): 544–560.</a></p>
</div> 

The following XPath query looks for <a> tags in the document, and returns all <i> tags contained in their ancestor <div> nodes (there is only one in this example).

XML::xpathSApply(parsed_doc, "//a/ancestor::div//i")
[[1]]
<i>Technology is neither good nor bad; nor is it neutral.</i> 

The following XPath query looks for <p> tags in the document, and returns the <h2> nodes among their preceding siblings (there are three in this example).

XML::xpathSApply(parsed_doc, "//p/preceding-sibling::h2")
[[1]]
<h2>Osmo Antero Wiio</h2> 

[[2]]
<h2>Melvin Kranzberg</h2> 

[[3]]
<h2>Theodore Sturgeon</h2> 
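Python's standard-library ElementTree has no ancestor:: axis, but the relation can be emulated with a parent map; this is a sketch on a made-up toy document, not the book's R approach:

```python
import xml.etree.ElementTree as ET

doc = """<body>
  <div><h2>Melvin Kranzberg</h2><p><a href="x">source</a></p></div>
  <div><h2>Theodore Sturgeon</h2><p>no link here</p></div>
</body>"""
root = ET.fromstring(doc)

# ElementTree stores no parent pointers, so build a child -> parent map
parent = {child: p for p in root.iter() for child in p}

def ancestors(node):
    while node in parent:
        node = parent[node]
        yield node

# All <div> ancestors of <a> nodes, like "//a/ancestor::div"
divs = [anc for a in root.iter("a") for anc in ancestors(a) if anc.tag == "div"]
print([d.find("h2").text for d in divs])
```
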

What do you think this query will do?

XML::xpathSApply(parsed_doc, "//title/parent::*")

XPath Predicates

A predicate is a function that applies to a node’s name, value, or attributes and returns a logical TRUE or FALSE. Predicates modify the path input of an XPath query: the query selects the nodes for which the predicate holds.

Predicates are denoted by square brackets, placed after a node. For instance:

  • "//p[position()=1]" returns the first <p> node relative to its parent node;

  • "//p[last()]" returns the last <p> node relative to its parent node, and

  • "//div[count(./@*)>2]" returns all <div> nodes with 2+ attributes.

 
This XPath query finds the first <p> node in each <div> node.

XML::xpathSApply(parsed_doc, "//div/p[position()=1]")
[[1]]
<p>
  <i>Communication usually fails, except by accident.</i>
</p> 

[[2]]
<p><i>Technology is neither good nor bad; nor is it neutral.</i> <br/><emph>(Kranzberg's 1st Law)</emph></p> 

[[3]]
<p><i>90% of everything is crap.</i> <br/><emph>(Sturgeon's Revelation)</emph></p> 

This XPath query finds the last <p> node in each <div> node.

XML::xpathSApply(parsed_doc, "//div/p[last()]")
[[1]]
<p><b>Source: </b>Wiion lait - ja vähän muidenkin</p> 

[[2]]
<p>
  <b>Source: </b>
  <a href="https://www.jstor.org/stable/3105385">Technology and Culture. 27 (3): 544–560.</a>
</p> 

[[3]]
<p><b>Source: </b>"Books: On Hand". Venture Science Fiction. Vol. 2, no. 2. p. 66.</p> 

This XPath query finds the second-to-last <p> node in each <div> node.

XML::xpathSApply(parsed_doc, "//div/p[last()-1]")
[[1]]
<p>
  <i>Communication usually fails, except by accident.</i>
</p> 

[[2]]
<p><i>Technology is neither good nor bad; nor is it neutral.</i> <br/><emph>(Kranzberg's 1st Law)</emph></p> 

[[3]]
<p><i>90% of everything is crap.</i> <br/><emph>(Sturgeon's Revelation)</emph></p> 
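ElementTree's position predicates mirror position()=1, last(), and last()-1 directly; here is a minimal Python sketch on an invented document:

```python
import xml.etree.ElementTree as ET

doc = """<body>
  <div><p>first</p><p>middle</p><p>last</p></div>
  <div><p>only-first</p><p>only-last</p></div>
</body>"""
root = ET.fromstring(doc)

print([p.text for p in root.findall(".//div/p[1]")])        # first <p> per <div>
print([p.text for p in root.findall(".//div/p[last()]")])   # last <p> per <div>
print([p.text for p in root.findall(".//div/p[last()-1]")]) # second-to-last
```
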

This XPath query finds the <div> nodes that have at least one <a> node among their children.

XML::xpathSApply(parsed_doc, "//div[count(.//a)>0]")
[[1]]
<div lang="english" date="1986">
  <h2>Melvin Kranzberg</h2>
  <p><i>Technology is neither good nor bad; nor is it neutral.</i> <br/><emph>(Kranzberg's 1st Law)</emph></p>
  <p><b>Source: </b><a href="https://www.jstor.org/stable/3105385">Technology and Culture. 27 (3): 544–560.</a></p>
</div> 

This XPath query finds the <div> nodes that have more than 2 attributes.

XML::xpathSApply(parsed_doc, "//div[count(./@*)>2]")
[[1]]
<div id="wiio" lang="english" date="1978">
  <h2>Osmo Antero Wiio</h2>
  <p><i>Communication usually fails, except by accident.</i></p>
  <p><b>Source: </b>Wiion lait - ja vähän muidenkin</p>
</div> 

This XPath query finds the nodes for which the text component has more than 50 characters.

XML::xpathSApply(parsed_doc, "//*[string-length(text())>50]")
[[1]]
<i>Technology is neither good nor bad; nor is it neutral.</i> 

[[2]]
<p><b>Source: </b>"Books: On Hand". Venture Science Fiction. Vol. 2, no. 2. p. 66.</p> 

[[3]]
<li>The 1% Rule: "Only 1% of the users of a website actively create new content, while the other 99% of the participants only lurk."</li> 

[[4]]
<li>D!@kwad Theory: "Normal Person + Anonymity + Audience = Total D!@kwad"</li> 

[[5]]
<li>Godwin's Law: "As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches one."</li> 

[[6]]
<li>Poe's Law: "Without a clear indicator of the author's intent, parodies of extreme views will be mistaken by some readers or viewers as sincere expressions of the parodied views."</li> 

[[7]]
<li>Skitt's Law: "Any post correcting an error in another post will contain at least one error itself."</li> 

[[8]]
<li>Law of Exclamation: "The more exclamation points used in an email (or other posting), the more likely it is a complete lie."</li> 

[[9]]
<li>Cunningham's Law: "The best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer."</li> 

[[10]]
<li>Danth's Law: "If you have to insist that you've won an Internet argument, you've probably lost badly."</li> 

[[11]]
<li>Law of the Echo Chamber: "If you feel comfortable enough to post an opinion of any importance on any given Internet site, you are most likely delivering that opinion to people who already agree with you."</li> 

[[12]]
<li>Munroe's Law: "You will never change anyone's opinion on anything by making a post on the Internet. This will not stop you from trying."</li> 

This XPath query finds all <div> nodes with 2 or fewer attributes.

XML::xpathSApply(parsed_doc, "//div[not(count(./@*)>2)]")
[[1]]
<div lang="english" date="1986">
  <h2>Melvin Kranzberg</h2>
  <p><i>Technology is neither good nor bad; nor is it neutral.</i> <br/><emph>(Kranzberg's 1st Law)</emph></p>
  <p><b>Source: </b><a href="https://www.jstor.org/stable/3105385">Technology and Culture. 27 (3): 544–560.</a></p>
</div> 

[[2]]
<div lang="english" date="1958">
  <h2>Theodore Sturgeon</h2>
  <p><i>90% of everything is crap.</i> <br/><emph>(Sturgeon's Revelation)</emph></p>
  <p><b>Source: </b>"Books: On Hand". Venture Science Fiction. Vol. 2, no. 2. p. 66.</p>
</div> 

[[3]]
<div id="other">
<h2>Others:</h2>
<ul><li>The 1% Rule: "Only 1% of the users of a website actively create new content, while the other 99% of the participants only lurk."</li>
<li>D!@kwad Theory: "Normal Person + Anonymity + Audience = Total D!@kwad"</li>
<li>Godwin's Law: "As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches one."</li>
<li>Poe's Law: "Without a clear indicator of the author's intent, parodies of extreme views will be mistaken by some readers or viewers as sincere expressions of the parodied views."</li>
<li>Skitt's Law: "Any post correcting an error in another post will contain at least one error itself."</li>
<li>Law of Exclamation: "The more exclamation points used in an email (or other posting), the more likely it is a complete lie."</li>
<li>Cunningham's Law: "The best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer."</li>
<li>The Wiki Rule: "There's a wiki for that."</li>
<li>Danth's Law: "If you have to insist that you've won an Internet argument, you've probably lost badly."</li>
<li>Law of the Echo Chamber: "If you feel comfortable enough to post an opinion of any importance on any given Internet site, you are most likely delivering that opinion to people who already agree with you."</li>
<li>Munroe's Law: "You will never change anyone's opinion on anything by making a post on the Internet. This will not stop you from trying."</li>
</ul></div> 

Can you predict what the following queries will do, and what they will return?

XML::xpathSApply(parsed_doc, "//div[@date='1958']")
XML::xpathSApply(parsed_doc, "//*[contains(text(), '%')]")
XML::xpathSApply(parsed_doc, "//div[starts-with(./@id, 'wiio')]")

A number of commonly-used XPath functions are shown in Figure 17.11.


Figure 17.11: Commonly-used XPath functions [352].

Extracting Node Elements

XPath queries can also extract specific node elements, using the fun argument (xmlValue, xmlAttrs, xmlGetAttr, xmlName, xmlChildren, xmlSize).

For instance, xmlValue returns the node’s value:

XML::xpathSApply(parsed_doc, "//title", fun = XML::xmlValue)
[1] "Laws of the Internet"

xmlAttrs returns the node’s attributes:

XML::xpathSApply(parsed_doc, "//div", XML::xmlAttrs)
[[1]]
       id      lang      date 
   "wiio" "english"    "1978" 

[[2]]
     lang      date 
"english"    "1986" 

[[3]]
     lang      date 
"english"    "1958" 

[[4]]
     id 
"other" 

xmlGetAttr returns a specific attribute:

XML::xpathSApply(parsed_doc, "//div", XML::xmlGetAttr, "lang")
[[1]]
[1] "english"

[[2]]
[1] "english"

[[3]]
[1] "english"

[[4]]
NULL
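In Python's ElementTree, the rough analogues of these extraction functions are the .text, .attrib, and .get() accessors; a hedged sketch on a toy document mimicking the <div> attributes above:

```python
import xml.etree.ElementTree as ET

doc = ('<body><div id="wiio" lang="english" date="1978">x</div>'
       '<div id="other">y</div></body>')
root = ET.fromstring(doc)

print([d.attrib for d in root.iter("div")])       # all attributes, like xmlAttrs
print([d.get("lang") for d in root.iter("div")])  # one attribute, like xmlGetAttr
```

As with xmlGetAttr, .get() signals a missing attribute (here, None rather than NULL).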

17.3.3 Regular Expressions

Regular Expressions can be used to achieve the main web scraping objective, which is to extract relevant information from reams of data. Among this mostly unstructured data lurk systematic elements, which can be used to help the automation process, especially if quantitative methods are eventually going to be applied to the scraped data.

Systematic structures include numbers, names (countries, etc.), addresses (mailing, e-mailing, URLs, etc.), specific character strings, etc. Regular expressions (regexps) are abstract sequences of strings that match concrete recurring patterns in text; they allow for the systematic extraction of the information components from plain text, HTML, and XML.

The examples in this section are based on [356].

Initializing the Environment

The Python module for regular expressions is re.

import re 

Let us take a quick look at some basics, through the re method match(). We can try to match a pattern from the beginning of a string, as below:

re.match('super','supercalifragilisticexpialidocious') 
<re.Match object; span=(0, 5), match='super'>

Notice the difference in the following chunk of code:

re.match('super','Supercalifragilisticexpialidocious') 

The regular expression pattern (more on this in a moment) for “word” is \w+. The following bit of code would match the first word in a string:

w_regex = r'\w+' 
re.match(w_regex,'Hello World!') 
<re.Match object; span=(0, 5), match='Hello'>

Common Regular Expression Patterns

A regular expression pattern is a short form used to indicate a type of (sub)string:
  • \w+: word

  • \d: digit

  • \s: space

  • .: wildcard

  • + or *: greedy match

  • \W: not word

  • \D: not digit

  • \S: not space

  • [a-z]: lower case group

  • [A-Z]: upper case group

 
In Python, regular expression patterns should be written as raw strings (prefixed with an r) so that backslashes reach re verbatim rather than being interpreted as string escape sequences.
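To see why the r prefix matters: Python processes backslash escapes in ordinary string literals before re ever sees the pattern, so a raw string keeps the backslash intact.

```python
import re

# "\n" is a single newline character; r"\n" is two characters, \ and n
print(len("\n"), len(r"\n"))

# "\\d" and r"\d" denote the same two-character pattern, so both work
print(re.findall("\\d", "a1b2") == re.findall(r"\d", "a1b2"))
```
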

There are a few re functions which, combined with regexps, can make it easier to extract information from large, unstructured text documents:

  • split(): splits a string on a regexp;

  • findall(): finds all substrings matching a regexp in a string;

  • search(): searches for a regexp in a string, and

  • match(): matches an entire string based on a regexp

 
Each of these functions takes two arguments: a regexp (first) and a string (second). For instance, we can split a string on the spaces (and remove them):

re.split(r'\s+','Can  you do the split?') 
['Can', 'you', 'do', 'the', 'split?']

The \ in the regexp above is crucial. The following code splits the sentence on the s (and removes them):

re.split('s+','Can  you do the split?') 
['Can  you do the ', 'plit?']

We can also split on single spaces and remove them:

re.split(r'\s','Can  you do the split?') 
['Can', '', 'you', 'do', 'the', 'split?']

Alternatively, we can also split on the words and remove them:

re.split(r'\w+','Can  you do the split?') 
['', '  ', ' ', ' ', ' ', '?']

Or split on the non-words and remove them:

re.split(r'\W+','Can you do the split?') 
['Can', 'you', 'do', 'the', 'split', '']

Let us take some time to study a silly sentence, saved as a string.

test_string = 'Oh they built the built the ship Titanic. It was a mistake. It cost more than 1.5 million dollars. Never again!'
test_string
'Oh they built the built the ship Titanic. It was a mistake. It cost more than 1.5 million dollars. Never again!'

In English, a sentence typically ends in one of three characters: ., ?, !. We create a regexp range (more on those in a moment) as follows:

sent_ends = r"[.?!]" 

We could then split the string into its constituent sentences:

print(re.split(sent_ends,test_string)) 
['Oh they built the built the ship Titanic', ' It was a mistake', ' It cost more than 1', '5 million dollars', ' Never again', '']

To count the pieces produced by the split, we simply use the len() function (note that the period in 1.5 and the trailing empty string inflate the sentence count):

print(len(re.split(sent_ends,test_string)))
6

The regexp range for words beginning with an uppercase letter is:

cap_words = r"[A-Z]\w+" # Upper case characters

We can find all such words (and how many there are in the string) through:

print(re.findall(cap_words,test_string)) 
print(len(re.findall(cap_words,test_string))) 
['Oh', 'Titanic', 'It', 'It', 'Never']
5

The regexp for spaces is:

spaces = r"\s+" # spaces

We can then split the string on spaces, and count the number of tokens (see Text Analysis and Text Mining):

print(re.split(spaces,test_string)) 
print(len(re.split(spaces,test_string))) 
['Oh', 'they', 'built', 'the', 'built', 'the', 'ship', 'Titanic.', 'It', 'was', 'a', 'mistake.', 'It', 'cost', 'more', 'than', '1.5', 'million', 'dollars.', 'Never', 'again!']
21

The regexp for numbers (contiguous strings of digits) is:

numbers = r"\d+"

We can find all runs of digits using:

print(re.findall(numbers,test_string)) 
print(len(re.findall(numbers,test_string))) 
['1', '5']
2
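1.5 shows up as two separate matches because the period is not a digit; a pattern allowing an optional decimal part (an illustration, not from the original) recovers it whole:

```python
import re

test_string = 'Oh they built the built the ship Titanic. It was a mistake. It cost more than 1.5 million dollars. Never again!'
decimals = r"\d+(?:\.\d+)?"  # digits, optionally followed by a period and more digits
print(re.findall(decimals, test_string))
```
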

The main difference between search() and match() is that match() tries to match from the beginning of a string, whereas search() looks for a match anywhere in the string.
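A quick illustration of that difference:

```python
import re

m1 = re.match(r"ship", "the ship Titanic")   # anchored at the start: no match
m2 = re.search(r"ship", "the ship Titanic")  # looks anywhere in the string
print(m1)
print(m2.span())
```
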

17.3.3.1 Regular Expression Groups ( ) and Ranges [ ] With OR |

We can create more complicated regexps using groups, ranges, and/or “or” statements:

  • [a-zA-Z]+: an unlimited number of lower and upper case English/French (unaccented) letters;

  • [0-9]: the digits from 0 to 9;

  • [a-zA-Z'\.\-]+: any combination of lower and upper case English/French (unaccented) letters, ', ., and -;

  • (a-z): the literal sequence a-z (the parentheses form a group, not a character range);

  • (\s+|,): any number of spaces, or a comma;

  • (\d+|\w+): numbers or words

 
For instance, consider the following text string and regexps groups:

text = 'On the 1st day of xmas, my boat sank.'
numbers_or_words = r"(\d+|\w+)"
spaces_or_commas = r"(\s+|,)"

What do we expect the following chunk of code to do?

print(re.findall(numbers_or_words,text))
['On', 'the', '1', 'st', 'day', 'of', 'xmas', 'my', 'boat', 'sank']

What about this one?

print(re.findall(spaces_or_commas,text))
[' ', ' ', ' ', ' ', ' ', ',', ' ', ' ', ' ']

Now, consider a different string:

text = "will something happen after the semi-colon; I don't think so"

What might happen in each of the following cases?

print(re.match(r"[a-z -]+",text)) 
print(re.match(r"[a-z ]+",text)) 
print(re.match(r"[a-z]+",text)) 
print(re.match(r"(a-z-)+",text)) 

17.3.4 Beautiful Soup

Simple web requests require some networking code to fetch a page and return the HTML contents.

Browsers do a lot of work to intelligently parse improper HTML syntax (up to a certain point, of course), so that something like <a href="data-action-lab.com> <b>link text<a> </b>, say, would be correctly interpreted as <a href="data-action-lab.com"> <b>link text</b></a>.

Beautiful Soup (BS) is a Python library that helps extract data out of HTML and XML files; it parses HTML files, even if they are broken. But BS does not simply convert bad HTML to good X/HTML; it allows a user to fully inspect the (proper) HTML structure it produces, in a programmatic fashion.
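Under the hood, such tools build on tolerant tokenization; Python's standard-library html.parser module can be used for a bare-bones version of link extraction (a sketch, not BS itself, on an invented snippet):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags, even in sloppy HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

p = LinkCollector()
p.feed('<body><a href="https://example.com"><b>link text</b></a>')  # no </body>
print(p.links)
```
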

Typical HTML elements to be extracted/read come in various formats, such as
  • text

  • tables

  • form field values

  • images

  • videos

  • etc.

 
When BS has finished its work on an HTML file, the resulting soup is an API for traversing, searching, and reading the document’s elements. In essence, it provides idiomatic ways of navigating, searching, and modifying the parse tree of the HTML file, which can save a fair amount of time.

For instance, soup.find_all('a') would find and output all <a ...> ... </a> tag pairs (with attributes and content) in the soup, whereas the following chunk of code would output the URLs found in the same tag pairs.

for link in soup.find_all('a'):
    print(link.get('href'))

The Beautiful Soup documentation is quite explicit and provides numerous examples [357]. We use the lyrics to Meet the Elements, a song by They Might Be Giants, to illustrate Beautiful Soup’s functionality.

html_doc = """
<html><head><title>Meet the Elements</title> <meta name="author" content="They Might Be Giants"></head>
<body><p class="title"><b>Meet the Elements</b></p>

<p class="author"><i>They Might Be Giants</i></p>

<div class="lyrics"><p class="verse" id="verse1"><a href="https://en.wikipedia.org/wiki/Iron" class="element" id="link1">Iron</a> is a metal, you see it every day<br>
<a href="https://en.wikipedia.org/wiki/Oxygen" class="element" id="link2">Oxygen</a>, eventually, will make it rust away<br>
<a href="https://en.wikipedia.org/wiki/Carbon" class="element" id="link3">Carbon</a> in its ordinary form is coal<br>
Crush it together, and diamonds are born</p>

<p class="chorus" id="chorus1">Come on, come on, and meet the elements <br>
May I introduce you to our friends, the elements? <br>
Like a box of paints that are mixed to make every shade <br>
They either combine to make a chemical compound or stand alone as they are</p>

<p class="verse" id="verse2"><a href="https://en.wikipedia.org/wiki/Neon" class="element" id="link4">Neon</a>'s a gas that lights up the sign for a pizza place <br>
The coins that you pay with are <a href="https://en.wikipedia.org/wiki/Copper" class="element" id="link5">copper</a>, <a href="https://en.wikipedia.org/wiki/Nickel" class="element" id="link6">nickel</a>, and <a href="https://en.wikipedia.org/wiki/Zinc" class="element" id="link7">zinc</a> <br>
<a href="https://en.wikipedia.org/wiki/Silicon" class="element" id="link8">Silicon</a> and oxygen make concrete bricks and glass <br>
Now add some <a href="https://en.wikipedia.org/wiki/Gold" class="element" id="link9">gold</a> and <a href="https://en.wikipedia.org/wiki/Silver" class="element" id="link10">silver</a> for some pizza place class</p>

<p class="chorus" id="chorus2">Come on, come on, and meet the elements <br>
I think you should check out the ones they call the elements <br>
Like a box of paints that are mixed to make every shade <br>
They either combine to make a chemical compound or stand alone as they are <br>
Team up with other elements making compounds when they combine <br>
Or make up a simple element formed out of atoms of the one kind </p>

<p class="verse" id="verse3">Balloons are full of <a href="https://en.wikipedia.org/wiki/Helium" class="element" id="link11">helium</a>, and so is every star <br>
Stars are mostly <a href="https://en.wikipedia.org/wiki/Hydrogen" class="element" id="link12">hydrogen</a>, which may someday fill your car <br>
Hey, who let in all these elephants? <br>
Did you know that elephants are made of elements? <br>
Elephants are mostly made of four elements <br>
And every living thing is mostly made of four elements <br>
Plants, bugs, birds, fish, bacteria and men <br>
Are mostly carbon, hydrogen, <a href="https://en.wikipedia.org/wiki/Nitrogen" class="element" id="link13">nitrogen</a>, and oxygen</p>

<p class="chorus" id="chorus3">Come on, come on, and meet the elements <br>
You and I are complicated, but we're made of elements <br>
Like a box of paints that are mixed to make every shade <br>
They either combine to make a chemical compound or stand alone as they are <br>
Team up with other elements making compounds when they combine <br>
Or make up a simple element formed out of atoms of the one kind <br> 
Come on come on and meet the elements <br>
Check out the ones they call the elements <br>
Like a box of paints that are mixed to make every shade <br>
They either combine to make a chemical compound or stand alone as they are</p>

</div>
"""

Note that the HTML file contains neither a </body> tag nor a </html> tag.

We import the BeautifulSoup module, and parse the file into a soup using the html.parser.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
<html>
 <head>
  <title>
   Meet the Elements
  </title>
  <meta content="They Might Be Giants" name="author"/>
 </head>
 <body>
  <p class="title">
   <b>
    Meet the Elements
   </b>
  </p>
  <p class="author">
   <i>
    They Might Be Giants
   </i>
  </p>
  <div class="lyrics">
   <p class="verse" id="verse1">
    <a class="element" href="https://en.wikipedia.org/wiki/Iron" id="link1">
     Iron
    </a>
    is a metal, you see it every day
    <br/>
    <a class="element" href="https://en.wikipedia.org/wiki/Oxygen" id="link2">
     Oxygen
    </a>
    , eventually, will make it rust away
    <br/>
    <a class="element" href="https://en.wikipedia.org/wiki/Carbon" id="link3">
     Carbon
    </a>
    in its ordinary form is coal
    <br/>
    Crush it together, and diamonds are born
   </p>
   <p class="chorus" id="chorus1">
    Come on, come on, and meet the elements
    <br/>
    May I introduce you to our friends, the elements?
    <br/>
    Like a box of paints that are mixed to make every shade
    <br/>
    They either combine to make a chemical compound or stand alone as they are
   </p>
   <p class="verse" id="verse2">
    <a class="element" href="https://en.wikipedia.org/wiki/Neon" id="link4">
     Neon
    </a>
    's a gas that lights up the sign for a pizza place
    <br/>
    The coins that you pay with are
    <a class="element" href="https://en.wikipedia.org/wiki/Copper" id="link5">
     copper
    </a>
    ,
    <a class="element" href="https://en.wikipedia.org/wiki/Nickel" id="link6">
     nickel
    </a>
    , and
    <a class="element" href="https://en.wikipedia.org/wiki/Zinc" id="link7">
     zinc
    </a>
    <br/>
    <a class="element" href="https://en.wikipedia.org/wiki/Silicon" id="link8">
     Silicon
    </a>
    and oxygen make concrete bricks and glass
    <br/>
    Now add some
    <a class="element" href="https://en.wikipedia.org/wiki/Gold" id="link9">
     gold
    </a>
    and
    <a class="element" href="https://en.wikipedia.org/wiki/Silver" id="link10">
     silver
    </a>
    for some pizza place class
   </p>
   <p class="chorus" id="chorus2">
    Come on, come on, and meet the elements
    <br/>
    I think you should check out the ones they call the elements
    <br/>
    Like a box of paints that are mixed to make every shade
    <br/>
    They either combine to make a chemical compound or stand alone as they are
    <br/>
    Team up with other elements making compounds when they combine
    <br/>
    Or make up a simple element formed out of atoms of the one kind
   </p>
   <p class="verse" id="verse3">
    Balloons are full of
    <a class="element" href="https://en.wikipedia.org/wiki/Helium" id="link11">
     helium
    </a>
    , and so is every star
    <br/>
    Stars are mostly
    <a class="element" href="https://en.wikipedia.org/wiki/Hydrogen" id="link12">
     hydrogen
    </a>
    , which may someday fill your car
    <br/>
    Hey, who let in all these elephants?
    <br/>
    Did you know that elephants are made of elements?
    <br/>
    Elephants are mostly made of four elements
    <br/>
    And every living thing is mostly made of four elements
    <br/>
    Plants, bugs, birds, fish, bacteria and men
    <br/>
    Are mostly carbon, hydrogen,
    <a class="element" href="https://en.wikipedia.org/wiki/Nitrogen" id="link13">
     nitrogen
    </a>
    , and oxygen
   </p>
   <p class="chorus" id="chorus3">
    Come on, come on, and meet the elements
    <br/>
    You and I are complicated, but we're made of elements
    <br/>
    Like a box of paints that are mixed to make every shade
    <br/>
    They either combine to make a chemical compound or stand alone as they are
    <br/>
    Team up with other elements making compounds when they combine
    <br/>
    Or make up a simple element formed out of atoms of the one kind
    <br/>
    Come on come on and meet the elements
    <br/>
    Check out the ones they call the elements
    <br/>
    Like a box of paints that are mixed to make every shade
    <br/>
    They either combine to make a chemical compound or stand alone as they are
   </p>
  </div>
 </body>
</html>

The parser has “fixed” the file by appending the missing tags.

BeautifulSoup Functionality

Is the functionality of BeautifulSoup clear from the following examples? (Here, soup denotes the parsed document from above.)

>>> print(soup.title)
<title>Meet the Elements</title>
>>> print(soup.title.name)
title
>>> print(soup.title.string)
Meet the Elements
>>> print(soup.title.parent.name)
head
>>> print(soup.p)
<p class="title"><b>Meet the Elements</b></p>
>>> soup.p['class']
['title']
>>> print(soup.a)
<a class="element" href="https://en.wikipedia.org/wiki/Iron" id="link1">Iron</a>
>>> soup.find_all('a')
[<a class="element" href="https://en.wikipedia.org/wiki/Iron" id="link1">Iron</a>, <a class="element" href="https://en.wikipedia.org/wiki/Oxygen" id="link2">Oxygen</a>, <a class="element" href="https://en.wikipedia.org/wiki/Carbon" id="link3">Carbon</a>, <a class="element" href="https://en.wikipedia.org/wiki/Neon" id="link4">Neon</a>, <a class="element" href="https://en.wikipedia.org/wiki/Copper" id="link5">copper</a>, <a class="element" href="https://en.wikipedia.org/wiki/Nickel" id="link6">nickel</a>, <a class="element" href="https://en.wikipedia.org/wiki/Zinc" id="link7">zinc</a>, <a class="element" href="https://en.wikipedia.org/wiki/Silicon" id="link8">Silicon</a>, <a class="element" href="https://en.wikipedia.org/wiki/Gold" id="link9">gold</a>, <a class="element" href="https://en.wikipedia.org/wiki/Silver" id="link10">silver</a>, <a class="element" href="https://en.wikipedia.org/wiki/Helium" id="link11">helium</a>, <a class="element" href="https://en.wikipedia.org/wiki/Hydrogen" id="link12">hydrogen</a>, <a class="element" href="https://en.wikipedia.org/wiki/Nitrogen" id="link13">nitrogen</a>]
>>> print(soup.find(id="link5"))
<a class="element" href="https://en.wikipedia.org/wiki/Copper" id="link5">copper</a>
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
https://en.wikipedia.org/wiki/Iron
https://en.wikipedia.org/wiki/Oxygen
https://en.wikipedia.org/wiki/Carbon
https://en.wikipedia.org/wiki/Neon
https://en.wikipedia.org/wiki/Copper
https://en.wikipedia.org/wiki/Nickel
https://en.wikipedia.org/wiki/Zinc
https://en.wikipedia.org/wiki/Silicon
https://en.wikipedia.org/wiki/Gold
https://en.wikipedia.org/wiki/Silver
https://en.wikipedia.org/wiki/Helium
https://en.wikipedia.org/wiki/Hydrogen
https://en.wikipedia.org/wiki/Nitrogen
>>> print(soup.get_text())

Meet the Elements 
Meet the Elements
They Might Be Giants
Iron is a metal, you see it every day
Oxygen, eventually, will make it rust away
Carbon in its ordinary form is coal
Crush it together, and diamonds are born
Come on, come on, and meet the elements 
May I introduce you to our friends, the elements? 
Like a box of paints that are mixed to make every shade 
They either combine to make a chemical compound or stand alone as they are
Neon's a gas that lights up the sign for a pizza place 
The coins that you pay with are copper, nickel, and zinc 
Silicon and oxygen make concrete bricks and glass 
Now add some gold and silver for some pizza place class
Come on, come on, and meet the elements 
I think you should check out the ones they call the elements 
Like a box of paints that are mixed to make every shade 
They either combine to make a chemical compound or stand alone as they are 
Team up with other elements making compounds when they combine 
Or make up a simple element formed out of atoms of the one kind 
Balloons are full of helium, and so is every star 
Stars are mostly hydrogen, which may someday fill your car 
Hey, who let in all these elephants? 
Did you know that elephants are made of elements? 
Elephants are mostly made of four elements 
And every living thing is mostly made of four elements 
Plants, bugs, birds, fish, bacteria and men 
Are mostly carbon, hydrogen, nitrogen, and oxygen
Come on, come on, and meet the elements 
You and I are complicated, but we're made of elements 
Like a box of paints that are mixed to make every shade 
They either combine to make a chemical compound or stand alone as they are 
Team up with other elements making compounds when they combine 
Or make up a simple element formed out of atoms of the one kind  
Come on come on and meet the elements 
Check out the ones they call the elements 
Like a box of paints that are mixed to make every shade 
They either combine to make a chemical compound or stand alone as they are

17.3.5 Selenium

Selenium is a browser-automation tool (with bindings for Python, among other languages). It is used primarily for testing purposes, but it has data extraction uses as well. Mainly, it allows the user to open a browser and to act as a human being would:

  • clicking buttons;

  • entering information in forms;

  • searching for specific information on a page, etc.

Selenium requires a driver to interface with the chosen browser: Firefox, for example, uses geckodriver, and Chrome uses chromedriver. Download links for the supported browsers' drivers are listed in the official Selenium documentation.

Selenium automatically controls a complete browser, including rendering the web documents and running JavaScript. This is useful for pages with a lot of dynamic content that is absent from the base HTML. Selenium can be programmed to perform actions like “click on this button” or “type this text”, providing access to the dynamic HTML of the page's current state, not unlike what happens in Developer Tools (but now the process can be fully automated). More information can be found in [358], [360].

17.3.6 APIs

An application programming interface (API) is a website’s way of giving programs access to its data without the need for scraping. APIs provide structured access to structured data: not every bit of information will necessarily be made available to analysts.

For example, a finance site might offer an API with financial aggregate data, the New York Times might offer an API for news articles from a specific time period, Twitter might offer an API to collect tweets by users or hashtags, etc.

In all cases, the data will be available in a pre-defined, structured format (often JSON).

In the examples, the APIs we consider have R/Python libraries that encapsulate all of the required networking and encoding. This means that users only need to read the library documentation to get a sense of what needs to be done to obtain the data.334
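As a minimal sketch of what working with an API response looks like once the networking is done: the payload typically arrives as JSON text, which Python's standard library parses directly into native data structures. (The payload below is hypothetical, not from a real endpoint.)

```python
import json

# A hypothetical JSON payload, shaped like what a song-metadata API might
# return; the fields and values are purely illustrative.
response_text = """
{
  "artist": "They Might Be Giants",
  "title": "Meet the Elements",
  "elements": ["Iron", "Oxygen", "Carbon", "Neon"]
}
"""

data = json.loads(response_text)   # parse JSON text into a Python dict
print(data["title"])               # Meet the Elements
print(len(data["elements"]))       # 4
```

A real client would obtain response_text over HTTP, but the parsing step is the same.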

17.3.7 Specialized Uses and Applications

Although we will not be discussing them in these notes, it could prove useful for web scrapers to learn how to handle:

HTML Forms

Sometimes we do not just want to receive data from the server; we also want to send data, such as a username/password combination used to log in to a site. Other input types include check boxes, radio buttons, hidden inputs, etc. Real users accomplish this by filling out forms and submitting them to the server. When this happens, the browser looks at the form HTML and sends a request with the user inputs as parameters; the server can use those parameters to send back different data.
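The browser's form-submission step can be sketched with Python's standard library: the form inputs are encoded as name/value parameters and attached to the request body. (The URL and field names below are hypothetical, and no request is actually sent.)

```python
from urllib.parse import urlencode
from urllib.request import Request

# Encode form inputs as the browser would for a submitted form.
params = urlencode({"username": "alice", "password": "secret"})

# Attaching a request body turns the request into a POST, which is the
# usual method for login forms.
req = Request("https://example.com/login", data=params.encode("utf-8"))

print(params)             # username=alice&password=secret
print(req.get_method())   # POST
```

Sending the request (e.g. with urllib.request.urlopen or the requests library) would then return whatever page the server produces for those inputs.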

Encoding

What if we wanted to write an HTML tag such as “<b>” as text in an HTML file? If we just type it in as-is, it would be interpreted as an HTML tag, not as text. The solution is to use HTML encoding: reserved characters are written as entities, so that “<” becomes “&lt;” and “>” becomes “&gt;”. Online HTML encoders/decoders can perform the conversion.

Combination

HTML forms can specify GET as the method instead of POST. In that case, the parameters are appended to the URL after a “?”, like so: http://search.yahoo.com/search/?p=data+analysis&lang=en. In that example, the parameter names are p and lang. The parameter value data+analysis actually represents the string “data analysis”: spaces get encoded in URLs, as do other special characters (such as “/”). Tools like https://www.urlencoder.org produce the correctly encoded strings.
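The query string from the example above can be built programmatically; urllib.parse handles the encoding of spaces and other special characters:

```python
from urllib.parse import urlencode

# Encode the GET parameters and append them to the base URL after "?".
# urlencode() handles the space-to-"+" encoding automatically.
query = urlencode({"p": "data analysis", "lang": "en"})
url = "http://search.yahoo.com/search/?" + query

print(url)   # http://search.yahoo.com/search/?p=data+analysis&lang=en
```

Scrapers often use this to generate many search URLs from a list of query terms.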

References

[352]
S. Munzert, C. Rubba, P. Meiner, and D. Nyhuis, Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, 2nd ed. Wiley Publishing, 2015.
[355]
[356]
K. Jarmul, “Natural Language Processing Fundamentals in Python.” DataCamp.
[357]
[358]
[360]