Crawling the HTML Web
The Web for Machines
The web was invented as a way of publishing documents for people to read. The HTTP protocol was defined as a lightweight way of communicating between a browser and a server to request documents and later to submit form data to servers. However, fairly soon after the web broke out of the CERN labs, people started writing programs to access web documents. Some of the first programs were web crawlers that could download whole chunks of the web so that you could read them offline (remember this was when we didn't have smart phones or even widespread home Internet). Later the search engines started to crawl the web to index it and help people find information. Before too long people started to scrape web pages for some of the data they contained - events, inventory, contact lists - so they could load it into a regular database and query it. Programmers were treating the web as a source of data and using HTTP to access it.
To act as a web client, a program needs to be able to make HTTP requests. So far we've seen how to write Python code to accept requests using the Bottle framework. Python is also well suited to making requests and has the urllib module to support quite complex interactions with a web server. Reading the contents of a web page can be very simple: we use the urlopen function, which behaves just like opening a file but takes a URL instead of a filename as its argument. Here's an example of reading a web page from a given URL and returning the content as a Python string:
from urllib.request import urlopen

def get_page(url):
    """Get the text of the web page at the given URL
    return a string containing the content"""

    fd = urlopen(url)
    content = fd.read()
    fd.close()
    return content.decode('utf8')
Note that the last line in this function converts the binary data returned from the call to read into a regular Python string using the UTF-8 encoding. This is necessary if we want to do any string processing on the data that is returned.
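As a small illustration with made-up data, the value returned by read is a bytes object, and string operations only become available once it is decoded:

raw = b'<html><body>Hello</body></html>'   # the kind of value read() returns
page = raw.decode('utf8')                  # now a regular Python string
print(type(raw), type(page))               # <class 'bytes'> <class 'str'>
print('Hello' in page)                     # True - string operations now work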
This function could be used to 'download' a copy of a page for offline viewing, but it could also be the basis of a simple 'web crawler'. A web crawler is an application that follows links in web pages, and by doing so gets access to a large number of web pages. It can be used as the basis of a search engine or for downloading large chunks of web pages for reading or analysis. The core idea is to first get the web page at a URL, then scan the HTML page for links to other pages which you then add to a growing list of URLs. The first job then is to write a function to find the URLs in an HTML page. We'll look at a more sophisticated way to do this shortly, but a simple approach is to use a regular expression. Here's some code that uses a regular expression to find URLs after first getting the text of the page:
import re

def get_links(url):
    """Scan the text for http URLs and return a set
    of URLs found, without duplicates"""

    text = get_page(url)
    # look for any http URL in the page
    links = set()
    urlpattern = r"(https?\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/[^<'\";]*)?)"
    matches = re.findall(urlpattern, text)
    for match in matches:
        # each match is a tuple of groups; the first is the full URL
        links.add(match[0])
    return links
This function uses a Python set rather than a list as we only want to keep one copy of each URL we collect. The URL pattern is a reasonably good one but probably doesn't catch all URLs in a page. In particular it won't get any relative URLs that don't start with http. We use the re.findall function to find all matches for the pattern in the page; because the pattern contains two groups, each match is returned as a tuple, so we take the first element (the full URL) and add it to the result set.
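To see this in action, here's a quick sketch run over a made-up fragment of HTML; notice that each match is a tuple and the first element is the complete URL:

import re

urlpattern = r"(https?\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/[^<'\";]*)?)"
sample = '<a href="http://www.python.org/about/">About</a> <a href="https://docs.python.org/3/">Docs</a>'
print(re.findall(urlpattern, sample))
# [('http://www.python.org/about/', '/about/'), ('https://docs.python.org/3/', '/3/')]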
The next step in writing a web crawler is just to call get_links
repeatedly to retrieve the links from a page, then choose a new URL and
repeat. If we did this without limit, the program would keep going for a
long time since we are never going to run out of links. So, as a
demonstration, here's a function that will keep collecting links until
it has more than a pre-determined number.
def crawl(url, maxurls=100):
    """Starting from this URL, crawl the web until
    you have collected maxurls URLs, then return them
    as a set"""

    urls = set([url])
    while len(urls) < maxurls:
        # remove a URL at random
        url = urls.pop()
        print("URL: ", url)
        links = get_links(url)
        urls.update(links)
        # add the url back to the set
        urls.add(url)
    return urls
Note that we use the pop() method on the set to remove an arbitrary URL, which we then add back to the set later so that it is included in the final result. We print out a progress message each time around the loop just so we can see which pages are being requested. To tie this all together we can call the crawl function on a starting URL and print out the resulting set of links:
if __name__ == '__main__':
    url = 'http://www.python.org/'
    links = crawl(url, 100)
    print("Collected ", len(links), " links:")
    for link in links:
        print(link)
When I run this it fetches four or five pages before the collection grows past the limit of 100 URLs and the function returns. If you try this please take care that you don't set the limit too high. You are generating automated traffic on the web which can cause a heavy load on a server if left unchecked - your script may be able to make many hundreds of requests a second, many more than you would make from a browser. For this reason, your script could be seen as an attempted attack on a site if it makes a large number of requests. Please use care when running web crawling scripts.
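One simple courtesy, if you do experiment with a crawler, is to pause between requests. Here's a sketch of how the crawl function above could be slowed down using the standard time module; the one second delay is just an arbitrary choice:

import time

def polite_crawl(url, maxurls=100, delay=1.0):
    """A version of crawl that waits between requests
    so that we don't hammer any one server"""

    urls = set([url])
    while len(urls) < maxurls:
        url = urls.pop()
        print("URL: ", url)
        links = get_links(url)
        urls.update(links)
        urls.add(url)
        # pause before making the next request
        time.sleep(delay)
    return urls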
This crawler isn't really doing anything very useful but it is able to find as many web pages as you want as long as they are linked to the starting page. To do something useful, we need to do a bit more work to understand the pages that we are retrieving - rather than just scanning for URLs. Understanding the pages might also help with finding those relative URLs that we ignored in this version; so let's look at parsing HTML to extract information.
Parsing HTML
HTML is a formal language: there are standards that define how HTML pages should be structured, the names of the elements that can be used and how they can be nested within each other. This means that we should be able to parse a page into an internal representation that a program can work with. This is exactly what your web browser must do to render the page on the screen - parse the HTML to find the title, headings, paragraphs and lists and then apply the CSS stylesheet to determine the layout. Our requirement here is to understand the page structure so that extracting data from it is more reliable.
An HTML page can be seen as a kind of tree structure. It has one top
level element <html>
that contains other elements (<head>
and
<body>
) which in turn contain other elements. The start and end tags
for these elements are supposed to be properly nested. Following this,
it is common for an HTML parser to scan the text of the page and build a
tree structure in memory. In many systems this is implemented as a
nested collection of objects and the representation is called the
Document Object Model or DOM. Once the page has been parsed into the
DOM, the programmer can find things within it. For example, find all of
the headings, find the first paragraph after the main heading or find
all of the anchor (<a>
) tags in the document. The DOM parser takes
care of all of the details of the HTML tags in the document and allows
the programmer to work on a more logical structure.
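Python's standard library includes a low-level parser, html.parser, that shows what the first stage of this process looks like: it scans the text and reports each start tag, end tag and run of text, and it would then be up to us to assemble those events into a tree. Here's a small sketch that simply reports every anchor tag it sees:

from html.parser import HTMLParser

class AnchorLister(HTMLParser):
    """Print the attributes of every anchor tag found while scanning"""

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            print('anchor with attributes:', attrs)

parser = AnchorLister()
parser.feed('<html><body><p>See <a href="/about/">the about page</a>.</p></body></html>')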
There are DOM based HTML parsers in many languages. Perhaps the most notable is JavaScript, where the DOM is used to expose the internal structure of the current HTML page to the in-page JavaScript code. So a script that wanted to change the text of the main heading could write:
headings = document.getElementsByTagName('h1');
headings[0].innerHTML = "Hello World";
In Python, there are DOM based parsers available but the best interface for HTML parsing is in a module called Beautiful Soup. The name of this module comes from the characterisation of HTML as found on the web as Tag Soup rather than well structured, standards compliant documents. Since anyone can publish on the web and since the quality of software used to generate pages varies so much, we can't rely on HTML pages sticking to the rules. Beautiful Soup is an HTML parser written to be able to do a reasonable job on even poorly formatted HTML text. The interface provided by Beautiful Soup is similar to the common DOM interface but has some more convenience methods.
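As a quick illustration of the tag soup point, here's a sketch showing Beautiful Soup coping with a fragment that has several missing end tags; it still builds a usable structure that we can query:

from bs4 import BeautifulSoup

# a sloppy fragment: the <p> and <b> tags are never closed
text = "<html><body><h1>Tag soup</h1><p>A paragraph with <b>no closing tags"
soup = BeautifulSoup(text, 'html.parser')
print(soup.h1.string)         # Tag soup
print(soup.find('b').string)  # no closing tags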
A Better Crawler
Let's look at a new version of the get_links function to make the web crawler a bit smarter. This version uses Beautiful Soup to find all of the anchor elements in the page and collects the href attribute of each.
from bs4 import BeautifulSoup

def get_links(url):
    """Scan the page at the given URL for http URLs and return a set
    of URLs found, without duplicates"""

    text = get_page(url)
    links = set()
    soup = BeautifulSoup(text, 'html.parser')
    # find every anchor element in the page
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            links.add(link.attrs['href'])
    return links
Here we use the find_all
method on the soup object to find all of the
anchor elements in the document. This returns a list of element objects
that we then iterate over. If the anchor element has an href
attribute
then we add it to the link set. This is a much neater solution than the
regular expression code and it will be much more reliable; all of the
complexity of the HTML is dealt with by Beautiful Soup.
The results from this code are a little different to before; since we
are harvesting all anchor links, we get relative as well as absolute
URLs. Here's a sample of the results from crawling
http://www.python.org/
:
/events/python-events/past/
/downloads/
http://trac.edgewall.org/
http://blog.python.org
/accounts/signup/
https://wiki.python.org/moin/
http://www.pygtk.org
#content
/events/python-user-group/past/
#top
In addition to the http URLs we have relative URLs starting with / and page-internal links starting with #. We can't just add these to the list of links; we need to resolve them to full URLs first. Page-internal
links just refer to parts of the same page so there is no value in
adding them to the list of links. URLs starting with /
are on the same
site as the original page so we can turn them into full URLs by adding
the domain part of the original URL. So /downloads/
here becomes
http://www.python.org/downloads/
. There is a function in urllib
to
do this for us; urllib.parse.urljoin
takes an absolute URL and a
relative one and returns the combined URL. So, we can modify get_links
to make use of this as follows:
from urllib.parse import urljoin

def get_links(url):
    """Scan the page at the given URL for http URLs and return a set
    of URLs found, without duplicates"""

    links = set()
    text = get_page(url)
    soup = BeautifulSoup(text, 'html.parser')
    # look at every anchor element in the page
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            newurl = link.attrs['href']
            # resolve relative URLs
            if newurl.startswith('/'):
                newurl = urljoin(url, newurl)
            # ignore any URL that doesn't now start with http
            if newurl.startswith('http'):
                links.add(newurl)
    return links
The code rejects any link that hasn't been resolved to an http URL; this deals with special cases such as JavaScript URLs and the page-internal # URLs.
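To make the urljoin behaviour concrete, here's a short sketch using a made-up base URL; note that urljoin also handles relative paths that don't start with /, which the code above ignores, and leaves absolute URLs untouched:

from urllib.parse import urljoin

base = 'http://www.python.org/events/python-events/'
print(urljoin(base, '/downloads/'))            # http://www.python.org/downloads/
print(urljoin(base, 'past/'))                  # http://www.python.org/events/python-events/past/
print(urljoin(base, 'http://blog.python.org')) # http://blog.python.org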
Extracting Information from a Page
Now that we can parse the page structure there are more options open to us in extracting information from web pages. A very common requirement is to find useful information in a page that could be stored in a relational database for later query. This is known as screen scraping since we are scraping structured data from the HTML designed to be displayed on screen.
As an example, let's try to extract information about assignments from the COMP249 Unit Guide. The unit guide is generated by an application that uses a standard template for generating pages; this means that every unit guide has the same structure and we can rely on this structure to find the information we want. Looking at the HTML source for the appropriate section we have:
<section id="assessment-tasks-section" class="section">
<h2>Assessment Tasks</h2>
<table class="table table-striped">
<thead>
<tr>
<th>Name</th>
<th>Weighting</th>
<th>Due</th>
</tr>
</thead>
<tbody>
<tr class="assessment-task-row">
<td class="title">
<a href="#assessment_task_77482">Web Application Design</a>
</td>
<td class="weighting">
5%
</td>
<td class="due">
Week 4
</td>
</tr>
...
</tbody>
</table>
...
The HTML is very well structured using the HTML-5 section
tag to
define sections and with a unique id assessment-tasks-section
marking
the section we're interested in. We want to find this section and then
find the first table within the section. Then for each row in the body
of the table, we can get the three columns that correspond to the
assignment name, weighting and due date. Here is some code that will do
this using Beautiful Soup:
def find_assignments(text):
    """Given the text of an HTML unit guide from
    Macquarie University, return a list of the assignments for
    that unit as (name, weighting, due)"""

    soup = BeautifulSoup(text, 'html.parser')
    # find the section that has the assignment table
    section = soup.find(id='assessment-tasks-section')
    # we want the first table in this section, then the tbody inside that
    # and we want all of the assessment task rows in the body
    tablerows = section.table.tbody.find_all('tr', 'assessment-task-row')
    result = []
    for row in tablerows:
        name = row.find('td', 'title').a.string.strip()
        weight = row.find('td', 'weighting').string.strip()
        due = row.find('td', 'due').string.strip()
        result.append((name, weight, due))
    return result
This example shows how to navigate the HTML document using the object
attributes provided by Beautiful Soup. The find
and find_all
methods
allow us to search the document tree for matching elements. The notation section.table refers to the first table inside the section. Finally we
extract the content of the table cells using
row.find('td', 'due').string
which finds the td
element with class
due
inside the row and returns the content of the element as a string.
Note that for the name, we need to look one level deeper inside the
anchor element to get the string content. Running this code over the
unit guide linked
above
gives us:
[('Web Application Design', '5%', 'Week 4'),
('Workshop exams', '16%', 'Week 3, 6, 9, 12'),
('Web Application', '20%', 'Weeks 7 and 10'),
('Report', '14%', 'Week 12'),
('Exam', '45%', 'TBA')]
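As suggested earlier, the point of scraping data like this is usually to load it into a regular database for later query. Here's a sketch of how the tuples returned by find_assignments could be stored using the standard sqlite3 module; the database filename and table name are just made up for the example:

import sqlite3

def store_assignments(rows, dbname='assignments.db'):
    """Store (name, weighting, due) tuples in an SQLite database"""

    db = sqlite3.connect(dbname)
    db.execute("""CREATE TABLE IF NOT EXISTS assignments
                  (name TEXT, weighting TEXT, due TEXT)""")
    db.executemany("INSERT INTO assignments VALUES (?, ?, ?)", rows)
    db.commit()
    db.close()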
Screen scraping is a very useful technique for extracting information from web pages but it relies heavily on the structure of the web page being stable. If the University were to redesign the layout of the unit guide, the code may fail to find the relevant information and would need to be re-written. If you also wanted to get information from unit guides at other Universities, you would need to write a special extraction function for each one and maintain them all separately. This can clearly be done but is a fragile and labour-intensive process, especially when we realise that the original source of this data was a relational database. This brings us to thinking that there should be a better way to make data available for machine consumption over the web. We'll look at some solutions to that problem in the next chapter.
Controlling The Robots
We've seen in this chapter how to write scripts to automate access to
web resources. As mentioned earlier, this is potentially quite annoying
and can in fact be dangerous as we can make many hundreds of requests
per second to a server that could become overloaded. Recall that the
server may just be serving static pages but these days it is quite
likely that it is generating content from a database. This means that
the server needs to do some work for each request and thousands of
requests may slow down a server. In some cases, generating a page can take a lot of resources - imagine a page that displays up-to-the-second weather readings and needs to query multiple remote sources to generate each response. For this reason, we need some way to tell remote users that it is not ok to automatically retrieve pages from a site with a script.
The
robots.txt
file is the mechanism that is used to do this.
The robots.txt file is placed in the document root of a web server so that it is available at a well-known URL, e.g. http://example.com/robots.txt. It
contains a set of statements that define which URLs on a site should or
should not be accessed by one or more crawlers (robots). An example
robots.txt
file (retrieved from http://www.mq.edu.au/robots.txt) is:
User-agent: SiteCheck-sitecrawl by Siteimprove.com
Allow: /images/
User-agent: *
Disallow: /archive/
Disallow: /cgi-bin/
Disallow: /Finance_Handbook/
Disallow: /images/
This example first has a rule for the robot that identifies itself as "SiteCheck-sitecrawl by Siteimprove.com" saying that it is ok for it to access any URL starting with /images/. There is then a rule for all other user agents that lists a number of URL prefixes that should not be touched, including /images/. The effect is that SiteCheck is given special permission to look in /images/ but no other robot can.
An important point to note is that the rules in the robots.txt
file
are entirely voluntary. In the scripts we wrote earlier in the chapter,
we did not refer to the robots file on the Python website. A well
behaved crawler will first access the robots.txt file on a site. If the request fails because there is no robots.txt file (as is the case, for example, for http://unitguides.mq.edu.au/robots.txt) then we assume that any crawling is ok. If there is a response then we should read the file and check whether the URL we want to access is disallowed. Of course, if I'm an Evil Web Crawler, I can just ignore this file altogether.
Since the robots.txt
file does not need to be followed, a website that
finds it is being inundated with requests from a particular crawler
needs to take other steps to block it. If the crawler uses a particular
User-Agent string then the server could be configured to disallow that.
Alternatively, if all of the requests come from a particular IP address then that address could be blocked. Ultimately, if the source of the requests is a determined attacker it may be hard to block the traffic, but for scripts written by people who simply didn't know they were supposed to follow the rules, these measures are usually enough.
Parsing the robots.txt file ourselves would take some work; luckily, as Python developers, the work has been done for us in the urllib.robotparser module. This module will read the robots.txt file for a site and then tell you whether or not you are allowed to retrieve a given URL. We can use it to modify our implementation of get_page from earlier in the chapter. First we write a function to check that the URL we want to access is allowed:
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Dictionary to save parsed robots.txt files
ROBOTS = dict()

def robot_check(url):
    """Check whether we're allowed to fetch this URL from this site"""

    netloc = urlparse(url).netloc
    if netloc not in ROBOTS:
        robotsurl = urljoin(url, '/robots.txt')
        ROBOTS[netloc] = RobotFileParser(robotsurl)
        ROBOTS[netloc].read()
    return ROBOTS[netloc].can_fetch('*', url)
This implementation takes care not to request the same robots file more than once. We define a global dictionary ROBOTS and store the parsed robots file in it using the network location (domain name plus port, e.g. www.mq.edu.au or localhost:8080) as a key. Before reading the robots URL we check whether we have already stored one. If not, then we read the robots file and store the result in the ROBOTS dictionary. This is an example of caching results to avoid having to do the same task more than once; it also avoids sending repeated requests to the server for the same URL.
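Before wiring this into get_page, we can sanity-check the parser by feeding it the www.mq.edu.au rules shown earlier using the parse method; the image URL here is invented just for the test, and the expected results are shown as comments:

from urllib.robotparser import RobotFileParser

rules = """User-agent: SiteCheck-sitecrawl by Siteimprove.com
Allow: /images/

User-agent: *
Disallow: /archive/
Disallow: /cgi-bin/
Disallow: /Finance_Handbook/
Disallow: /images/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# the SiteCheck crawler is allowed into /images/ ...
print(rp.can_fetch('SiteCheck-sitecrawl by Siteimprove.com',
                   'http://www.mq.edu.au/images/logo.png'))            # True
# ... but other robots are not ...
print(rp.can_fetch('PWPBot', 'http://www.mq.edu.au/images/logo.png'))  # False
# ... although they can fetch pages that are not disallowed
print(rp.can_fetch('PWPBot', 'http://www.mq.edu.au/about/'))           # True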
The next step is to modify get_page
to run the check before retrieving
the URL:
def get_page(url):
    """Get the text of the web page at the given URL
    return a string containing the content"""

    if robot_check(url):
        fd = urlopen(url)
        content = fd.read()
        fd.close()
        return content.decode('utf8')
    else:
        return ''
After implementing this, we now have a well behaved robot crawler.
One final note. You will notice that the robots.txt
file identifies
robots based on their User-Agent
header. This is one of the standard
HTTP headers that every request should contain. Your browser uses this
to identify itself and a server can be configured to deliver different
content to different user agents. By default, the Python urllib
module
identifies itself as Python-urllib/3.4
(for Python 3.4), but it is
good practice to set the user agent to a unique value for your script.
To do this we can't use the simple urlopen
function, but it's not too
much more complicated. We need to use a custom URL opener that is
configured to send a different user agent header. Here's a final version
of get_page
that identifies our robot as "PWPBot":
import urllib.request

def get_page(url):
    """Get the text of the web page at the given URL
    return a string containing the content"""

    if robot_check(url):
        # open the URL with a custom User-agent header
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-agent', 'PWPBot/1.0')]
        fd = opener.open(url)
        content = fd.read()
        fd.close()
        return content.decode('utf8')
    else:
        print("Disallow: ", url)
        return ''