The Uniform Resource Locator (URL)
Let's look at the anatomy of a URL:
http://www.mq.edu.au/about/profile/history.html
The first part of the URL is http://
, this is often left out when URLs
are written, so we might see the above as
www.mq.edu.au/about/profile/history.html
. This will work when you type
it into your web browser because the browser will assume you meant this
as an HTTP URL. However, if we are being pedantic, the prefix is
required as it tells us something about the web address - that we should
use the HTTP protocol to access it.
URLs are not only useful for HTTP addresses. Other schemes are allowed
and they tell the client how to get access to the content that the URL
describes. The most common you will see is https://
which is the
secure variant of HTTP (we'll find out about that later). HTTPS uses the
same protocol as HTTP but over a secure connection so that information
that is exchanged can't be intercepted. Another scheme is ftp://
which
tells the browser to use the older FTP protocol to access the content.
Finally, you might see a mailto
URL which allows us to write a link
that triggers the browser to start composing an email message. This has
a slightly different form: mailto:Steve.Cassidy@mq.edu.au
but it's
still a URL. In fact, the pattern looks like this:
<scheme>:<scheme-specific-part>
For the HTTP scheme the pattern is:
http://<net_loc>/<path>;<params>?<query>#<fragment>
The first part <net_loc>
is the network location of the resource: the
domain name of the web server that holds the web page we want. This is
preceded by two forward slashes and followed by a single forward slash.
Then follows the <path>
which the server will use to identify which
page we want. The ;<params>?<query>#<fragment>
are used to further
qualify the resource that we want; we'll see some examples of how they
are used later in the book.
The URL is a unique form of identifier in that it has two roles. First, it identifies a document, video or music file that's out on the web (we call these things resources). Each resource has a unique URL and you can refer to them via the URL. If we two people refer to the same URL, then they know they have read the same document. Secondly, the URL contains enough information for a web client to retrieve a copy of the resource. The client can use the scheme defined (HTTP) to connect to the server and send an HTTP request for the path part of the URL.
There are other kinds of formal identifier that don't have this second property. For example, every book has an ISBN (International Standard Book Number) which uniquely identifies it. However, the ISBN contains no information about how to get a copy of the book. To do so, you'd need to look up the ISBN in a catalogue which might tell you who the publisher was and then contact them for a copy. Alternately you could go to a book-shop or library and use their catalogue to find the book. With a URL, there's no need for a catalogue or third party service to decode the identifier. The information on how to retrieve the resource is right there in the Uniform Resource Locator.
The Internet Engineering Task Force (IETF) is as close as we get to a governing body for the Internet and it's home to many of the standards that define how the Internet and the Web work. The standards documents are called Request for Comments or RFC, a name which reflects the democratic nature of the early Internet. RFC2068 defines HTTP version 1.1 and RFC1738 defines the URL. Look out also for RFC2324 which can be relevant after a long coding session.
Uniform Resource Identifiers
There's another similar term that is sometimes used instead of URL: Uniform Resource Identifier (URI). URI is actually a more general term that includes URLs and Universal Resource Names (URNs). URNs are identifiers that don't contain any locator information such as the ISBN. They have a formal syntax so I can cite a particular ISBN as a URN as URN:ISBN:978-0-395-36341-6 (an example from the IETF RFC3187 document that defines the standard). As explained above, we need a third party service to resolve a URN into a real location (a URL). You might see the term URI used in discussions of web technologies; it is often used instead of URL but generally means the same thing if we're talking about HTTP based services.
Since URLs are both names and addresses it becomes important that once you put a resource on the web, it stays there. This is discussed in Tim Berners-Lee's paper called "Cool URIs don't change". It's a principle that all web designers should bear in mind as it is easy to violate as we re-build old web-sites.
Absolute and relative URLs
When we include a URL in a web page there are a number of choices about how it can be written. The first option is to include the full URL:
<a href="http://example.org/static/style.css">
This is clearly a link to a resource on the server at example.org using the http protocol. However, if this page is being served by the server at example.org then we could also write this link as:
<a href="/static/style.css">
This is known as a relative URL and is interpreted relative to the URL of the
page that is currently being viewed. Let's assume that this page is at the URL
http://example.org/about/info.html
. Note that the URL above starts with a /
character, in this case, the browser will interpret this URL by adding the
protocol (http
) and domain (example.org
) parts of the page URL to this one
to get http://example.org/static/style.css
.
Another way to write a URL is:
<a href="static/style.css">
In this case there is no /
at the start of the URL and so this is interpreted relative to the
page URL by removing everything but the last part of the URL http://example.org/about/
and
then adding the new URL text to get http://example.org/about/static/style.css
. Note that
this is different to the previous example. If the intention was to reference the same
URL as before, you could use the URL:
<a href="../static/style.css">
Here the ..
path refers to the parent directory (thinking of these paths as filenames)
so the path becomes /about/../static/style.css
which can be shortened to /static/style.css
.
The different URL styles have different uses. If you are writing a static HTML
page that you will view on your local computer and perhaps host on the web
somewhere (but you're not sure where) then you might put all of your files in
one directory and use a relative URL like static/style.css
to refer to linked
resources. This ensures that your directory of files could be dropped anywhere
into a web server file system and it would be self-contained and the links would
work.
However, if you are writing a web application that will be hosted on some domain
(like example.org
) you know that you have control of all URLs on that site and
so would more likely use a relative URL starting with a /
(like
/static/style.css
). This is important if your web application has complex
routes (eg. pages like /users/steve/profile/edit
) and you want to ensure that
whatever the page URL being served, the links to stylesheets and other resources
will work. The application can be deployed on different servers (like
example.org
and example.com
) and the links will still work because they
don't mention the domain name at all.
Finally, absolute URLs (like http://example.org/static/style.css
) will be used when we
know that this URL is fixed at this location. It may be something external to our own
site, or something that we know will be hosted at this URL on our site. There is
an argument from a Search Engine Optimisation (SEO) viewpoint that all URLs in a
website should be absolute urls even when they refer to assets on the same site.
The reasons relate to the way that search engines crawl sites and the possible duplication
of content if two URLs point to the same page (eg. https://example.org/
and http://example.org
).
One final form of relative URL looks like this:
<a href="//example.org/static/style.css">
This URL is only missing the protocol part and is turned into an absolute URL by adding the protocol part of the current page URL. So if the current page was requested over http or https, this URL will use the same protocol. This is often useful if a site can be served over both protocols although it is increasingly the case that https is being used everywhere so this may become less common as time progresses.