Cookies and User Tracking

HTTP is a stateless protocol, meaning that each request is separate and the server does not need to keep track of users in order to respond to requests. However, on the modern web, we often want to keep track of users and associate a sequence of requests with a single transaction. To achieve this, the cookie mechanism is layered on top of HTTP to allow some state to be maintained between requests.

However, since the early days of cookies, the idea of tracking users has become an industry in itself. Knowledge of users' browsing habits is a saleable commodity, and companies like Facebook and Google make their money by customising advertising to the interests of users based on knowledge of their browsing habits. Tracking browsing habits is done using cookies (and other mechanisms), so there is concern that using them violates user privacy. Based on this, there are now limits on what can be done with cookies.

This chapter explains how cookies are used in general and then looks at the specific case of user tracking across different sites. We'll then look at how modern web standards and browsers are getting in the way of simplistic tracking and forcing companies to work harder to acquire user data.

Cookies are a mechanism for maintaining state in an HTTP transaction. They allow a server side application to store some data with the client which is returned each time the client makes a request to the same server.

Cookies are created when a server response includes a Set-Cookie header. When this is received, the browser stores the cookie for future use, associating it with the URL that the response came from. Depending on the settings in the cookie, it can be kept for the current browser session, for a period of time or indefinitely.

When a cookie has been stored for a given site URL, all subsequent requests to that site (subject to some controls) will contain a Cookie header that sends the cookie back to the server. In this way, the server can identify the user based on the cookie contents or use those contents in some way as state information in a transaction with that user.

Let's look at an example of both kinds of header. Here's a response from a server that sets a cookie:

HTTP/1.0 200 OK
Date: Wed, 21 Mar 2012 03:18:25 GMT
Server: WSGIServer/0.1 Python/2.7.2+
content-type: text/html
Set-Cookie: likes=cheese

The last header line contains a cookie called 'likes' with the value 'cheese'. By default, the browser will store this locally and send it back with any request to the same URL. Here is a request that includes the same cookie:

GET / HTTP/1.1
Host: localhost:8000
Connection: keep-alive
User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/17.0.963.56 Chrome/17.0.963.56 Safari/535.11
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Referer: http://localhost:8000/?like=cheese
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Cookie: likes=cheese
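
As a rough illustration of how an exchange like this could be produced on the server, here is a minimal sketch of a WSGI application in Python. The names app and call and the response text are assumptions made for this example; they are not taken from the server that generated the headers above.

```python
from http.cookies import SimpleCookie


def app(environ, start_response):
    """Minimal WSGI app: read the 'likes' cookie, or set it if absent."""
    # The browser's Cookie header arrives as HTTP_COOKIE in the WSGI environ.
    cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    headers = [("Content-Type", "text/html")]

    if "likes" in cookies:
        body = "You like " + cookies["likes"].value
    else:
        # No cookie yet: ask the browser to store one.
        headers.append(("Set-Cookie", "likes=cheese"))
        body = "Nice to meet you"

    start_response("200 OK", headers)
    return [body.encode("utf-8")]


def call(application, cookie_header=None):
    """Invoke the app directly (no network) and capture what it returns."""
    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": "/"}
    if cookie_header is not None:
        environ["HTTP_COOKIE"] = cookie_header
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    print(call(app))                  # first visit: response sets the cookie
    print(call(app, "likes=cheese"))  # later visit: cookie is sent back
```

The helper call stands in for the browser here, so the two requests can be tried without starting a server.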

The cookie has two main parts: the name and the value. The name should be alpha-numeric with no whitespace or special characters, although it can include dash - and underscore _. The value should also consist of alpha-numeric characters, without spaces, double quotes, commas, semi-colons or backslashes. Browsers typically allow around 4096 bytes for a whole cookie, but much smaller values are most common.

While the name and value are the main parts of the cookie, there are also a number of parameter settings that can be sent along with it in the Set-Cookie header.

Here's an example of a cookie header with some additional parameters:

Set-Cookie: sessionID=2c014545; Secure; SameSite=None; Expires=Wed, 09 Jun 2021 10:18:14 GMT; Path=/;

Each parameter comes after the name=value part, separated by semicolons.
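
A header like this does not have to be assembled by hand. The following is a small sketch using Python's standard http.cookies module to build the same cookie; the function name build_session_cookie is an assumption for this example, and the samesite key needs Python 3.8 or later.

```python
from http.cookies import SimpleCookie


def build_session_cookie():
    """Return the value part of a Set-Cookie header for the example cookie."""
    cookie = SimpleCookie()
    cookie["sessionID"] = "2c014545"

    morsel = cookie["sessionID"]
    morsel["path"] = "/"
    morsel["secure"] = True        # flag parameter, has no value
    morsel["samesite"] = "None"    # supported from Python 3.8
    morsel["expires"] = "Wed, 09 Jun 2021 10:18:14 GMT"

    # OutputString() gives "name=value; param; param=value; ..."
    return morsel.OutputString()


if __name__ == "__main__":
    print("Set-Cookie:", build_session_cookie())
```

The library writes the parameters in its own order and capitalisation, which may differ from the header shown above. That is harmless: parameter names are case-insensitive and their order does not matter.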

Domain and Path

By default, a cookie is sent back to the URL it came from and any URLs where that is a prefix. So, if a cookie comes from http://example.org/fruit/ it would be sent back to http://example.org/fruit/ and http://example.org/fruit/apples/ but not to http://example.org/vegetables/ or to http://www.example.org/fruit/.
The Domain parameter allows the server to dictate where the cookie will be sent back to. So, Domain=example.org would mean that it would be sent to both example.org and any sub-domains such as www.example.org.
The Path parameter allows the server to say which URL paths the cookie should be sent back to; Path=/ means that it will be sent to all paths in this domain even if it originated at /fruit/.

However, a server is not allowed to include a different domain in a cookie header. If the server at example.org returned a cookie with Domain=evil.org then that will not result in the cookie being sent in future requests to evil.org. The cookie would be deemed invalid and ignored.
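
The matching rules for Domain and Path can be sketched in a few lines of Python. This is a simplified model, not a browser's actual implementation: the function names are assumptions for this example, and real browsers apply further checks (for instance, they refuse a Domain that is a bare top-level domain).

```python
def domain_matches(request_host, cookie_domain):
    """True if a cookie with Domain=cookie_domain applies to request_host."""
    request_host = request_host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    # Exact match, or request_host is a sub-domain of cookie_domain.
    return (request_host == cookie_domain
            or request_host.endswith("." + cookie_domain))


def path_matches(request_path, cookie_path):
    """True if a cookie with Path=cookie_path applies to request_path."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # Only match on whole path segments: /fruit matches /fruit/apples
        # but not /fruitcake.
        return (cookie_path.endswith("/")
                or request_path[len(cookie_path)] == "/")
    return False


def server_may_set(response_host, cookie_domain):
    """A server may only name its own host or a parent domain of it."""
    return domain_matches(response_host, cookie_domain)


if __name__ == "__main__":
    print(domain_matches("www.example.org", "example.org"))  # sub-domain
    print(server_may_set("example.org", "evil.org"))         # other domain
    print(path_matches("/fruit/apples/", "/fruit/"))
    print(path_matches("/vegetables/", "/fruit/"))
```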

Expires and Max Age

Cookies can be kept in the browser indefinitely or for a fixed period. Two properties define how long. Expires=<date> says that the cookie should be kept until the given date (in UTC format <day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT). Max-Age=<seconds> says that it should be kept for the given number of seconds.

If neither of these parameters is present, the cookie will be kept until the end of the current browsing session. This can be a long time if users don't shut down their browser, or if the browser restores the previous 'session' when it restarts, which is common.
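
To make the three cases concrete, here is a sketch of how a client might work out when a cookie expires. The function name expiry_time and the use of a plain dictionary for the parameters are assumptions for this example. When both parameters are present the sketch prefers Max-Age, which is the precedence given in the cookie specification (RFC 6265).

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def expiry_time(parameters, now):
    """Return when the cookie expires, or None for a session cookie.

    parameters is a dict of cookie parameters, e.g. {"Max-Age": "3600"}.
    now is the (timezone-aware) time at which the cookie was received.
    """
    if "Max-Age" in parameters:
        # Max-Age is relative to the time the cookie was received.
        return now + timedelta(seconds=int(parameters["Max-Age"]))
    if "Expires" in parameters:
        # Expires is an absolute date in the HTTP date format.
        return parsedate_to_datetime(parameters["Expires"])
    return None  # neither present: keep until the browsing session ends


if __name__ == "__main__":
    received = datetime(2021, 6, 1, tzinfo=timezone.utc)
    print(expiry_time({"Max-Age": "3600"}, received))
    print(expiry_time({"Expires": "Wed, 09 Jun 2021 10:18:14 GMT"}, received))
    print(expiry_time({}, received))
```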

Different uses of cookies will make different choices here. For a secure banking application, a very short-lived cookie would be used to maintain a login session so that re-login will be required after a short period. For a social media application, a longer expiry would be set to make it more convenient to the user and not require frequent logins.

Secure and SameSite

If the Secure parameter is present, the cookie will only be returned in a request if it goes over a secure https channel.

The SameSite parameter controls whether a cookie is sent when the request is for a resource embedded in another page. For example, a page at http://example.com/ contains an image that is hosted at http://advertising.com/; the request for the image returns a Set-Cookie header so that the browser now holds a cookie for that site. If the page is requested again, the browser will look at the SameSite parameter on the cookie to decide whether to return it with the request for the image.

If SameSite=Strict, then the cookie will only be sent back for same-site requests - that is if the user is visiting a page on advertising.com. If the request is for an image embedded in example.com the cookie will not be sent. Even if the user clicks on a link in the page that sends them to advertising.com, the cookie will not be sent with the request.

If SameSite=Lax (which is the default in many modern browsers if SameSite is not mentioned), the only difference is the last case, where a user clicks on a link to navigate to another site. In this case, the cookie for the other site will be sent.

The final option is SameSite=None which means that the cookie will be sent with any cross-site requests. However, this will only work if Secure is also set, so this will only work over an https connection. This option is needed if you want to keep track of users across multiple sites (see below).

Some browsers (e.g. Firefox) block or restrict cross-site cookies by default, even those marked SameSite=None, as part of their tracking protection (Enhanced Tracking Protection in Firefox).
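
The decision a browser makes for the three SameSite settings can be summarised as a small function. This is a simplified sketch of the rules described in this section, not a browser's real logic: the function and argument names are assumptions for this example, and real browsers add further conditions (for example, Lax only applies to navigations using safe methods such as GET).

```python
def should_send_cookie(samesite, secure, same_site_request,
                       top_level_navigation=False, https=True):
    """Decide whether a stored cookie accompanies a request.

    samesite             -- "Strict", "Lax" or "None"
    secure               -- True if the cookie has the Secure parameter
    same_site_request    -- True if the request goes to the site the user
                            is currently visiting
    top_level_navigation -- True if the user followed a link to the site
    https                -- True if the request uses a secure channel
    """
    if secure and not https:
        return False              # Secure cookies never travel over http
    if samesite == "None":
        return secure             # None only works together with Secure
    if same_site_request:
        return True               # Strict and Lax both allow same-site use
    if samesite == "Lax":
        return top_level_navigation  # cross-site: only when following a link
    return False                  # Strict: never sent cross-site


if __name__ == "__main__":
    # An embedded cross-site image request, for each setting.
    for setting in ("Strict", "Lax", "None"):
        print(setting, should_send_cookie(setting, True, False))
```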

HttpOnly

A cookie marked as HttpOnly is not accessible from Javascript code running in the browser page. Any other cookie can be read and modified by Javascript code.

User Tracking with Cookies

A very common use-case for cookies is tracking user browsing habits for the purposes of building a profile. This is done, for example, by advertising companies like Google, whose business is to serve advertising to any site on the web. Let's look at how this tracking takes place.

A website https://example.com/ agrees to host advertising from https://ad.com/ in return for a fee per page view. To do so, an image is embedded in the pages of example.com with the url https://ad.com/banner.png.

<img src='https://ad.com/banner.png' alt='advertising'>

When a user visits the example.com home page, the browser will send a request to ad.com for the image. The response contains a Set-Cookie header:

Set-Cookie: sessionid=k91j30d81ked; Path=/; Secure; SameSite=None; Expires=Sun, 09 Jun 2024 10:18:14 GMT;

As the user views the page on example.com they will see the persuasive advertising image. The ad.com server creates a database entry for user k91j30d81ked and adds https://example.com to their browsing history.

The user browses to a new page https://example.com/bears/ which also contains an embedded advertising image from ad.com. Since the SameSite parameter is set to None, the browser will send the cookie along with the request and so the server at ad.com will know that this is the same user as before and can update the browsing history.

The server at ad.com can also keep track of how much it needs to pay example.com by counting the number of requests it gets with a Referer header of https://example.com. (Note that the header name really is spelled Referer in HTTP, and that by default it only shows the top level address of the referring site, not the sub-page within it.)

Now, ad.com also hosts advertising at another site, http://bobalooba.com/, which again embeds advertising images in its pages. When a user who has previously visited example.com visits a page on bobalooba.com, their browser will request the advertising image and send along the cookie. The ad.com server again adds to the user's browsing history and now knows that this user, anonymously identified by their session id, is interested in the content on both of these sites.

If ad.com can sell advertising services to many sites, they can build up a profile of each user's interests and start to serve custom advertising to them. If user k91j30d81ked tends to browse sites about cycling, they can be served ads from cycling suppliers. This is how the business of modern web advertising is built.
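
The server side of this scenario can be sketched as follows. This is only an illustration of the bookkeeping described above, under stated assumptions: the class AdServer, its method and attribute names, the random session id and the one-year Max-Age are all invented for this example and do not describe any real advertising system.

```python
import secrets
from collections import defaultdict
from http.cookies import SimpleCookie


class AdServer:
    """Toy model of ad.com handling requests for its banner image."""

    def __init__(self):
        self.history = defaultdict(list)    # session id -> sites visited
        self.page_views = defaultdict(int)  # referring site -> view count

    def handle_banner_request(self, request_headers):
        """Process one image request; return (session id, response headers)."""
        cookies = SimpleCookie(request_headers.get("Cookie", ""))
        response_headers = {}

        if "sessionid" in cookies:
            # Returning browser: we already know this user.
            session_id = cookies["sessionid"].value
        else:
            # New browser: issue an identifier usable across sites.
            session_id = secrets.token_hex(6)
            response_headers["Set-Cookie"] = (
                f"sessionid={session_id}; Path=/; Secure; "
                "SameSite=None; Max-Age=31536000"
            )

        referer = request_headers.get("Referer")
        if referer:
            self.history[session_id].append(referer)  # build the profile
            self.page_views[referer] += 1             # count for billing
        return session_id, response_headers


if __name__ == "__main__":
    server = AdServer()
    sid, _ = server.handle_banner_request({"Referer": "https://example.com/"})
    server.handle_banner_request(
        {"Referer": "http://bobalooba.com/", "Cookie": f"sessionid={sid}"})
    print(sid, server.history[sid])
```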

Of course, this same pattern can serve more than advertising. Being able to learn about the profile of a user is valuable to many groups and there have been many instances of abuse of this information for various reasons. For this reason, cookie tracking is viewed suspiciously. Regulations in the EU mean that websites serving European users must generally ask for permission before setting cookies that are not strictly necessary for the site to function (this applies to first-party cookies as well as third-party ones) - which is why we often see the cookie permission pop-up when first visiting a site.

Some web browsers, notably Firefox, have a default setting that blocks third party cookies like these. This prevents advertisers tracking the user's browsing behaviour and disrupts the advertising model. To get around this, advertising companies can use alternate methods of tracking users.

User Tracking without Cookies

Since user-tracking data is so valuable, the companies that collect it have found alternate ways to track users that don't rely on cookies, which can easily be disabled. This is a bit of an arms race, so no list of these techniques will ever be exhaustive. Here are some alternate methods that have been used.

ETag Header

The ETag Header is part of the HTTP specification used for cache validation. The value supplied in an ETag header is a unique identifier for a particular version of a resource. If that resource changes (a page is updated), then the generated ETag value would be different. If a browser has a version of a resource in its local cache, it will send the ETag of that version along with an HTTP request (as the If-None-Match header). If the version is unchanged, the server will respond with a 304 Not Modified response, meaning that the browser should use the version it already has.

Since the ETag will be sent back to the server with subsequent requests, it can be used to track users. The tracking server simply sends a unique ETag per user with a given resource (maybe a javascript file or stylesheet). When the user sends a subsequent request for that resource, it sends the ETag which allows the site to track the user.
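
A sketch of the server side of this technique is shown below. The class name ETagTracker, the random tag format and the Cache-Control value are assumptions for this example; the point is only that the tag the browser echoes back in If-None-Match works as an identifier.

```python
import secrets


class ETagTracker:
    """Toy server that uses a per-user ETag as a tracking identifier."""

    def __init__(self):
        self.visits = {}  # ETag value -> number of requests seen

    def handle_request(self, request_headers):
        """Return (status code, response headers, body) for the resource."""
        etag = request_headers.get("If-None-Match")

        if etag in self.visits:
            # The browser sent back a tag we issued: same user as before.
            self.visits[etag] += 1
            return 304, {"ETag": etag}, b""

        # Unknown browser: issue a fresh, unique tag with the resource.
        etag = '"' + secrets.token_hex(8) + '"'
        self.visits[etag] = 1
        headers = {
            "ETag": etag,
            # Ask the browser to revalidate on every use, so the tag is
            # sent back each time the resource is needed.
            "Cache-Control": "private, no-cache",
        }
        return 200, headers, b"/* shared script or stylesheet */"


if __name__ == "__main__":
    tracker = ETagTracker()
    status, headers, _ = tracker.handle_request({})
    print(status, headers["ETag"])
    status, _, _ = tracker.handle_request({"If-None-Match": headers["ETag"]})
    print(status, tracker.visits[headers["ETag"]])
```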

A user can clear any ETag trackers by clearing the browser cache.

Browser Cache

The browser will store copies of recently accessed resources in a cache to make future requests for the same pages faster. A server can send a version of a Javascript file containing a unique identifier for a user:

...
sessionid='k91j30d81ked';
...

This file will be cached and re-used whenever it is referenced again in another page. The tracker can then send this value along with requests for resources like advertising images, e.g. by appending the value to the end of the image URL. In this way, the user can be tracked without the use of a cookie.
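
On the server, generating such a file might look like the sketch below. The function names, the one-year cache lifetime and the sessionid query parameter are assumptions for this example; the image URL is the one used earlier in this chapter.

```python
import secrets
from urllib.parse import urlencode


def tracking_script_response():
    """Build a per-user Javascript file that the browser will cache."""
    user_id = secrets.token_hex(6)
    body = f"sessionid='{user_id}';\n"
    headers = {
        "Content-Type": "text/javascript",
        # A long lifetime keeps the same copy (and so the same id)
        # in the browser cache.
        "Cache-Control": "private, max-age=31536000",
    }
    return user_id, headers, body


def banner_url(user_id):
    """URL the page script would request, with the cached id appended."""
    return "https://ad.com/banner.png?" + urlencode({"sessionid": user_id})


if __name__ == "__main__":
    uid, headers, body = tracking_script_response()
    print(body, banner_url(uid))
```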

Again, these trackers could be removed by clearing the browser cache.

Browser Fingerprinting

A final example is the use of browser fingerprinting. Here, the tracker uses Javascript code to probe for features in the user's browser, for example the operating system, the browser version and the collection of plugins installed. If enough features can be collected, there is a very low probability of the combination being exactly the same as that of another user. Hence, the user can be tracked by reference to this browser signature.
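
Once the features have been collected, turning them into a signature is straightforward. The sketch below shows one possible way to do it on the server by hashing the collected values; the function name, the choice of SHA-256 and the example feature names are assumptions, not a description of any particular tracker.

```python
import hashlib
import json


def fingerprint(features):
    """Reduce a dict of browser features to a short, stable identifier."""
    # Serialise with sorted keys so the same features always give the
    # same text, whatever order they were collected in.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    browser = {
        "os": "Linux i686",
        "browser": "Chromium 17.0.963.56",
        "plugins": ["pdf-viewer", "flash"],
        "language": "en-GB",
    }
    print(fingerprint(browser))
```

The more features are included, the less likely it is that two different browsers produce the same identifier.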

It is very difficult to protect against browser fingerprinting as a tracking measure.