Scripting Forms

POST and GET Requests

You'll recall that when a user fills out a form on an HTML page, the browser will collect together the form data and send either a GET or POST request to the URL defined in the action attribute, depending on the method attribute. When the method is GET, the form data will be encoded and added to the URL in the GET request; when the method is POST, the form data is encoded in the same way but this time is sent as the body of the request.

Just as we can make requests for web pages using urlopen we can also mimic the behaviour of the browser in sending form data back to a server. By default, urlopen will send a GET request when given a URL but if we supply an extra data argument, it will send a POST request. We can use this facility to automate the submission of forms on the web.

There are many reasons one might wish to automate form submission.

This last example highlights how it might be possible to behave badly with this knowledge of how to script HTTP requests. It is relatively easy to submit large quantities of votes for online polls or entries for competitions with a script. If you are running such a website, you will want to guard against this kind of behaviour if you want your poll to be representative or your competition fair. We'll look at the problem of blocking this behaviour later in this chapter.

Sending a GET Request

<form method='GET' action='http://bing.com/search'>
    <label for='q'>Query:</label><input name='q'>
    <label for='first'>First Result</label><input name='first'>
    <input type='submit'>
</form>
     

This form can be used to send a query to the Bing search engine from Microsoft. There are other form fields that are normally sent but just sending the q field will get a page of results. I've also added the first field which tells Bing which result to start with - this is used in returning the different pages of results. (Note that Google used to support queries like this but has now made it more complicated to send queries from third party sites). To simulate this query from a script we need to send the form field in a GET request. This means encoding it as part of the URL following a ? character; so to search for 'cheese' we would send a GET request to http://bing.com/search?q=cheese&first=0. For a simple query this is just a string concatenation but in general, we are supposed to encode the query string so that characters that are not allowed in a URL are replaced by entities. To search for "Steve's Homepage" we need to add Steve%27s%+Homepage - for a discussion of the encoding scheme see Percent-encoding on Wikipedia.

Luckily we don't need to implement this encoding ourselves, urllib contains a function urllib.parse.urlencode to do this for us. Using this, we can write a function to submit a query to Bing as follows:

from urllib.parse import urlencode
from urllib.request import urlopen

def bingsearch(term, first):
    """Perform a search on Bing for this term, return 
    the resulting search page starting with the result
    number 'first'"""
    
    url = "http://bing.com/search"
    
    data = urlencode({'q': term, 'first': first})
    
    queryurl = url + "?" + data
    
    fd = urlopen(queryurl)
    content = fd.read()
    fd.close()
    
    return content.decode('utf-8')
     

The urlencode function takes a dictionary of form fields and values and returns a string with the encoded data. This string is then added to the end of the URL following a ? character and the page is requested as in the previous chapter. The resulting page will contain all of the links to matching pages found by Bing. We could use the get_links function from the previous chapter to extract the links from the page for further processing.

Sending a POST Request

<form action='/submit' method='POST'>
    <label for='name'>Name</label>:<input name='name'>
    <label for='age'>Age</label>: <input name='age'>
    <input type='submit'>
</form>
     

For an example of a POST request we will consider the above form that can be used to submit a name and age to a web application. If this form is served from the page http://127.0.0.1:8080/form then the URL we need to submit to is http://127.0.0.1:8080/submit. The code to submit the form is similar to the GET example above, but this time we need to pass the form data to the urlopen function instead of adding it to the URL. Here is the code:

def send_name_age(name, age):
    """Perform a search on Bing for this term, return 
    the resulting search page starting with the result
    number 'first'"""
    
    url = "http://127.0.0.1:8080/submit"
    
    data = urlencode({'name': name, 'age', age})
        
    fd = urlopen(url, data.encode('utf-8'))
    content = fd.read()
    fd.close()
    
    return content.decode('utf-8')
     

To pass data into urlopen it must be formatted as a byte string, this is the reason for the call to data.encode('utf-8'). Again the content returned will be whatever the web application returns in response to this form submission.

It is worth noting that urlopen will by default follow any "303 See Other" (redirect) responses that it gets in response to the initial request. That means that if you send a login request with a username and password which returns a redirect response, you will get back the contents of the redirected page.

In the example of a login page, it would be common to have a cookie returned with the response. If we wanted to carry out requests on behalf of the logged in user, we would need to save the cookie and send it back to the site in later requests, just as the web browser would do. Python provides an object called CookieJar to keep track of cookies. To use it we need to install a custom request handler for the urllib module:

import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
    

Once installed, the cookie jar will keep track of cookies in the same way that the browser does, sending them back to the appropriate sites when requests are made.

Blocking Automated Scripts

Given the ease with which we are able to write scripts to automate web requests, it would seem to be difficult to ensure that any poll or competition on the web could be run fairly. An application running such a service would need to take steps to limit automated responses. Some examples of how to do this are listed here, along with ways to get around them:

In the end it is an arms race between the people running the service and the people trying to automate submissions. Ultimately anything that can be done in a browser can probably be done by a script so we should be very careful about what transactions are allowed over the net and the level of trust we can place on them.