So far we've looked at the core Python language, everything we've seen is built into Python. One of the major features of Python though is the very broad range of modules that are shipped with the standard distribution. These modules do everything from cryptography to parsing web pages and sending email and as you learn to develop in Python you will learn more about their capabilities.
The Python standard library is documented in the Module Index section of the Python manual. You'll find chapters on all of the modules that include examples of how to use them; it's well worth skimming through it and exploring anything that looks interesting so that when you're working on a problem, you remember that there is a relevant module. In this course we'll only use a few modules and probably only a small part of each of them.
In some cases, the problem you have isn't solved in the standard library but someone has published a third party module that does the job. The Cheese Shop is a register of third party modules and installing them is easy thanks to the pip tool that will download, build and install a module from the command line.
To use a module in your application, you need to import it into your
program. You do this with the
import statement before you use anything
from the module. For example, to use the
listdir procedure from the
os module (which returns a list of files in a directory) we would
import os print(os.listdir("."))
(The argument to
listdir is the directory to list, I gave it "." which
stands for the current directory). Note the dot notation between the
os and the procedure name. The dot is used in Python as a
general way to refer to components of objects, we saw it when we looked
at strings and sequences (remember
In this case the we're referring to a procedure inside a module.
Sometimes you'll see multiple dots because modules can be nested, e.g.
Sometimes you just want one procedure from a module, in this case you
from xxx import yyy notation. This allows you to use the raw
name of the procedure without the module name:
from os import listdir print(listdir("."))
You can import more than one name:
from os import listdir, getcwd print("Current directory:", getcwd()) print(listdir("."))
You can even import all the exported names from a module
from os import *). However this is discouraged because it generally
means that you didn't think it through - the danger is that you import a
name that clashed with something in your own program.
Finally, you can change the names that you import. You might do this if the name you were importing might clash with another name in your program of from another module.
from os import listdir as showMeTheFiles print(showMeTheFiles("."))
The rest of this chapter will briefly describe some useful Python modules.
This module contains various procedures for accessing operating system services in a platform independant manner. For our purposes, the main things we will use are to access the file system although the module also provides access to running processes. A few useful procedures are:
os.listdir(dirname)returns a list of the names of the files in the directory
os.getcwd()returns the current working directory that your script is running from.
os.mkdir(dirname)creates a new directory with the given name
os.path provides many procedures to help manipulate file
names (path names including the directory names) in a platform neutral
os.path.join(dirname, filename)joins the two names together with the right directory separator for the platform, that is, use a forward slash (/) on Mac and Linux and a backward slash (\) on Windows. You should use this to manipulate path names so that your code will work on any platform.
os.path.split(pathname)returns a tuple (head, tail) where
tailis everything after the last directory separator and
headis everything before it.
os.path.basename(pathname)returns the last part of the path name, after the last directory separator.
os.path.exists(filename)returns True if the filename is an existing file, False if not.
os.path.isdir(name)returns True if the name is an existing directory, False otherwise.
re Module: Regular Expressions
Regular expressions are patterns that match all or part of a string,
they are very useful in any kind of text processing application and
every programmer should understand how and when to use them. The Python
re module provides support for various operations using regular
expressions, this section gives some details of the most useful parts of
A regular expression is a pattern specified in a special language where
certain symbols affect the way that strings are matched. For example,
the '.' character will match any character in a string, so the regular
'...' will match any sequence of three characters, whatever
they are. The full language is quite complicated but the most useful
parts are simple to master. We'll first look at a simple example pattern
and show some of the things you can do with it, then describe the most
useful parts of the regular expression language.
If I want to find all of the dollar amounts in a text I can write a simple regular expression that matches things like $4,000, $3, $200 etc. (for simplicity, I'm ignoring amounts with a decimal point). Basically I'm looking for a dollar sign, followed by one or more digits, possibly with commas. That last sentence gives a model for the regular expression I need to write, I just need to encode that in the regular expression language:
- A dollar sign:
'\$'the dollar sign is special in the regular expression language, it means match at the end of a string, so we need to preceed it with a backslash to just match a dollar.
- followed by one or more digits possibly with commas:
'\$[0-9,]+'The square bracket notation matches any character from the set in brackets. In this case I include the digits 0-9 (shorthand for including all the digits) and the comma character. The + sign after the brackets means do this one or more times.
So the final pattern we have is
'\$[0-9,]+', this should match any
dollar amount in a text. Let's see how to apply this to search for
occurrences of dollar amounts in a text.
re.search procedure finds matches to a pattern in a string. The
returned value from the
re.search method is a match object which
contains details of the matching text. If there is no match then it
None. So, the first use of this procedure is to simply test
whether the pattern is present or not, since None evaluates to False and
the match object to True in a test, we can write:
import re def string_has_money(text): """Return True if this text string contains a dollar amount, False otherwise.""" dollar_re = r'\$[0-9,]+' if re.search(dollar_re, text): return True else: return False
Next I might want to find what the dollar amounts are in a string. For
this I need to interrogate the return value of the
procedure. As I said, this is an object and we can call the
method to find what was matched:
import re def get_money_from_string(text): """Return the first dollar amount from the text string or None if none is found.""" dollar_re = r'\$[0-9,]+' match = re.search(dollar_re, text) if match: return match.group() else: return None
I might also want to find all of the matches to this pattern. There are
two ways to achieve this. The procedure
re.findall returns a list of
strings that match the pattern and the procedure
you to iterate over the matches with a for loop. So, I can find all
dollar amounts like this:
import re def get_all_money_from_string(text): """Return a list of the dollar amounts from the text string or the empty list if none is found.""" dollar_re = r'\$[0-9,]+' return re.findall(dollar_re, text)
Finally, I might want to censor a text by replacing any dollar amount
with $$$$CAPITALISM_SUCKS!!!!. I could do this with the
import re def censor_text(text): """Replaces any dollar amount in the text string with a suitable anti-capitalist message. Returns the resulting string.""" dollar_re = r'\$[0-9,]+' message = "$$$$CAPITALISM_SUCKS!!!!" return re.sub(dollar_re, message, text)
These are just a few of the things you can do with the regular
expression library. In particular the match object that is returned by
re.search has many more capabilities that are useful for more
More on Regular Expression Patterns
Here are the most useful parts of the regular expression language. A more complete reference can be found in the Python documentation.
- Matches any character except newline
- Used to indicate a set of characters and matches any one of those characters. You can include ranges like [a-z], [0-4], [A-F] as well as explicit sets like [abc]. Special characters like . and + lose their meaning inside of square brackets so [0-9.] matches either a digit or a period.
- Matches a decimal digit, equivalent to '[0-9]'
- Matches any whitespace character, space, tab or newline etc.
- Matches any non-whitespace character, the opposite of \s.
- Matches any alphanumeric character or underscore, equivalent to [A-Za-z0-9_].
- A modifier which means that the previous pattern matches zero or more times.
- A modifier which means that the previous pattern matches one or more times.
- A modifier which means that the previous pattern is optional (matches zero or one time).
- Regular expressions can be used to locate HTML tags in a web page.
When a search engine is creating an index of a web page, it must
remove the tags to leave just the words behind. Write a procedure
remove_htmlwhich takes a string containing HTML text and returns the text with all of the HTML tags removed. Remember that HTML tags can be upper or lower case and can opening tags can contain attributes.
- Another requirement for a search engine processing a web page is to extract all of the URLs contained in a page. Write a procedure that uses a regular expression to match all of the anchor tags (<a href="http://www.example.com/">) and returns a list of the URLs found. Assume that anything inside of the href attribute is a URL but note that either single or double quotes or no quotes at all are allowed around the attribute value.
Copyright © 2009-2012 Rolf Schwitter, Steve Cassidy, Macquarie University
Python Web Programming by Steve Cassidy is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.