Databases
Following along will be easiest if you have the IPython.lib.demo module.
The demo module should exist in the current enthought distribution of ipython (v.0.11). However, it seems to not be in the distribution most people have downloaded. Check by typing ‘from IPython.lib.demo import Demo’ into ipython. If you have a file called demo.py in …/python2.7/site-packages/IPython/lib/demo.py , then you have the demo functionality you need. If not, please make sure that you have such a file in such a place (on a Mac, this is likely to be /Library/Frameworks/Python.framework/Versions/7.1/lib/python2.7/site-packages/IPython/lib/demo.py) . Here’s a link to the demo.py file you need: http://ipython.scipy.org/svn/ipython/ipython/tags/rel-0.8.0/IPython/demo.py
What are databases?
Databases have Tables, and each table has Records and Attributes (Rows and Columns).
Rows Columns – >
|
v
Records Fields – >
|
v
Databases come in numerous formats.
Relational
In relational databases, each record is represented by an integer primary key. The type of database we’ll be working with today, SQL, is typically used with relational type database formats.
Rows are called tuples, columns are called attributes, and a table is called a relation.
Hierarchical
In hierarchical databases, each dataset can contain other datasets, tables, and attributes, so that data can be arranged in a hierarchical fashion. The canonical type of hierarchical database is HDF5. Interestingly, with a non-standard keyword “CONNECT”, SQL can be given the capability to query information hierarchically, this was not what it was designed for.

Hierarchical databases have datasets held in dataspaces within groups.
Network
In network databases, datasets are nodal, with shared entries as connections. This can be thought of, on a fundamental level, as a particular type of relational database.
Why use databases?
The limitations of spreadsheets are many.
- accidental sorting
- size limitations (row/column length slowdown)
- unscriptable querying, manipulating, analyzing
- interface limited
- format portability issues
The advantages of databases are also many.
- unlimited size
- independence from interface
- scriptable data retrieval, manipulation
Today we’ll be discussing SQL, the structured query language.
Queries -> Processing -> Table of results
Operations
Selecting - Extracts fields. SELECT <fields> FROM <tablename>
Filtering – selects according to conditional statements : … WHERE <boolean>
Inserting – puts new entries into the database : INSERT INTO <tablename> VALUES (<data>)
Deleting - removes entries from the database : DELETE <fields> FROM <tablename> WHERE <boolean>
Sorting – orders table by value in a field : … ORDER BY <field> (DESC/ASC)
Aggregating - Adds new calculated fields :
- AVG() - Returns the average value
- COUNT() - Returns the number of rows
- FIRST() - Returns the first value
- LAST() - Returns the last value
- MAX() - Returns the largest value
- MIN() - Returns the smallest value
- SUM() - Returns the sum
Joining – Combines tables or rows : SELECT <fields> FROM <table1> JOIN <table2> ON <field>
It is important to import the proper modules. For this, you’ll need to import sqlite3.
import sqlite3
The first task at hand is to connect to the database. In python, we read the database into memory as an sqlite3 database connection object.
db = "ciTests.db"
if not os.path.exists(db) :
raise Exception, "Error: could not find specified database."
dbConn = sqlite3.connect('ciTests.db', detect_types=sqlite3.PARSE_COLNAMES)
dbConn.row_factory = sqlite3.Row
if type(dbConn) != sqlite3.Connection :
raise Exception, "Error, not a valid database connection."
c = dbConn.cursor()
The next task is to issue commands. The first command we’ll discuss is SELECT. The syntax is SELECT <field> FROM <Table> The asterisk * is “regular expression” meaning “everything”.
c.execute("SELECT * FROM PersonTable")
Will produce :
['personID', 'username', 'birthyear', 'birthmonth', 'gender', 'CI'] 1 scaryCarrot 1999 7 M 7 2 madonnaSpice 1998 10 F 14 3 beyonceBroccoli 1998 5 M 15
The table name must match one of the table names in the database
c.execute("SELECT * FROM ResultsTable")
Will result in:
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination'] 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6
We can just select a single field of interest
c.execute("SELECT personID FROM PersonTable")
Will produce :
['personID'] 1 2 3
We can also select numerous fields of interest
c.execute("SELECT personID, username FROM PersonTable")
Which results in:
['personID', 'username'] 1 scaryCarrot 2 madonnaSpice 3 beyonceBroccoli
We can select them out of order
c.execute("SELECT birthyear, personID, username FROM PersonTable")
Which gives:
['birthyear', 'personID', 'username'] 1999 1 scaryCarrot 1998 2 madonnaSpice 1998 3 beyonceBroccoli
We can even select some fields numerous times
c.execute("SELECT birthyear, personID, username, personID FROM PersonTable")
Which gives:
['birthyear', 'personID', 'username', 'personID'] 1999 1 scaryCarrot 1 1998 2 madonnaSpice 2 1998 3 beyonceBroccoli 3
Sorting allows the whole table to be sorted by a single field. The sql keyword for sorting is ORDER BY.
c.execute("SELECT * FROM ResultsTable")
This gives:
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination'] 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6
And it can be stated whether to sort in ASCending or DESCending order. The default is ASCending order.
c.execute("SELECT * FROM ResultsTable ORDER BY personID")
Gives :
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination'] 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6
And:
c.execute("SELECT * FROM ResultsTable ORDER BY personID DESC")
Gives:
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination']
4 ../data/test4.txt 3 0 11 19 26 59 9 8 6
3 ../data/test3.txt 2 0 19 5 33 5 5 8 4
2 ../data/test2.txt 1 0 30 7 46 51 7 7 2
Sorting also allows for sorting the data by one field and then by another
c.execute("SELECT * FROM ResultsTable ORDER BY personID DESC, month ASC")
gives,
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination'] 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2
It is even possible to sort by a field that was not selected
c.execute("SELECT birthyear, personID, username FROM PersonTable")
Gives:
['birthyear', 'personID', 'username'] 1999 1 scaryCarrot 1998 2 madonnaSpice 1998 3 beyonceBroccoli
And
c.execute("SELECT birthyear, personID FROM PersonTable ORDER BY username")
Gives:
['birthyear', 'personID'] 1998 3 1998 2 1999 1
Another thing we can do is filter the results. Filtering allows many operations to be simplified. The SQL filtering keyword is WHERE
c.execute("SELECT * FROM ResultsTable WHERE personID = (?)",(1,))
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination']
2 ../data/test2.txt 1 0 30 7 46 51 7 7 2
Filtering commands are piped together with previous commands
c.execute("SELECT range FROM ResultsTable WHERE personID >= (?)",(1,))
Which gives:
['range']
7
8
8
The AND keyword allows you to filter on numerous fields
c.execute("SELECT * FROM ResultsTable WHERE personID >= (?) AND resultID> (?)",(1,2))
Which gives:
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination'] 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6
Aggregating can summarize many records into new ones can be done by taking a sum:
c.execute("SELECT SUM(discrimination) FROM ResultsTable")
['SUM(discrimination)']
12
It can also be done by taking a count:
c.execute("SELECT COUNT(personID) FROM ResultsTable")
['COUNT(personID)']
3
The JOIN keyword allows many tables to be joined:
c.execute("SELECT * FROM ResultsTable JOIN PersonTable")
This creates a joint table:
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination', 'personID', 'username', 'birthyear', 'birthmonth', 'gender', 'CI'] 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 1 scaryCarrot 1999 7 M 7 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 2 madonnaSpice 1998 10 F 14 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 3 beyonceBroccoli 1998 5 M 15 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 1 scaryCarrot 1999 7 M 7 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 2 madonnaSpice 1998 10 F 14 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 3 beyonceBroccoli 1998 5 M 15 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6 1 scaryCarrot 1999 7 M 7 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6 2 madonnaSpice 1998 10 F 14 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6 3 beyonceBroccoli 1998 5 M 15
Here we only return these joined rows if they share a field
c.execute("SELECT * FROM ResultsTable JOIN PersonTable ON \ PersonTable.PersonID = ResultsTable.PersonID")
Which gives:
['resultID', 'filename', 'personID', 'month', 'day', 'hour', 'minute', 'second', 'volume', 'range', 'discrimination', 'personID', 'username', 'birthyear', 'birthmonth', 'gender', 'CI'] 2 ../data/test2.txt 1 0 30 7 46 51 7 7 2 1 scaryCarrot 1999 7 M 7 3 ../data/test3.txt 2 0 19 5 33 5 5 8 4 2 madonnaSpice 1998 10 F 14 4 ../data/test4.txt 3 0 11 19 26 59 9 8 6 3 beyonceBroccoli 1998 5 M 15

