Databases
April 24, 2010: We are pleased to announce that Version 4 of this course is now under development. For updates and an early peek at the content, please check out the Software Carpentry blog at http://www.software-carpentry.org/blog/.
1) Introduction
- Text and XML have their place, but most of the data people really care about is stored in relational databases
- The bad news: it's a huge topic
- The documentation for most commercial databases would fill the entire room
- No matter what room you're in
- The good news: you only need to know a little to get most things done
- A few key ideas
- A little syntax
- [Fehily 2003] is a good tutorial and reference guide

2) You Can Skip This Lecture If...
- You know what a table is
- You know how to select data from a table
- You know how to aggregate data
- You know what a nested query is
- You know what primary and foreign keys are
- You know how to do inner and outer joins

3) History
- Originated with E. F. Codd's work in the late 1960s and early 1970s
- By the 1980s, [Oracle] and IBM's [DB2] dominated the market
- Open source alternatives like [MySQL] and [PostgreSQL] emerged in the 1990s
- Now have commercial support and competitive performance
- [SQLite] is a lightweight alternative for small jobs
- Originally designed for use in small web sites
- Stores entire database in a single file on disk (which simplifies backup and recovery)

4) When To Use A Database
- When you have lots of data
- When you need to ask complex questions
- E.g., "Find all experiments done with the Mark VII that had yields greater than 30%, that didn't use cadmium disulfide as a reagant"
- Relational database can answer these questions directly
- Many people try to use spreadsheets as simple databases
- Works for small data sets like course grades
- But they don't scale up, and search capabilities are primitive
- Increasingly common to store images, video clips, and other data
- Almost always store information about this data as well to support search and retrieval

5) Getting Started
- A database is a collection of zero or more tables, each of which:
- Has a name
- Stores a single relation (i.e., a set of information of a particular kind)
- Each table has a fixed set of named columns
- All the values in a column have the same type
- Each table has zero or more rows

6) Using SQL
- Interact with database management system (DBMS) using a specialized language called SQL
- Every vendor implements its own extensions to the standard
- Not case sensitive:
gravity, Gravity and GRAVITY are considered the same
- Three approaches:
- Use an interactive GUI
- Put commands in a file, and give it to the DBMS
- E.g.,
sqlite experiments.db < find_names.sql
- Have a program written in another language (such as Python or Java) send strings containing commands to the database manager

7) Creating Tables
- To create a table, specify its name, and the names and types of its columns
CREATE TABLE Person(
Login TEXT,
LastName TEXT,
FirstName TEXT
);
- To erase a table, use
DROP TABLE name
- Remember: back up early, back up often...
- To insert values into a table, specify the name of the tables, and the values to be inserted
- Each
INSERT creates a new row
- Rows do not have to be unique
INSERT INTO Person VALUES("skol", "Kovalevskaya", "Sofia");
INSERT INTO Person VALUES("mlom", "Lomonosov", "Mikhail");
INSERT INTO Person VALUES("dmitri", "Mendeleev", "Dmitri");
INSERT INTO Person VALUES("ivan", "Pavlov", "Ivan");

8) Simple Queries
- Suppose we want to get everyone's name and login ID
- Write a query that specifies what we want, and where to find it
SELECT Person.FirstName, Person.LastName, Person.Login FROM Person;
Sofia|Kovalevskaya|skol
Mikhail|Lomonosov|mlom
Dmitri|Mendeleev|dmitri
Ivan|Pavlov|ivan

9) Sorting
- How about sorting rows by login ID?
SELECT Person.FirstName, Person.LastName, Person.Login
FROM Person
ORDER BY Person.Login;
Dmitri|Mendeleev|dmitri
Ivan|Pavlov|ivan
Mikhail|Lomonosov|mlom
Sofia|Kovalevskaya|skol
- Note: some SQL commands are multi-word, such as
ORDER BY

10) Selection
- Frequently want only a subset of data
SELECT Experiment.ProjectId, Experiment.ExperimentId, Experiment.Hours
FROM Experiment
WHERE Experiment.Hours < 0;
1737|1|-1.0
1737|2|-1.5
- Use
WHERE to specify conditions that rows must satisfy to be included in results
- Works on each row independently: cannot be used to compare one row to another

11) Joins
- Project IDs aren't particularly readable
- Want to look up the corresponding names, and display those instead
- A join is a query that combines information from two or more tables
- Most common kind is an inner join, which matches rows of the first with rows of the second based on common values
- Other variants include cross join, outer join, and self join
- Conceptually, an inner join:
- Constructs the cross product of the tables
- Discards rows that don't meet the selection criteria
- Selects columns from the surviving rows

12) Example: Translating IDs
- Rewrite the previous query to replace project IDs with names
SELECT Project.ProjectName, Experiment.ExperimentId, Experiment.Hours
FROM Project INNER JOIN Experiment
WHERE (Project.ProjectId = Experiment.ProjectId)
AND (Experiment.Hours < 0);
Time Travel|1|-1.0
Time Travel|2|-1.5
- What just happened:
- Construct cross product of
Project and Experiment (which has 3x6=18 rows)
- Throw away rows for which the project IDs (reduces data to 6 rows)
- And rows for which hours are not negative (reduces data to 2 rows)
- Show project name, experiment ID, and experiment hours

13) Keys and Constraints
- One or more values in each record form its primary key
- A table may also contain one or more foreign keys
- A value (or set of values) in one table that identifies a record in another
- For example, the values in the
Login column in Involved identify records in the Person table
- Can (and should) specify such constraints explicitly, so that the DBMS can enforce them
CREATE TABLE Person(
Login TEXT NOT NULL,
LastName TEXT NOT NULL,
FirstName TEXT NOT NULL,
PRIMARY KEY (Login)
);
CREATE TABLE Experiment(
ProjectId INTEGER NOT NULL,
ExperimentId INTEGER NOT NULL,
NumInvolved INTEGER NOT NULL,
ExperimentDate DATE,
Hours REAL NOT NULL
CONSTRAINT Experiment_Key PRIMARY KEY (ProjectId, ExperimentId)
);

14) Eliminating Duplicates
- How to find out who has done experiments for each project?
SELECT Project.ProjectName, Involved.Login
FROM Project, Involved
WHERE Project.ProjectId = Involved.ProjectId;
Antigravity|mlom
Antigravity|mlom
Teleportation|dmitri
Teleportation|skol
Teleportation|ivan
Teleportation|mlom
Time Travel|skol
Time Travel|skol
Time Travel|ivan
- User
mlom appears twice for the Antigravity project because he did two experiments for it
- Use the
DISTINCT keyword to eliminate duplicates
SELECT DISTINCT Project.ProjectName, Involved.Login
FROM Project, Involved
WHERE Project.ProjectId = Involved.ProjectId;
Antigravity|mlom
Teleportation|dmitri
Teleportation|skol
Teleportation|ivan
Teleportation|mlom
Time Travel|skol
Time Travel|ivan

15) Aggregation
- Often need to aggregate (combine) values from different rows
- Sum, maximum, average, etc.
- Example: how much time has Mikhail spent on antigravity experiments?
SELECT SUM(Experiment.Hours)
FROM Involved INNER JOIN Experiment
WHERE (Involved.Login = "mlom")
AND (Involved.ProjectId = 1214)
AND (Involved.ProjectId = Experiment.ProjectId)
AND (Involved.ExperimentId = Experiment.ExperimentId);
15.8

16) Grouping
- It would be tedious to write a separate query to total each scientist's hours
- SQL doesn't have loops
- Although some vendors provide non-standardized equivalents
- Use
GROUP BY to apply aggregation function to specific subsets of rows
SELECT Involved.Login, SUM(Experiment.Hours)
FROM Involved INNER JOIN Experiment
WHERE (Involved.ProjectId = Experiment.ProjectId)
AND (Involved.ExperimentId = Experiment.ExperimentId)
GROUP BY Involved.Login;
dmitri|7
ivan|5.5
mlom|23.0
skol|4.5
- Note: negative hours on time travel experiments really mess up budgeting...

17) Self Joins
- How to find people who have done experiments for two (or more) projects?
- First attempt: use
AND
SELECT DISTINCT Person.Login
FROM Person INNER JOIN Involved
WHERE (ProjectId = 1214) AND (ProjectId = 1709);
- Doesn't work because
ProjectID cannot simultaneously be 1214 and 1709
- Second attempt: use
OR
SELECT DISTINCT Person.Login
FROM Person INNER JOIN Involved
WHERE (ProjectId = 1214) OR (ProjectId = 1709);
skol
mlom
dmitri
ivan
- Doesn't work because it includes rows where information about different people has been joined
skol
mlom
dmitri
ivan

18) Using Self Joins
- Right solution that works is to join the
Involved table with itself, so that we have two project IDs in the same row
- Then select rows where the person is the same, but the project IDs are different
- Have to create a temporary alias for the two versions of the tables
SELECT DISTINCT A.Login
FROM Involved A CROSS JOIN Involved B
WHERE (A.Login = B.Login)
AND (A.ProjectId != B.ProjectId);
mlom
skol
ivan

19) Who Has Worked Together?
- Which pairs of people have performed experiments together?
SELECT DISTINCT A.Login, B.Login
FROM Involved A CROSS JOIN Involved B
WHERE (A.ProjectId = B.ProjectId)
AND (A.ExperimentId = B.ExperimentId)
AND (A.Login != B.Login);

20) Null
- Real-world data always has holes in it
- Some people don't have cell phone numbers, some authors' birth dates are unknown...
- Can represent this in a database using the special value
NULL
NULL is not the same as zero, empty string, False, etc.
- Instead, it means "nothing known at all"
- Database designers argue about whether
NULL is a good idea or not
- Does it mean "no value", "value not known", or something else?

21) Operations on Nulls
- Check to see if a value is null using
IS NULL
- The result of any computation involving
NULL is NULL
2 + NULL is NULL, NULL OR True is NULL, etc.
- Although in some databases,
False AND NULL is False, and True OR NULL is True

22) Managing Nulls
- By default, columns may contain
NULL, but this can be prohibited when the table is created
- Queries must take possibility of
NULL into account
Experiment.ExperimentDate <> 1901-05-01 selects all experiments that weren't conducted on May 1, 1901, and all experiments whose date is NULL (since NULL isn't equal to anything except itself)
- Have to use
(Experiment.ExperimentDate <> 1901-05-01) AND (Experiment.ExperimentDate IS NOT NULL)

23) Database Design
- Database design is a sizeable topic in its own right
- Make sure the relationships are correct
- Make sure the database performs well
- Very dependent on exactly which vendor database is being used
- But all commercial-grade DBMSes have powerful optimizers
- Most important thing is to normalize the data

24) First Normal Form
- First normal form: values do not have any internal structure
- I.e., you shouldn't have to parse them in order to use them
- This is why first names and last names are stored separately

25) Second Normal Form
- Second normal form: tables don't contain redundant information

26) Nested Queries
- How to find everyone who hasn't been experimenting with time travel?
- Select rows of
Involved where ProjectID is not 1737?
- Wrong answer: Kovalevskaya and Pavlov have worked on time travel, but also on other projects
SELECT DISTINCT Involved.Login
FROM Involved
WHERE (Involved.ProjectId != 1737);
mlom
dmitri
skol
ivan
- Solution requires use of nested queries
- Database manager runs the inner query first, then applies the outer query to the inner query's result

27) Nested Query Example
- Strategy:
- Nested query finds all the people we don't want
- Outer query subtracts them from the set containing everyone
SELECT DISTINCT Login
FROM Involved
WHERE Login NOT IN
(SELECT DISTINCT Login
FROM Involved
WHERE Involved.ProjectId = 1737);
mlom
dmitri

28) More Uses for Nested Queries
- This strategy is useful for many other things as well
- Example: how many people have done experiments for exactly one project?
- Solution: find the people who've done experiments for none, or for two or more, and subtract them from everyone
SELECT DISTINCT Login
FROM Involved
WHERE Login NOT IN
(SELECT DISTINCT A.Login
FROM Involved A INNER JOIN Involved B
WHERE (A.Login = B.Login)
AND (A.ProjectId != B.ProjectId));
dmitri

29) Using Other Languages
- Usually don't write entire application in SQL, or run SQL in sub-shell
- Instead, embed SQL in the programming language of your choice
- Need the right driver to connect to the database
- Establish a connection between the program and the DBMS
- Typically a socket, but other methods are used as well
- Create a pointer into the database called a cursor
- Send queries, and loop over results

30) Example: Database Access from Python
- Example: get the names of all the scientists into a Python program
from pysqlite2 import dbapi2 as sqlite
connection = sqlite.connect("example.db")
cursor = connection.cursor()
cursor.execute("SELECT FirstName, LastName FROM Person ORDER BY LastName;")
results = cursor.fetchall();
for r in results:
print r
cursor.close();
connection.close();
("Sofia", "Kovalevskaya")
("Mikhail", "Lomonosov")
("Dmitri", "Mendeleev")
("Ivan", "Pavlov")

31) Concurrency
- The biggest challenge in database programming isn't formulated queries---it's handling concurrency
- Two or more things happening at once
- In the database world, one user changing the database while another is making a query
- Need to prevent race conditions, in which the final state of the system depends on the random order of operations
- Also need to guard against failure
- Step 1: remove $100.00 from grant #19823
- Step 2: add $100.00 to grant #17928
- Don't want money to disappear if computer goes down in between

32) Transactions
- Solution to both problems is to use a transaction
- A set of operations which either all take effect as if nothing else was going on, or do not change the database
- Transactions must be ACID:
- Atomic: either all are performed, or none
- Consistent: database is in a legal state when the transaction ends
- Isolated: no operation outside the transaction sees the database in any intermediate state
- Durable: once the user is notified that the operation has completed, its effects are permanent

33) Example: Changing User ID
- Change Kovalevskaya's login ID from
"skol" to "kovalev"
BEGIN TRANSACTION;
UPDATE Person
SET Login = "kovalev"
WHERE Login = "skol";
UPDATE Involved
SET Login = "kovalev"
WHERE Login = "skol";
END TRANSACTION;
SELECT *
FROM Person
WHERE (Login = "kovalev") OR (Login = "skol");
SELECT *
FROM Involved
WHERE (Login = "kovalev") OR (Login = "skol");
kovalev|Kovalevskaya|Sofia
1709|1|2|kovalev
1737|1|1|kovalev
1737|2|1|kovalev
- No query can run after
Person changes, but before Involved changes
- If database goes down in the middle, any changes made are discarded

34) Using Transactions
- Transactions can be used for queries, but should always be used for updates
- Why not use transactions everywhere?
- Because they require the database to serialize some operations
- Which slows the system down

35) Testing
- Unit testing programs that use databases is just like testing other programs...
- ...except slower
- Creating a fresh fixture for each test means erasing and re-creating the entire database
- If it takes two seconds to run each unit test, developers are not going to re-run 1000 tests after each small program change
- Solutions:
- Create the fixture once, and clone it each time it's needed
- Still kind of slow
- Store the database in memory, rather than on disk
- Not very useful in a production system (since all data is lost when the program ends)
- But much, much faster

36) Advanced Topics
- A stored procedure is a piece of compiled code stored in the database itself
- Trades flexibility for efficiency
- A trigger is a procedure that automatically runs when a table is modified
- E.g., send mail whenever a new dataset is entered
- And then there's the problem of ensuring referential integrity
- I.e., that references within and between database entries are consistent

37) Summary
- The world isn't made of tables any more than it's made of lists
- But sooner or later, every home-grown spreadsheet or XML substitute needs the capabilities of a DBMS
- Even simple queries can get you into trouble if they miss important data
- Or include data they shouldn't
- Treat queries written in SQL with the same respect you'd give any other program
