Welcome to the Software Carpentry screencast on Databases. This
screencast is an introduction to databases.
In this screencast we’ll try to give you a sense of what databases are and
what you can do with them.
What is a database?
A database is a way to store and manipulate information that is arranged
as tables.
Each table has columns (also known as fields) which describe the data, and
rows (also known as records) which contain the data.
This table you see here is a log of all of the work done on experiments in
a research lab, broken down by project and scientist
Each record in the table describes an Experiment. There is a field for
loginID of the scientist running the experiment, the project, experiment
name, hours spent on the experiment, and date the experiment was run.
Speaking Database: SQL
In a spreadsheet, you insert formulas or new sheets to analyse your data.
In a database, you give commands, also known as Queries, the database does
the analysis your query specifies, and returns the results in a tabular
form.
Queries you give are written in a simple, english-like, language
called SQL, which stands for “Structured Query Language”
Dataflow: the building blocks of queries.
SQL is a vast language that provides all sorts of ways of mixing and
remixing your data.
In this lecture we’ll assume you already have a database, and so we’ll
only be discussing queries that extract and analyse data.
Broadly speaking though, there are only six basic types of operations you
can do with a database.
Each operation extracts or transforms data from a previous operation, to
form a pipeline or flow of data that ends with the results you get back.
We’re going to give you a preview of the basic database operations now in
order to give you a flavour of what databases can do.
In the other database screencasts we’ll show you this all in much more
detail, along with showing you how to accomplish common analysis tasks.
SELECT
So, let’s begin with the data we saw in the first slide.
The SELECT operation simply extracts various fields from a table or from
the results of other another query.
This is obviously a fundamental operation: if you want to get data out of
the database you’ll need to use the SELECT operator to tell the database
where to find the data you’re interested in.
APPEND
The APPEND operation adds new, calculated fields to the results, like
rounding the values in another column.
SORT
the SORT operation orders the results by the values in a field.
FILTER
The FILTER operation chooses records to include based on TRUE or FALSE
conditions
So, here is an illustration of a query that filters for records that have
Hours values of 5 or greater.
AGGREGATE
the AGGREGATE operation summarises groups of records into new records.
This is useful for calculating the SUM, MAX, MIN, Average or a count of
records that share a unique set of fields.
Here we see the Hours being totalled for each project.
JOIN
Suppose we had a Person table that shows more details on each of the
scientists.
The JOIN operation joins two (or more) tables together by combining
records based on TRUE or FALSE conditions.
We could use this operation to return results from the Experiment table
with the full names of the corresponding scientists from the Person table:
Why/when use a database?
A database is the right tool for managing complex and structured data,
that is spread over many different tables.
Databases are also designed to work quickly on very large data sets –
much larger than can comfortably be managed with a spreadsheet.
Queries allow for great flexibility in how you are able to analyse your
data. This makes databases a good choice for if you need to explore or
mix and remixing it in many different ways.
Unlike spreadsheets, databases do not typically have built in charting and
visualising tools. But, it’s always possible to export the results of a
query to be used in other tools.
Conclusion
So, summing up:
A database is a set of tables of data on which you can explore and
manipulate using queries.
We’ve seen six basic kinds of operations that you can do on the data in a
database: SELECT, APPEND, FILTER, AGGREGATE, JOIN, SORT.
In this screencast we’ve tried to give you a flavour of the sorts of
remixing you can do by showing you the basic operations.
Databases are the right tool for managing large amounts of data, or data
that is complex. They are useful when you need to remix and explore your
data in different ways.