Up: Databases

Aggregation

Welcome

  • Welcome to the Software Carpentry screencast on Databases.
  • In this screencast, we will show you how to combine values from multiple
    rows.
  • This is called “aggregation”

Aggregation Functions

  • Suppose we wanted to know how many hours were spent on all experiments
    so far
  • First, we need to fetch all the hours in the table, which can write like this:
    SELECT Hours FROM Experiment;
  • To add up all the hours, all we need to do is apply the SUM function to
    the Hours column. So we write,

    SELECT SUM(Hours) FROM Experiment;
  • SUM is one of the aggregation functions in SQL
  • In their simplest form, aggregation functions are applied to all rows
    fetched by a query and they reduce the result to a single row
  • MAX, MIN and AVG are also aggregation functions
    SELECT SUM(Hours), MAX(Hours), MIN(Hours), AVG(Hours) FROM Experiment;
  • You can use them like SUM and they’ll do what you expect: compute the
    maximum, minimum, and average of your query results.
  • Another handy aggregation function is COUNT. If you would like to know
    how many records are returned by a query,

    SELECT COUNT(*) FROM Experiment;

    The use of the asterisk, or *, here is an idiom that is used because we
    are interested in counting the number of records, not of counting
    anything in a particular column.

Aggregation Pitfalls: Default aggregation

  • What if we want the total number of hours each scientist has worked so
    far?
  • One way is to use a WHERE clause to single out specific scientists. We
    would write,

    SELECT SUM(Hours) FROM Experiment WHERE LoginID = 'mlom';
  • The problem with this approach is that we have to write one query for
    each scientist. We only have a few scientists in our table, but imagine
    if we had hundreds.
  • What we want to be able to is have the database return a row for each
    scientist, and include a sum of the hours they have worked, like so:

    SELECT LoginID, SUM(Hours) FROM Experiment;

    But this query returns a single row, not one for each scientist. And
    why does it return that particular LoginID?

  • Let’s take a look at the same query without the SUM aggregate function:
    SELECT LoginID, Hours FROM Experiment;
  • When we used SUM, the database was collapsing these rows by summing the
    Hours column, but since we haven’t specified a aggregation function for
    LoginID, the database just picks an arbitrary LoginID and returns it.
  • If your query selects fields directly from a table and aggregates at the
    same time, the values for unaggregated fields can be any value in the
    records being aggregated.

Grouping

  • So, if we want the total number of hours each scientist has worked so
    far we need to tell the database to aggregate the hours for each
    scientist separately.
  • We do this by using the GROUP BY clause.
    SELECT LoginID, Hours FROM Experiment
    GROUP BY;

    We want our aggregation done for each scientist, so we group by the
    LoginID,

    SELECT LoginID, Hours FROM Experiment
    GROUP BY LoginID;

    and for each group we display the loginID and the SUM of the hours of
    that group:

    SELECT LoginID, SUM(hours) FROM Experiment
    GROUP BY LoginID;
  • So, this query tells the database to group all of the rows that have the
    same LoginID together, and then do the aggregation for each of those
    groups separately.
  • Since we’re grouping by LoginID, we know that the rows in each group
    have the same LoginID, so it’s safe to select the LoginID column here
    without getting unexpected results.
  • If we want, we can group by multiple criteria at once. So, for example,
    if we wanted the number of hours each scientist had spent on each
    project, we would group by both loginID and Project:

    ...
    GROUP BY LoginID, Project;

    And then add the ProjectID to the SELECT clause so that we know which
    project the hours belong to:

    SELECT LoginID, Project, SUM(Hours) FROM Experiment
    ...
  • The GROUP BY clause here specifies that all of the rows that have the
    same LoginID and Project name are grouped together, and then the
    sum is done for each of those groups.
  • The other aggregation functions work in the same way. So, for instance,
    to calculate how many experiments each scientist has done for each
    project, we add COUNT to our SELECT clause,

    SELECT LoginID, Project, SUM(Hours), COUNT(*) FROM Experiment
    GROUP BY LoginID, Project;

Sorting

  • Sorting and filtering can also be done on queries that aggregate data.
  • For example, If we want the total time spent on each project sorted by
    project name, we would write:

    SELECT Project, SUM(Hours) FROM Experiment

    to select the project name and total number of hours worked, and we
    want this for each project,

    GROUP BY Project

    and we want to sort the results by the project name

    ORDER BY Project ASC;
  • The ORDER BY clause always goes after the GROUP BY clause,
    because we are ordering /the results/ of aggregation — it wouldn’t make
    any difference to order the data before it was aggregated.
  • What if we wanted to sort the results by the number of hours spent?
  • Instead of using a plain field to sort on, like Project, we can use an
    aggregate function as our sorting criterion.

    ORDER BY SUM(Hours) ASC;
  • So, here we are sorting the results, not by a field from the Experiments
    table — the raw data — but by the results of an aggregation function
    computed on the data from the Experiments table.

Filtering

  • What if you want to remove the negative hours, and only add up the
    positive values?
  • You can do this by adding a WHERE clause to our query to filter out
    values you don’t want before they are grouped and aggregated
  • In this case, the query is
    SELECT Project, SUM(Hours) FROM Experiment
    WHERE Hours >= 0
    GROUP BY LoginID, Project
    ORDER BY SUM(Hours) ASC;
  • Notice that we can read the query as a series steps in the same order
    they are written. First, the data is selected from the table, those
    results are filtered by the WHERE clause, then what is left over is
    aggregated, and finally the aggregated results are sorted.

Conclusion

  • This query nicely summarises everything we’ve covered in this lecture.
  • In this screencast we’ve demonstrated:
    • how aggregation functions, like SUM and COUNT, can be used to
      perform calculations on multiple rows
    • how to group your data and aggregate over those groups using the
      GROUP BY clause
    • and how filtering and sorting can also be used in conjunction with
      aggregation.

  1. Javad
    November 1st, 2010 at 18:15 | #1

    Hi,

    Is SQL case-sensitive?
    Because in this course it can be seen that ” LoginID= ‘best’ ” performed one project (‘Insulin’ or ‘insulin’) but it is considered as different projects.

    Thanks

  2. Ainsley
    November 2nd, 2010 at 18:56 | #2

    @Javad
    SQL is case-sensitive when matching strings, but it is case-INsensitive for the other parts of the query.

    For example, you could write the query “sEleCt * fRoM exPeriMent;” and it would run just fine.

    But when you are looking at the actual values in the table, the case matters. So something like

    select * from experiment where loginid=”IVAN”;

    will return no results, since there is no one with the LoginID “IVAN”. Changing it to

    select * from experiment where loginid=”ivan”;

    will get you the result you want.

  1. July 6th, 2010 at 00:03 | #1