Aggregation
Welcome
- Welcome to the Software Carpentry screencast on Databases.
- In this screencast, we will show you how to combine values from multiple
rows. - This is called “aggregation”
Aggregation Functions
- Suppose we wanted to know how many hours were spent on all experiments
so far - First, we need to fetch all the hours in the table, which can write like this:
SELECT Hours FROM Experiment;
- To add up all the hours, all we need to do is apply the SUM function to
the Hours column. So we write,SELECT SUM(Hours) FROM Experiment;
- SUM is one of the aggregation functions in SQL
- In their simplest form, aggregation functions are applied to all rows
fetched by a query and they reduce the result to a single row - MAX, MIN and AVG are also aggregation functions
SELECT SUM(Hours), MAX(Hours), MIN(Hours), AVG(Hours) FROM Experiment;
- You can use them like SUM and they’ll do what you expect: compute the
maximum, minimum, and average of your query results. - Another handy aggregation function is COUNT. If you would like to know
how many records are returned by a query,SELECT COUNT(*) FROM Experiment;
The use of the asterisk, or *, here is an idiom that is used because we
are interested in counting the number of records, not of counting
anything in a particular column.
Aggregation Pitfalls: Default aggregation
- What if we want the total number of hours each scientist has worked so
far? - One way is to use a WHERE clause to single out specific scientists. We
would write,SELECT SUM(Hours) FROM Experiment WHERE LoginID = 'mlom';
- The problem with this approach is that we have to write one query for
each scientist. We only have a few scientists in our table, but imagine
if we had hundreds. - What we want to be able to is have the database return a row for each
scientist, and include a sum of the hours they have worked, like so:SELECT LoginID, SUM(Hours) FROM Experiment;
But this query returns a single row, not one for each scientist. And
why does it return that particular LoginID? - Let’s take a look at the same query without the SUM aggregate function:
SELECT LoginID, Hours FROM Experiment;
- When we used SUM, the database was collapsing these rows by summing the
Hours column, but since we haven’t specified a aggregation function for
LoginID, the database just picks an arbitrary LoginID and returns it. - If your query selects fields directly from a table and aggregates at the
same time, the values for unaggregated fields can be any value in the
records being aggregated.
Grouping
- So, if we want the total number of hours each scientist has worked so
far we need to tell the database to aggregate the hours for each
scientist separately. - We do this by using the GROUP BY clause.
SELECT LoginID, Hours FROM Experiment GROUP BY;
We want our aggregation done for each scientist, so we group by the
LoginID,SELECT LoginID, Hours FROM Experiment GROUP BY LoginID;
and for each group we display the loginID and the SUM of the hours of
that group:SELECT LoginID, SUM(hours) FROM Experiment GROUP BY LoginID;
- So, this query tells the database to group all of the rows that have the
same LoginID together, and then do the aggregation for each of those
groups separately. - Since we’re grouping by LoginID, we know that the rows in each group
have the same LoginID, so it’s safe to select the LoginID column here
without getting unexpected results. - If we want, we can group by multiple criteria at once. So, for example,
if we wanted the number of hours each scientist had spent on each
project, we would group by both loginID and Project:... GROUP BY LoginID, Project;
And then add the ProjectID to the SELECT clause so that we know which
project the hours belong to:SELECT LoginID, Project, SUM(Hours) FROM Experiment ...
- The GROUP BY clause here specifies that all of the rows that have the
same LoginID and Project name are grouped together, and then the
sum is done for each of those groups. - The other aggregation functions work in the same way. So, for instance,
to calculate how many experiments each scientist has done for each
project, we add COUNT to our SELECT clause,SELECT LoginID, Project, SUM(Hours), COUNT(*) FROM Experiment GROUP BY LoginID, Project;
Sorting
- Sorting and filtering can also be done on queries that aggregate data.
- For example, If we want the total time spent on each project sorted by
project name, we would write:SELECT Project, SUM(Hours) FROM Experiment
to select the project name and total number of hours worked, and we
want this for each project,GROUP BY Project
and we want to sort the results by the project name
ORDER BY Project ASC;
- The ORDER BY clause always goes after the GROUP BY clause,
because we are ordering /the results/ of aggregation — it wouldn’t make
any difference to order the data before it was aggregated. - What if we wanted to sort the results by the number of hours spent?
- Instead of using a plain field to sort on, like Project, we can use an
aggregate function as our sorting criterion.ORDER BY SUM(Hours) ASC;
- So, here we are sorting the results, not by a field from the Experiments
table — the raw data — but by the results of an aggregation function
computed on the data from the Experiments table.
Filtering
- What if you want to remove the negative hours, and only add up the
positive values? - You can do this by adding a WHERE clause to our query to filter out
values you don’t want before they are grouped and aggregated - In this case, the query is
SELECT Project, SUM(Hours) FROM Experiment WHERE Hours >= 0 GROUP BY LoginID, Project ORDER BY SUM(Hours) ASC;
- Notice that we can read the query as a series steps in the same order
they are written. First, the data is selected from the table, those
results are filtered by the WHERE clause, then what is left over is
aggregated, and finally the aggregated results are sorted.
Conclusion
- This query nicely summarises everything we’ve covered in this lecture.
- In this screencast we’ve demonstrated:
- how aggregation functions, like SUM and COUNT, can be used to
perform calculations on multiple rows - how to group your data and aggregate over those groups using the
GROUP BY clause - and how filtering and sorting can also be used in conjunction with
aggregation.

Hi,
Is SQL case-sensitive?
Because in this course it can be seen that ” LoginID= ‘best’ ” performed one project (‘Insulin’ or ‘insulin’) but it is considered as different projects.
Thanks
@Javad
SQL is case-sensitive when matching strings, but it is case-INsensitive for the other parts of the query.
For example, you could write the query “sEleCt * fRoM exPeriMent;” and it would run just fine.
But when you are looking at the actual values in the table, the case matters. So something like
select * from experiment where loginid=”IVAN”;
will return no results, since there is no one with the LoginID “IVAN”. Changing it to
select * from experiment where loginid=”ivan”;
will get you the result you want.