Standard SQL defines the following aggregate functions: COUNT, SUM, MAX, MIN, AVG (average/mean). Access also defines two others, STDEV (std. deviation) and VAR (variance).
Count can be used for any data: it simply counts the number of rows. MAX and MIN can be used for any kind of simple data (dates, numbers, strings): older dates are "smaller" than more recent ones, and a string beginning with A is "smaller" than one beginning with B. The other functions make sense only with numeric data (though AVG is also meaningful for dates). For example,
SELECT Count(*) from Patients WHERE State='CT'
returns the number of patients based in Connecticut, (* is short for 'number of records' - you could have used any field name instead) while
SELECT Min (DateOfBirth) from Patients
identifies the oldest (smallest) birth date in this table.
The above results could be obtained in Access's user interface without composing a query. The real strength of the aggregate functions is that you can compute grouped aggregates; to introduce grouping, you use the GROUP BY clause. Thus,
SELECT State, Count(*) from Patients Group by State
would give a list of states in the Patients table, with the counts of patients for each state.
You can do multi-level groupings. For example, in a Sales table which recorded salesperson, customer, dateOfSale, city, region, and the amount (dollar value) of each sale, you could do grouping by salesperson, year and month to get monthly performance figures:
Select Salesperson, Year(dateOfSale), Month(dateOfSale), sum(Amount) from Sales group by Salesperson, Year(dateOfSale), Month(dateOfSale)
(The functions Year and Month are part of Access VBA: they operate on a date and return the year and month part of the date respectively- the latter as a number between 1 and 12.)
When you use a GROUP BY, only two kinds of columns can appear in the output:
If you want to combine this information with other information, you must create two queries: one performs the aggregation, the other query includes this one as well as another table (or even the same one), with the other columns that you wish to look at.
For the arithmetic aggregates, you are not limited to operating only on a field. You can use any expression. For example, let's say that you have a OrderItems table in a mail orders database, with the following columns: OrderNumber, ItemNumber (a serial number within the order), a packaging size per unit (in pounds) and the number of units purchased.
(The OrderNumber column is a foreign key into an Orders table, which has information such as Customer, Shipping Address, Date Ordered, etc. The PackagingSize information is copied from a Products table. These details, however, are unimportant for the present query.)
The following query would find the total weight for each order (which is important when calculating shipping costs):
SELECT OrderNumber, sum(PackagingSize * NumberOfUnits) from OrderItems group by OrderNumber
Having is used to restrict the output of a grouping based on a condition; it always involves an aggregate function in the condition. Thus,
SELECT State, Count(*) from Patients group by State having count(*) >= 1000
would eliminate records of states having fewer than 1000 patients.
The expression in the HAVING clause need not necessarily be the same as in the column list. Thus, in the sales example, if we wanted to restrict our output to salespersons making at least 10 sales per month, we would use
Select Salesperson, Year(dateOfSale), Month(dateOfSale), sum(Amount) from
Sales
group by Salesperson, Year(dateOfSale), Month(dateOfSale)
having count(*) >= 10
Let's modify the above sales query to give us performance figures only for sales made in the NorthEast region.
Select Salesperson, Year(dateOfSale), Month(dateOfSale), sum(Amount) from
Sales
where Region = 'NorthEast'
group by Salesperson, Year(dateOfSale), Month(dateOfSale)
having count(*) >= 10
The way the combination works is as follows:
All of these exercises are based on a database on NBA Player Statistics, that you can download from HERE. It has a single table called PlayerData, with aggregate results for 14 well known players (from Bob Cousy to LeBron James), obtained from the Web site www.databasebasketball.com/
Ex. 1: This table does not have a primary key. Create one and make sure that it is correct by trying to save your table. If you get an error the first time, look at the data and determine why this is the case (use the Find Duplicates Query Wizard).
Ex. 2: Find the youngest age at which each player in the table started playing professionally. Determine the youngest player overall.
Ex. 3: Determine each player's career totals in: points, rebounds, steals, blocks, three points made (in a single query).
Ex.3a. Repeat query 3, limiting the output to data where the player was healthy and in his prime (that is, age was between 20 and 38, inclusive, and number of games played per season was at least 50.
More difficult queries (you may require more than one intermediate query):
Ex.3b. Repeat query 3, limiting the results to players who have played for 10 or more years during their career.
Ex 4: Determine, for each player, the career points per game, the career free throw percentage, the career three-point percentage, the career minutes per game. You may encounter strange results for 3 point percentage for some players: figure out why this is happening. From the query, determine the player with the best and worst for each statistic.
Ex. 5: For each player, create a query that determines the year in which he had the best record, in terms of average points per game, and show this average. Sort the players in descending order of this average.
Download the answers database by clicking HERE. (Try to work out the problems before peeking at the answers...)