Performing Result Set Aggregate Calculations

All of the SELECT statements shown thus far have produced detail rows where each row of the result set corresponds to a single row from the table (a base table or table formed from the set of joined tables in the FROM clause). There are often times when you want to perform a calculation on one or more columns from a related set of rows returning only a summary row that includes the calculation result. The set of rows over which the calculations are performed is called the aggregate. The SELECT statement GROUP BY clause is used to identify the column or columns that define each aggregate—those rows that have identical group by column values. Five built-in aggregate functions are provided in SQL as defined in the table below.

Built-in Aggregate Functions Table
Function Description
COUNT Returns the number (distinct) of rows in the aggregate.
SUM Returns the sum of the (distinct) values of expression in the aggregate.
AVG Returns the average of the (distinct) values of expression in the aggregate.
MIN Returns the minimum expression value in the aggregate.
MAX Returns the maximum expression value in the aggregate.
EVERY Returns true if cond_expr is true for every row in the aggregate.
ANY/SOME Returns true if cond_expr is true for at least one row in the aggregate.
INVAR Checks that the value of expressions is invariant (i.e., does not change) within each aggregate.
VAR_POP Computes the population variance over each aggregate.
var_samp Computes the sample variance over each aggregate.
stddev_pop Computes the population standard deviation over each aggregate.
stddev_samp Computes the sample standard deviation over each aggregate.

The complete syntax for the SELECT statement including GROUP BY is as follows.

select_stmt:
          SELECT [DISTINCT] select_item[, select_item]...
               FROM table_ref [, table_ref]…  
               [WHERE conditional_expr]
               [GROUP BY sort_col [, sort_col]… [HAVING conditional_expr] ] 
               [ORDER BY sort_col [ASC | DESC] [, sort_col [ASC | DESC]]…]
table_ref:
          table_spec | table_join
sort_col:
          num | column_var
table_join:
          table_ref NATURAL [INNER | {LEFT | RIGHT} [OUTER]] JOIN table_primary
     |    table_ref [INNER | {LEFT | RIGHT} [OUTER]] JOIN table_primary
               [USING ( column_name[, column_name]...) | ON conditional_expr]
select_item:
          named_expr
     |    [table_name.]*
named_expr:
          [HIDE] {column_varvalue_expr} [[AS] alias_name]
arith_expr:                                              /* involving only numeric operands and operations */
          operand [{+ | - | * | /} operand]…
operand:
          constant | column_var | numeric_function | aggregate_fcn
     |    {+ | -} operand 
aggregate_fcn:
          calc_fcn_name ( [DISTINCT] arith_expr ) 
     |    COUNT{* | column_var })
     |    {EVERY | ANY | SOME} ( conditional_expr )
     |    {MIN | MAX | INVAR} (value_expr)
     |    agg_udf_name( [DISTINCT] value_expr)
calc_fcn_name:
          SUM | AVG | VAR_SAMP | VAR_POP | STDDEV_SAMP | STDDEV_POP

To illustrate the basic operation of aggregate calculations, consider the following example which computes the total sales for each bookshop account manager.

select name, count(*), sum(price)
    from (acctmgr join patron using(mgrid)) natural join sale natural join book
    group by 1;

NAME                        COUNT(*)              SUM(PRICE)
Doel, Frank                        5                   31745
Fox, Joe                          19                   95500
Kelly, Kathleen                   14                   67350
Kralik, Alfred                    18                   72685
Noble, Barney                      6                  234700
Novac, Klara                      21                  221650
Zonn, Amy                          9                   15660

The FROM clause needs a little explanation. A natural join between acctmgr and patron cannot be used because besides the mgrid column which is the correct join column both tables have a column called name which is not a legitimate join column as they never contain the same value. So the USING clause is specified to identify the particular common column name on which to form the join.

The count(*) give the number of detail rows (i.e., sold books) in the aggregate for each account manager. The sum(price) gives the total of all of the price values in the aggregate for each account manager.

You can see all of the detail rows that were used in the aggregate calculations by issuing the following query.

select name, price 
    from (acctmgr join patron using(mgrid)) natural join sale natural join book 
    order by 1;

NAME                            PRICE
Doel, Frank                     25000
Doel, Frank                       750
Doel, Frank                      2500
Doel, Frank                       995
Doel, Frank                      2500
Fox, Joe                         3500
Fox, Joe                        12500
Fox, Joe                          750
Fox, Joe                         1200
	...
Zonn, Amy                        1250
Zonn, Amy                        1200
Zonn, Amy                        4375
Zonn, Amy                         750
Zonn, Amy                         325

Figure 7 illustrates how aggregate calculations are performed on the detail rows that are retrieved.


Figure 7. Group By Aggregate Calculations

The WHERE clause is used to restrict the detail rows that form the aggregate to only those rows for which the WHERE conditional expression is true. The HAVING clause is used to restrict result rows that contain the summary aggregate calculations to only those result rows for which the HAVING conditional expression is true.

The following query returns those western state sponsors which have received more than 3/4 billion dollars in funding from the NSF between the years shown.

select 
    name, 
    invar(state) "state", 
    year(min(award_date)) first_year, 
    year(max(award_date)) last_year, 
    convert(sum(amount), char, 15, "$#,#") funding_total
from sponsor, award where sponsor_nm = name 
 	and state in ("AZ","CA","CO","ID","MT","NM","NV","OR","UT","WA","WY")
group by name
having sum(amount) > 750000000.00;

name                       western state first_year last_year   funding_total
California Inst of Tech    CA                  1990      2003    $900,522,261
UCAR                       CO                  1992      2003    $781,735,361

Figure 8 below shows how this query is processed.


Figure 8. Aggregate Computation with Having Clause

NSF Gender Study Example

The next example is from the NSF awards database. This is a rather involved example that shows how you can use SQL to do analytical studies based on historical data contained in a database. The conclusions that are given are the author's own based on his interpretation of the results of the queries given below.

The person table contains a list of all of the individual research investigators (jobclass = "I") and NSF program managers (jobclass = "P"). The gender of each person was not included in the original data but was deduced from the person's first name based on a modified version of the list of names available from the following web site:

http://www.gpeters.com/names/baby-names.php?report=pop_all&showcount=10000

Not all first names in the person table were in this list and hence the gender could not be deduced. Thus, the gender column values can be "M", "F", or "U". You can issue the following queries to see the totals for each gender.

select count(*) from person where gender = "M";

   COUNT(*)
      57386

select count(*) from person where gender = "F";

   COUNT(*)
      17537

select count(*) from person where gender = "U";

   COUNT(*)
      10983

Alternatively, the next query can be used to compute the same results in one pass through the person table.

select sum(if(gender="F",1,0)) female, 
       sum(if(gender="M",1,0)) male, 
       sum(if(gender="U",1,0)) unknown from person;

     FEMALE        MALE     UNKNOWN
      17537       57386       10983

It might be interesting to see what difference there is between the ratio of male to female investigators and the ratio of male to female program managers. The following query uses a group by to group the totals by jobclass.

select jobclass, sum(if(gender="F",1,0)) female, sum(if(gender="M",1,0)) male 
    from person where gender != "U" 
    group by 1;


JOBCLASS      FEMALE        MALE
I              17197       56813
P                340         573

The ratio of male to female investigators is 3.3 while the ratio for program managers is 1.7. Assuming that the program managers are NSF employees, it appears that, on a percentage basis, they hire significantly more women to oversee NSF research grants than women to whom they award the grants.

To see if there is any trend in the percentage of women granted NSF awards, you can issue the query below to see the percentage of women who were awarded NSF grants by year.

select year(award_date), 100.*sum(if(gender="F",1,0))/count(gender) pct_females
    from award natural join investigator natural join person
    where gender != "U" group by 1;

YEAR(AWARD_DATE)             PCT_FEMALES
            1989                   21.74
            1990                   22.21
            1991                   19.79
            1992                   17.90
            1993                   18.81
            1994                   17.69
            1995                   19.91
            1996                   18.82
            1997                   19.52
            1998                   20.85
            1999                   19.61
            2000                   20.02
            2001                   20.94
            2002                   21.04
            2003                   21.93

Notice that there appears to be no significant variations and certainly no trend to suggest that more women are entering into research in the sciences between the years 1989 and 2003. As noted above, the NSF does hire a greater percentage of women program managers. The following query shows the percentage by year and while the percentages are greater than in the prior result no trend is evident here either.

select year(award_date), 100.0*sum(if(gender="F",1,0))/count(gender) PCT_FEMALE_PMS
    from award join person on(prgm_mgr = name)
    where gender != "U" group by 1;

YEAR(AWARD_DATE)          PCT_FEMALE_PMS
            1989                   22.95
            1990                   24.57
            1991                   21.86
            1992                   18.71
            1993                   20.11
            1994                   17.82
            1995                   20.61
            1996                   19.50
            1997                   20.42
            1998                   21.75
            1999                   19.60
            2000                   20.57
            2001                   21.14
            2002                   20.83
            2003                   21.99

This data can be compared to the percentage of women earning doctoral degrees in science, engineering, and health between the years 1989 and 2003 according to the NSF's own data as shown in the following table.

Table 11. Percentage of Science & Engineering Doctorates Earned by Women1
Year All science,
engineering,
and health
fields
Computer sciences Engineering Life sciences Mathematics Physical sciences Psychology Social sciences
1989 29.7 17.6 8.3 38.7 18.0 19.1 56.1 34.1
1990 29.2 15.6 8.5 37.9 17.7 18.8 58.3 33.3
1991 30.3 14.6 9.0 39.2 19.2 19.2 61.4 36.9
1992 30.2 13.8 9.3 39.7 19.4 20.8 59.1 36.0
1993 31.6 15.7 9.2 42.0 23.0 20.9 61.1 37.7
1994 31.9 15.2 10.9 42.2 21.1 20.8 62.2 37.0
1995 32.8 18.7 11.6 42.4 22.3 22.5 63.6 37.8
1996 33.3 15.1 12.3 43.8 20.6 21.8 66.7 36.5
1997 34.5 16.5 12.3 44.9 23.4 22.7 66.4 38.7
1998 36.0 17.2 13.1 45.8 25.2 24.4 66.9 41.5
1999 36.5 18.3 14.8 44.8 25.6 23.6 66.8 41.7
2000 38.0 16.4 15.7 47.2 24.7 25.1 66.6 42.9
2001 38.0 18.7 16.9 47.2 27.3 25.5 66.7 42.9
2002 39.2 20.6 17.6 47.8 28.9 27.3 66.6 44.5
2003 39.4 20.3 17.3 48.5 26.6 27.8 68.1 44.8

Here trends that show an increasing percentage of women who've earned doctorates in every field are clearly evident. What isn't clear is why these same trends are not also represented in the NSF research grant awards. Now I suppose that it is possible that those person table rows in which the gender was not deducible could be a higher percentage of female than male but that does not strike me as likely. One might even ask why the researcher's gender was not included in the data collection. Perhaps it was but it was not included in the report data in order to avoid just this kind of analysis. But that is mere speculation. The culprits, if there really are any, could be anywhere not just who at NSF decides who is awarded research grants. Other data that could be significant requires tracking the gender of the proposed investigators for all grant requests including those that are rejected. If that data were to show a trend that corresponds to that in the above table then it would seem that the fault lies in the grant awards process. However, if no such trend is evident, it is possible that the problem could be inside the grant requesting institutions where the authority for approving grant requests resides with senior research management. However, other NSF data2 does show an historical increase in the percentage of women in senior faculty positions. So, since we evidently do not have all of the data, it would be "a capital mistake to theorize before one has data."