National Science Foundation Awards Database

The data used in this example has been extracted from the University of California Irvine Machine Learning Repository ( The original source data can be found at The data was processed by a Raima-developed RDM SQL program that, in addition to pulling out the data from each award document, converted all personal names to a "last name, first name, …" format and, where possible, identified each person's gender from the first name. The complete DDL specification for the NSF awards database is shown below.

The NSF Awards example is a large database and may take a few minutes to create and populate.

create table person(
    name        char(35) primary key,
    gender      char(1),
    jobclass    char(1)

create table sponsor(
    name        char(50) primary key,
    addr        char(40),
    city        char(20),
    state       char(2),
    zip         char(5),
    constraint geo_loc key(state, city)

create table nsforg(
    orgid       char(3) primary key,
    name        char(40)

create table nsfprog(
    progid      char(4) primary key,
    descr       char(40)

create table nsfapp(
    appid       char(10) primary key,
    descr       char(40)

create table award(
    awardno     integer primary key,
    title       char(200),
    award_date  date key,
    instr       char(3),
    start_date  date,
    exp_date    date key,
    amount      double key,
    prgm_mgr    char(35) references person,
    sponsor_nm  char(50) references sponsor,
    orgid       char(3)  references nsforg

create table investigator(
    awardno     integer  references award,
    name        char(35) references person

create table field_apps(
    awardno     integer  references award,
    appid       char(10) references nsfapp

create table progrefs(
    awardno     integer  references award,
    progid      char(4)  references nsfprog

Descriptions for each of the tables declared in the nsfawards database are given in the following table.

Table 2. NSF Awards Database Table Descriptions
Table Name Description
person Contains one row for each investigator or NSF program manager. An investigator (jobcclass = "I") is a person who is doing the research. The NSF program manager (jobcclass = "P") oversees the research project on behalf of the NSF. An award can have more than one investigator but only one program manager. The gender column is derived from the first name but has three values "M", "F", and "U" for "unknown" when the gender based on the first name could not be determined (about 13%).
sponsor Institution that is sponsoring the research. Usually where the principal investigator is employed. Each award has a single sponsor.
nsforg NSF organization. The highest level NSF division or office under which the grant is awarded.
nsfprog Specific NSF programs responsible for funding research grants.
nsfapp NSF application areas that the research impacts.
award Specific data about the research grant. The columns are fairly self-explanatory. For clarity the exp_data column contains the award expiration data (i.e., when the money runs out). The amount column contains the total funding amount. The instr column is a code indicating the award instrument (e.g., "CTG" = "continuing", "STD" = "standard", etc.).
investigator The specific investigators responsible for carrying out the research. This table is used to form a many-to-many relationship between the person and award tables.
field_apps NSF application areas for which the research is intended. This table is used to form a many-to-many relationship between the nsfapp and award tables.
progrefs Specific programs under which the research is funded. This table is used to form a many-to-many relationship between the nsfprog and award tables.

Note that the interpretations given in the above descriptions are my own and may not be completely accurate (e.g., it could be that NSF programs are not actually responsible for funding research grants). However, my intent is to simply use this data for the purpose of illustration (although we will later delve into an interesting gender analysis).

Note the use of the DISTINCT VALUES clause in the DDL specification. In particular, where the number of actual distinct values is significantly less than the total number of rows in the table it is important to indicate this so that the SQL query optimizer can make better choices as to access method. The Concurrent Database Access section explains in greater detail how the RDM query optimizer works.

A schema diagram for the nsfawards database is shown below. Each box represents a table and each arrow represents a one-to-many relationship between two tables. The arrow label is the foreign key column (declared using the REFERENCES clause in the DDL specification) in the target (i.e. the "many" side) table on which the relationship is formed.

Figure 5 - NSF Awards Database Schema Diagram