The data used in this example has been extracted from the University of California Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/). The original source data can be found at https://archive.ics.uci.edu/ml/machine-learning-databases/nsfabs-mld/nsfawards.htmll. The data was processed by a Raima-developed RDM SQL program that, in addition to pulling out the data from each award document, converted all personal names to a "last name, first name, …" format and, where possible, identified each person's gender from the first name. The complete DDL specification for the NSF awards database is shown below.
The NSF Awards example is a large database and may take a few minutes to create and populate.
create table person( name char(35) primary key, gender char(1), jobclass char(1));create table sponsor( name char(50) primary key, addr char(40), city char(20), state char(2), zip char(5), constraint geo_loc key(state, city));create table nsforg( orgid char(3) primary key, name char(40));create table nsfprog( progid char(4) primary key, descr char(40));create table nsfapp( appid char(10) primary key, descr char(40));create table award( awardno integer primary key, title char(200), award_date date key, instr char(3), start_date date, exp_date date key, amount double key, prgm_mgr char(35) references person, sponsor_nm char(50) references sponsor, orgid char(3) references nsforg);create table investigator( awardno integer references award, name char(35) references person);create table field_apps( awardno integer references award, appid char(10) references nsfapp); create table progrefs( awardno integer references award, progid char(4) references nsfprog);
Descriptions for each of the tables declared in the
nsfawards database are given in the following table.
|person||Contains one row for each investigator or NSF program manager. An investigator (jobcclass = "I") is a person who is doing the research. The NSF program manager (jobcclass = "P") oversees the research project on behalf of the NSF. An award can have more than one investigator but only one program manager. The gender column is derived from the first name but has three values "M", "F", and "U" for "unknown" when the gender based on the first name could not be determined (about 13%).|
|sponsor||Institution that is sponsoring the research. Usually where the principal investigator is employed. Each award has a single sponsor.|
|nsforg||NSF organization. The highest level NSF division or office under which the grant is awarded.|
|nsfprog||Specific NSF programs responsible for funding research grants.|
|nsfapp||NSF application areas that the research impacts.|
|award||Specific data about the research grant. The columns are fairly self-explanatory. For clarity the exp_data column contains the award expiration data (i.e., when the money runs out). The amount column contains the total funding amount. The instr column is a code indicating the award instrument (e.g., "CTG" = "continuing", "STD" = "standard", etc.).|
|investigator||The specific investigators responsible for carrying out the research. This table is used to form a many-to-many relationship between the person and award tables.|
|field_apps||NSF application areas for which the research is intended. This table is used to form a many-to-many relationship between the nsfapp and award tables.|
|progrefs||Specific programs under which the research is funded. This table is used to form a many-to-many relationship between the nsfprog and award tables.|
Note that the interpretations given in the above descriptions are my own and may not be completely accurate (e.g., it could be that NSF programs are not actually responsible for funding research grants). However, my intent is to simply use this data for the purpose of illustration (although we will later delve into an interesting gender analysis).
Note the use of the DISTINCT VALUES clause in the DDL specification. In particular, where the number of actual distinct values is significantly less than the total number of rows in the table it is important to indicate this so that the SQL query optimizer can make better choices as to access method. The Concurrent Database Access section explains in greater detail how the RDM query optimizer works.
A schema diagram for the
nsfawards database is shown below. Each box represents a table and each arrow represents a one-to-many relationship between two tables. The arrow label is the foreign key column (declared using the REFERENCES clause in the DDL specification) in the target (i.e. the "many" side) table on which the relationship is formed.
Figure 5 - NSF Awards Database Schema Diagram