You are here: SQL API > SQL Language Guide > User's Guide > Defining a Database

Defining a Database

But Vronsky felt that now especially it

was essential for him to clear up

and define his position if he were

to avoid getting into difficulties.

- Leo Tolstoy, Anna Karenin

A poorly designed database can create all kinds of difficulties for the user of a database application. Unfortunately, the blame for those difficulties are often laid at the feet of the database management system which, try as it might, simply cannot use non-existent access paths to quickly get at the needed data. Good database design is as much of an art as it is engineering and a solid understanding of the application requirements is a necessary prerequisite. However, it is not the purpose of this document to teach you how to produce good database designs. But you do need to understand that designing a database is a complex task and that the quality of the application in which it is to be used is highly dependent on the quality of the database design. If you are not experienced in designing databases then it is highly recommended that you first consult any number of good books on that subject before setting out to develop your RDM Embedded SQL database.

A database schema is the definition of what kind of data is to be stored and how that data is to be organized in the database. The Database Definition Language (DDL) consists of the SQL statements that are used to describe a particular database schema (also called the database definition). Three DDL statements are provided in RDMe SQL: create database (schema), create domain, and create table. The create database (schema) statement names the database that will be defined by the create domain and create table statements that follow it. The create domain statement allows you to define a special-purpose data type that can be used by a subsequent create table statement in the declaration of a table column. The create table statement is used to define the characteristics of a table that will be stored in the database. Each of these DDL statements are described in detail in the following sections.

Create Database

The create database statement must be the first DDL statement issued for a new database specification. The syntax for this statement is as follows.

create_schema_stmt:
 		create {schema | database} db_name
			[pagesize = num] [inmemory [persistentvolatileread]] [dba4dba8]

Use of "schema" (instead of "database") follows the ISO/ANSI SQL standard convention. The pagesize and inmemory options are RDMe SQL extensions. The pagesize option sets the default page size for all of the database files. If not specified, the default page size is 1024 bytes. The inmemory option indicates that the database is to be kept entirely in memory. The read, persistent, and volatile options control whether the database files are read from disk when the database is opened (read, persistent), and whether they are written to the disk when the database is closed (persistent). The default inmemory option is volatile which means that the database is created empty the first time it is opened and will cease to exist either after the last application closes the database (e.g. Windows) or when the system is rebooted (Unix). The read option means that the entire database is read from the files when the database is opened, changes to the data are allowed but are not written back to the files on closing. The persistent option means that the entire database is read on opening and all changes that were made while the database was open are written when the database is closed.

The database consists of all of the tables that are declared in the create table statements that are issued after the create database statement.

Examples

create database sales;

create database usage_stats no nulls pagesize = 512;

Create Domain

A "domain" is simply a user-defined and named data type which can then be specified as the data type for columns that are declared in a create table statement. The syntax for the create domain statement is shown below.

create_domain_stmt:
		create domain   domain_name [as] data_type
			[default {constant | null}]

The name of the domain is specified as the domain_name. The data_type specifies the base type for the domain. A constant value or null can be specified as the default.

The distinct values clause specifies the number of distinct values that will be stored in columns of this type. The range clause specifies the minimum and maximum values that will be stored in columns of this type. These two clauses provide important information that is only used by the RDMe SQL query optimizer to determine the best possible execution plan for a query. Note that these clauses do not specify column validation checks. It will still be possible to store values that are outside of the specified range.

The data types that are available in RDMe SQL are given in the following syntax specification.

data_type:
		base_type | blob_type
base_type:
 	 	{character | char } [(length)]
 	|	{{character | char} varying | varchar } (length)
 	|	{double [precision] | float | real }
 	|	{ tinyint | smallint | int | integer | long | bigint}
 	|	date | time | timestamp   
blob_type:
 		{{character | char} large object | long varchar | clob} [pagesize   num]
       |	{binary large object   large varbinary | blob} [pagesize   num]

Each specific blob instance is stored in a separate set of blob file pages using only as many pages as are needed to store the entire blob. If the size of the blob data is less than a page the unused space on that page will remain unused. Hence, you should probably supply a pagesize value that will minimize the amount of unused page space based on the average size of your blob data.

Examples

create domain birth_date as
 	date range date "1900-01-01" to date "2011-01-01";
create domain gender as
	char distinct values = 2;
create domain us_state as
	char(2) distinct values = 53

Create Table

Standard Database Table

The create table statement is used to define a table to be included in the database. Create table statements can only be issued after the create database statement and before issuing any other non-DDL statements. Any domain types that are used in column declarations included in the create table statement must have already been declared through the issuance of a prior create domain statement. The syntax for the create table statement is as follows.

standard_table:
 		create [circular] table   table_name (
 			column_def[, column_def]...
 			[, key_def[, key_def]...]
 		) [pagesize = num] [inmemory [persistent | volatile | read]]
			[maxpgs = constant] [maxrows = constant]
column_def:
 		column_name {type_spec | domain_name}
			[distinct values = num] [range   constant   to   constant]
			[not null] [key_spec] [refs_spec]
type_spec:
 		data_type [default {constant | null}]
key_def:
		[primary | unique] key [hash {(num)of   num   rows}] ['['keysize']'] [key_name] 
 			(column_name[, column_name]...)
			[pagesize = num] [inmemory [persistentvolatile | read]] [maxpgs = constant] 
 	|	foreign key [set_name] (column_name[, column_name]...)   references   table_name[(column_name[, column_name]...)]
 			[triggered_action]
refs_spec:
		references   table_name[.column_name] [triggered_action]
triggered_action:
 		on update {cascade | restrict} [on delete {cascade | restrict}]
 	| 	on delete {cascade | restrict} [on update {cascade | restrict}]

The table_name is a user-specified identifier that names the table. The contents of the table is comprised of the columns that are declared within it. Columns are declared to be of a specific data type which is either explicitly given or specified through use of a previously declared domain name. A default value and display format can also optionally be specified unless the column was declared with a domain type.

The distinct values clause specifies the number of distinct values that will be stored in this column. The range clause specifies the minimum and maximum values that will be stored in the column. These two clauses provide important information that is only used by the RDMe SQL query optimizer to determine the best possible execution plan for a query. Note that these clauses do not specify column validation checks. It will still be possible to store values that are outside of the specified range.

Columns can be specified with one or more constraints which declare the column to be:

Foreign key references are automatically implemented using RDMe sets. The name of the column becomes the name of the RDMe set. The RDMe record type into which the SQL table is mapped will not contain a data field for this column. The SQL column value is retrieved through the owner of the set—i.e., the primary key column's value. A triggered_action can be specified with foreign key columns in order to specify what should happen when the referenced row (the owner record instance) is updated or deleted. The default action is restrict meaning that primary key rows that have existing foreign key references cannot be updated/deleted. If on ... cascade is specified, then all of the referenced rows are updated or deleted when the primary key row is updated (i.e., the primary key column value) or deleted. Note that the referencing table may itself have a primary key declared that is referenced by foreign keys in other tables that may not have a cascade triggered action specified. Thus, a delete of the referenced row of a cascade delete allowed table may be denied due to a restrict foreign key on a row of a referencing table.

A key_def on a table is used to declare multi-column primary/unique/non-unique keys and foreign keys. The [primary | unique] key clause is used to identify the columns from the table on which a key is to be formed. A table can have only one primary key. If a key_name is specified then that will be the name of the RDMe compound key. If not specified a unique system-generated name will be used.

Each table is contained in a separate RDMe data file. Each key is contained in a separate RDMe key file. The values for each blob type column is stored in a separate RDMe blob file. The file_option can optionally be specified to provide RDMe-specific file characteristics.

Examples

create table sales_office(
 	office_id char(3) primary key,
 	city char(24),
 	state char(2)
);
create table salesperson(
 	sale_id rowid primary key,
 	name char(38) not null,
 	sex gender,
   dob birth_date,
 	hired_on date default today,
 	sales_tot double,
 	office char(3) references outlet,
 	mgr_id rowid references salesperson,
 	unique key sale_key(name, office)
);
create table customer(
 	cust_id rowid primary key,
	name char(38),
 	sale_id rowid not null 
 		references salesperson 
 			on update cascade 
 			on delete restrict
);

Virtual Table

An RDMe SQL virtual table is defined through a combination of the create virtual table statement and a set of user-written C functions that conform to a particular interface specification. A pointer to a pre-defined structure array that contains and entry for each virtual table with the addresses of each of the virtual table interface functions is passed into SQL before the database is opened. These functions are then called by SQL at the appropriate times during the execution of any SQL statement that references the virtual table. This interaction is depicted in Figure 4 which shows SQL calling the function in the application's virtual table function module to fetch a row of weather data from a wireless sensor network (WSN). Note that in this example by storing the data retrieved from the virtual table in a standard table, RDMe can then replicate that data to an outside host DBMS (e.g., RDM Server or some other well-known SQL DBMS). Also note that the green boxes represent code that is compiled as part of the user's application while the blue boxes represent RDMe systems code.

The syntax for the create virtual table statement is given below.

virtual_table:
 		create virtual [read only] table   table_name (
			vcolumn_def[, vcolumn_def]…
 		)
vcolumn_def:
		column_name base_type
			[distinct values = num] [range   constant   to   constant]
			[primary key]
base_type:
 	 	{character | char } [(length)]
 	|	{{character | char} varying | varchar } (length)
 	|	{double [precision] | float | real }
 	|	{ tinyint | smallint | int | integer | long | bigint}
 	|	date | time | timestamp   


Figure 4. Virtual Tables in RDMe SQL

No create virtual table statement for a given database can be submitted until all standard create table statements have first been submitted. In other words, the create virtual table statements must all come at the end of your database schema specification. Only one primary key column declaration can appear in a create virtual table statement. Values for this column must be unique and will be used by SQL in calls to the user-function in the virtual table interface API to find the row for a specified value.

The DDL schema specification for the aforementioned wireless weather sensor database is given in the following example.

create database weather_db;
create table location(  /* location of weather sensor */
 	longitude integer,
 	latitude integer,
 	sensor_id bigint,
 	descr char(48),
 	county char(24),
 	state char(2),
 	primary key loc_id(longitude, latitude)
);
create table weather_summary(
 	longitude integer,
 	latitude integer,
 	rdg_date date,
 	hour_of_day smallint,
 	avg_temp smallint,
 	avg_ press smallint,
 	avg_hum smallint,
 	avg_lumens smallint,
 	foreign key (longitude, latitude) references location
);
create virtual readonly table weather_data(
 	sensor_id bigint primary key,
 	loc_long integer,
 	loc_lat integer,
 	rdg_time timestamp,
 	temperature smallint,
 	pressure smallint,
 	humidity smallint,
 	light smallint,
 	power integer
);

Compiling a DDL Specification

Of course, you can interactively enter your DDL statements using rdmesql (or any other ODBC-based SQL utility) but you will normally create the DDL specification for your database using a text editor and storing it in a text file. The name of this file can be anything except dbname.ddl because that is the name of a file that RDMe SQL automatically generates that contains the core-level RDMe DDL (non-SQL) specification that corresponds to your SQL DDL. A good convention is to store SQL scripts in files with a ".sql" extension. A convention that I like to use is to name the DDL specification file "dbname.sql". For example, the DDL files for the two example databases described in the next section are "nsfawards.sql" and "bookshop.sql".

Assuming you too use the same convention. you can use rdmesql to compile an SQL DDL file as follows.

rdmesql –b [@hostname:port] dbname.sql

If the @hostname:port is not specified, @localhost:21553 will be used. Errors will be reported to stdout and identify the file and line number of the offending SQL statement. A successful compilation of a DDL specification will produce the dbname_cat.c and dbname_cat.h files in the current directory and the database dictionary file (dbname.dbd), catalog file (dbname.cat), data files (dbname.d*), and key files (dbname.k*) in a directory named dbname on the TFS. The database will be initialized and ready for use.

Example Databases

Two example databases are provided with RDMe SQL that facilitate learning how to use RDMe SQL and will be used in most of the examples given in this book. This section describes the two databases by presenting the DDL specifications along with an explanation of how that data would be used in a SQL application. The first database contains actual data derived from over 130,000 National Science Foundation (USA) research grants that were awarded during the years 1990 through 2003. The second database is for a hypothetical bookshop that only sells high-end, rare antiquarian books.

National Science Foundation Awards Database

The data used in this example has been extracted from the University of California Irvine Knowledge Discovery in Databases Archive (http://kdd.ics.uci.edu/). The original source data can be found at http://kdd.ics.uci.edu/databases/nsfabs/nsfawards.html. The data was processed by Raima-developed RDMe SQL program that, in addition to pulling out the data from each award document, converted all personal names to a "last name, first name, …" format and, where possible, identified each person's gender from the first name. The complete DDL specification for the NSF awards database is shown below.

NOTE: The NSF Awards example is a large database and may take a few minutes to create and populate.

create database nsfawards;
create table person(
    name        char(35) primary key,
    gender      char(1) distinct values = 3,
    jobclass    char(1) distinct values = 2
);
create table sponsor(
    name        char(50) primary key,
    addr        char(40),
    city        char(20),
    state       char(2) distinct values = 100,
    zip         char(5)
);
create table nsforg(
    orgid       char(3) primary key,
    name        char(40)
);
create table nsfprog(
    progid      char(4) primary key,
    descr       char(40)
);
create table nsfapp(
    appid       char(10) primary key,
    descr       char(40)
);
create table award(
    awardno     integer primary key,
    title       char(200),
    award_date  date key,
    instr       char(3) distinct values = 11,
    start_date  date,
    exp_date    date key,
    amount      double key,
    abstract    long varchar,
    prgm_mgr    char(35) references person,
    sponsor_nm  char(50) references sponsor,
    orgid       char(3)  references nsforg
);
create table investigator(
    awardno     integer  references award,
    name        char(35) references person
);
create table field_apps(
    awardno     integer  references award,
    appid       char(10) references nsfapp
);   
create table progrefs(
    awardno     integer  references award,
    progid      char(4)  references nsfprog
);

Descriptions for each of the tables declared in the nsfawards database are given in the following table.

Table 4. NSF Awards Database Table Descriptions
Table Name Description
person Contains one row for each investigator or NSF program manager. An investigator (jobcclass = "I") is a person who is doing the research. The NSF program manager (jobcclass = "P") oversees the research project on behalf of the NSF. An award can have more than one investigator but only one program manager. The gender column is derived from the first name but has three values "M", "F", and "U" for "unknown" when the gender based on the first name could not be determined (about 13%).
sponsor Institution that is sponsoring the research. Usually where the principal investigator is employed. Each award has a single sponsor.
nsforg NSF organization. The highest level NSF division or office under which the grant is awarded.
nsfprog Specific NSF programs responsible for funding research grants.
nsfapp NSF application areas that the research impacts.
award Specific data about the research grant. The columns are fairly self-explanatory. For clarity the exp_data column contains the award expiration data (i.e., when the money runs out). The amount column contains the total funding amount. The instr column is a code indicating the award instrument (e.g., "CTG" = "continuing", "STD" = "standard", etc.).
investigator The specific investigators responsible for carrying out the research. This table is used to form a many-to-many relationship between the person and award tables.
field_apps NSF application areas for which the research is intended. This table is used to form a many-to-many relationship between the nsfapp and award tables.
progrefs Specific programs under which the research is funded. This table is used to form a many-to-many relationship between the nsfprog and award tables.

Note that the interpretations given in the above descriptions are my own and may not be completely accurate (e.g., it could be that NSF programs are not actually responsible for funding research grants). However, my intent is to simply use this data for the purpose of illustration (although we will later delve into an interesting gender analysis).

Note the use of the distinct values clause in the DDL specification. In particular, where the number of actual distinct values is significantly less than the total number of rows in the table it is important to indicate this so that the SQL query optimizer can make better choices as to access method. Chapter 9 explains in greater detail how the RDMe query optimizer works.

A schema diagram for the nsfawards database is shown below. Each box represents a table and each arrow represents a one-to-many relationship between two tables. The arrow label is the foreign key column (declared using the references clause in the DDL specification) in the target (i.e. the "many" side) table on which the relationship is formed.


Figure 5 - NSF Awards Database Schema Diagram

Antiquarian Bookshop Database

Our fictional bookshop is located in Hertford, England (a very real and charming town north of London). It is located in a building constructed around 1735 and has two rather smallish rooms on two floors with floor-to-ceiling bookshelves throughout. Upon entering, one is immediately transported to a much earlier era being quite overwhelmed by the wonderful sight and odor of the ancient mahogany wood in which the entire interior is lined along with the rare and ancient books that reside on them. There is a little bell that announces one's entrance into the shop but it is not really needed, as the delightfully squeeky floor boards quite clearly makes your presence known.

In spite of the ancient setting and very old and rare books, this bookshop has a very modern Internet storefront through which it sells and auctions off its expensive inventory. A computer system contains a database describing the inventory and manages the sales and auction processes. The database schema for our bookshop is given below.

create database bookshop;

create table author(
    last_name   char(13) primary key,
    full_name   char(35),    
    gender      char(1) distinct values = 2,
    yr_born     smallint,
    yr_died     smallint,
    short_bio   varchar(250)
);

create table genres(
    text       char(31) primary key
);

create table subjects(
    text       char(51) primary key
);

create table book(
    bookid      char(14) primary key,
    last_name   char(13) 
        references author on delete cascade on update cascade,
    title       varchar(255),
    descr       char(61),
    publisher   char(136),
    publ_year   smallint key,
    lc_class    char(33),
    date_acqd   date,
    date_sold   date,
    price       double,
    cost        double
);

create table related_name(
    bookid      char(14) 
        references book on delete cascade on update cascade,
    name        char(61)
);
    
create table genres_books(
    bookid      char(14) 
        references book on delete cascade on update cascade,
    genre       char(31) 
        references genres
);

create table subjects_books(
    bookid      char(14) 
        references book on delete cascade on update cascade,
    subject     char(51) 
        references subjects
);

create table acctmgr(
    mgrid       char(7) primary key,
    name        char(24),
    hire_date   date,
    commission  double
);

create table patron(
    patid       char(3) primary key,
    name        char(30),
    street      char(30),
    city        char(17),
    state       char(2),
    country     char(2),
    pc          char(10),
    email       char(63),
    phone       char(15),
    mgrid       char(7) 
        references acctmgr
);

create table note(
    noteid      integer primary key,
    bookid      char(14) 
        references book on delete cascade on update cascade,
    patid       char(3) 
        references patron on delete cascade on update cascade
);

create table note_line(
    noteid      integer 
        references note on delete cascade on update cascade,
    text        char(61)
);

create table sale(
    bookid      char(14) 
        references book on delete cascade on update cascade,
    patid       char(3) 
        references patron on delete cascade on update cascade
);

create table auction(
    aucid       integer primary key,
    bookid      char(14) 
        references book on delete cascade on update cascade,
    mgrid       char(7) 
        references acctmgr,
    start_date  date,
    end_date    date,
    reserve     double,
    curr_bid    double
);

create table bid(
    aucid       integer 
        references auction on delete cascade on update cascade,
    patid       char(3) 
        references patron on delete cascade on update cascade,
    offer       double,
    bid_ts      timestamp
);

Descriptions for each of the above tables are given below.

Table 5. Bookshop Database Table Descriptions
Table Name Description
author Each row contains biographical information about a reknowned author.
book Contains information about each book in the bookshop inventory. The last_name column associates the book with its author. Books with a non null date_sold are no longer available.
genres Table of genre names (e.g., "Historical fiction") with which particular books are associated via the genres_books table.
subjects Table of subject names (e.g., "Cape Cod") with which particular books are associated via the subjects_books table.
related_name Related names are names of individuals associated with a particular book. The names are usually hand-written in the book's front matter or on separate pages that were included with the book (e.g., letters) and identify the book's provenance (owners). Only a few books have related names. However, their presence can significantly increase the value of the book.
genres_books Used to create a many-to-many relationship between genres and books.
subjects_books Used to create a many-to-many relationship between subjects and books.
note Connects each note_line to its associated book. Notes include edition info and other comments (often coded) relating to its condition.
note_line One row for each line of text in a particular note.
acctmgr Account manager are the bookshop employees responsible for servicing the patrons and managing auctions.
patron Bookshop customers and their contact info. Connected to their purchases/bids through their relationship with the sale and auction tables.
sale Contains one row for each book that has been sold. Connects the book with the patron who acquired through the bookid and patid columns.
auction Some books are auctioned. Those that have been (or currently being) auctioned have a row in this table that identifies the account manager who oversees the auction. The reserve column specifies the minimum acceptable bid, curr_bid contains the current amount bid.
bid Each row provides the bid history for a particular auction.

Foreign keys are declared using the references clause. Many are specified with the on delete/update cascade option indicating that deletions or updates to the referenced rows will cause the referencing row to automatically be deleted or updated as well.

A schema diagram depicting the intertable relationships is shown below. As was mentioned above for the NSF awards database, the arrows represent a one-to-many relationship between the source and target tables and labels on the arrows identify the foreign key in the target table on which the relationship is formed.


Figure 6 - Bookshop Database Schema Diagram

The sample data that is included with this example contains book descriptions that were obtained from the United States Library of Congress online card catalog: http://catalog.loc.gov. The short biographical sketches included with each author entry are condensed descriptions from information about each author contained on Wikipedia: http://www.wikipedia.org. The use of the Wikipedia information is governed by the Creative Commons Attribution-ShareAlike license: http://creativecommons.org/licenses/by-sa/3.0/. Pricing information and the JPEG files of photographs of some of the books in the database were obtained from the website for Peter Harrington Antiquarian Bookseller in Chelsea London, http://www.peterharrington.co.uk, which is a perfect real-world example of the kind of bookshop depicted in this example.


Copyright © 2011, Raima Inc. All rights reserved.