Cost-Based Optimization

The cost to determine the execution plan is the time it takes the optimizer to find the "optimal" plan. An execution plan consists of n steps where n is the number of tables listed in the FROM clause. Each step of the plan specifies the table to be accessed and the method to be used to access rows from that table. The cost increases factorially to the number of tables listed in the FROM clause (n!). Performance impacts start to become noticeable for queries that reference more than about 15-20 tables (this will, of course, vary depending on the speed of your processor). This is due to the increasing number of combinations of access orderings that must be considered (2 tables have 2 possible orderings, 3 have 6, 4 have 24, etc.). The cost to estimate each candidate plan also includes a linear factor of the number of access methods available at each step in a plan from which the optimizer must choose. More access methods means the optimizer must do more work, but the odds of finding a good plan improve.

The cost to carry out an execution plan is the total number of file reads required to access the necessary database information. Because it is extremely difficult to accurately estimate the effects caused by caching performance and diverse database page sizes, physical disk read estimates are not possible. Hence, the system estimates the number of logical file reads based on an analysis of the number required to read a row for each access method.

The statistics maintained for use by cost-based optimizers are used to: 1) guide the choice between alternative access methods derived from the relational expressions specified in the WHERE clause, 2) estimate the number of output rows that result from each plan step, and 3) estimate the number of logical reads incurred by each possible access method.

The statistics used by the RDM cost-based optimizer include:

Number of rows in a table
Number of rows per page in a table (database I/O is performed a page at a time)
Depth of an index's B-tree
Number of keys per page in an index
Distribution counts of common column values
The range of possible values in a column
The number of distinct values in a column

The distribution counts of common column values are collected through the execution of the UPDATE STATS statement and provide the optimizer with the most accurate data. For databases on which an UPDATE STATS has not been executed, the last two stats can be specified by the user through DISTINCT VALUES and RANGE clauses of the CREATE DOMAIN and CREATE TABLE statements or the SET COLUMN STATS statement.

Most SQL implementations adopt a cost-based approach because the quality of the execution plan that is chosen does not depend on how a particular query is formulated. An alternative optimization approach is called rule-based optimization in which the tables are strictly accessed in the order in which the tables are specified in the FROM clause. This method places a greater responsibility on the part of the query formulator to understand the best way for the query to be processed. This is not to suggest that cost-based optimization frees the query developer of having to put any thought into how the query should be constructed (re: opening paragraph of this chapter). If that were so then this discussion would not be necessary. Nevertheless, cost-based optimizers will more reliably produce higher quality query execution plans but no optimization strategy is perfect.