Update Statistics Statement

The UPDATE STATS statement will cause RaimaDB SQL to generate a file containing data distribution statistics for each column in every table declared in a database. The syntax for this statement is given below.

upd_stats_stmt:
               UPDATE {STATISTICS | STATS} ON database_name

The name of the database on which the data distribution statistics is to be generated is specified by the identifier database_name. This database must be open in the connection from which this statement is being executed. The data is derived from an evenly distributed sampling of the rows of each table and is stored in a separate database in the same directory on the TFS as the database_name.

The statistics that are accumulated by the UPDATE STATS statement are controlled by two option settings: HISTOSIZE and SAMPLING. How and when the statistics are generated are controlled by two other option settings: AUTORUN and THREADS. These options are set by the SET STATISTICS statement shown below.

set_stats:
          SET {STATISTICS | STATS} 
          [ HISTOSIZE [=] num ]
          [ SAMPLING [=] (pct_num, minrows_num, maxrows_num) ]
          [ AUTORUN [=] (pct_num, minrows_num) ]
          [ THREADS [=] num]

The amount of data to be sampled is specified using the following clauses.

The HISTOSIZE clause specifies the number of distinct values for each column whose distribution counts are to be kept. The default is 25. The minimum allowed value is 10 and the maximum allowed value is 100.

The SAMPLING clause specifies how much of the database is to be sampled and includes a minimum number of rows, maximum number of rows, and percent of table to be scanned (integer value between 0 and 100) of each table to be sampled. So, for example, a "sampling (10, 100, 10000)" means that 10% of the number of rows in each table will be sampled but at least 100 (if there are 100 rows) and no more than 10,000 rows must be sampled. The larger these values are, the more accurate will be the distribution statistics but UPDATE STATS will take longer to run. If the SAMPLING clause is omitted the system default is (10, 1000, 10000). The minimum number of rows specified must be at least 100.

The AUTORUN clause specifies how often RaimaDB SQL should automatically run an UPDATE STATS. The percent (an integer value greater than or equal to 5) specifies a percentage change in the total number of rows stored in the database since the last UPDATE STATS execution. When a percentage change in size greater than percent occurs, an UPDATE STATS will be automatically executed. The min is an integer value that specifies the minimum number of rows that must be contained in the database before an UPDATE STATS will automatically be executed. The minimum allowed min value is 100 rows.

If the AUTORUN clause is not specified then the statistics will only be updated the next time an UPDATE STATS statement is executed. If an AUTORUN was specified, it can be disabled by omitting the AUTORUN clause on a subsequent SETSTATISTICS call.

Automatic UPDATE STATS executions occur on the first database open after the specified AUTORUN conditions have occurred. It will block the application (database open call) until it completes. This means that when this occurs, the database open time delay may be quite noticeable.

The THREADS clause can be used to explicitly specify the number of threads to use during UPDATE STATS execution. If not specified, the system will invoke one thread for each CPU/core available on the computer. Each thread collects the stats on a single table. Note that multi-threaded operation is only possible when the database has been opened in shared mode.

Data distribution statistics are collected for every column declared in the database except for LONG VARCHAR/VARBINARY and ROWID PRIMARY KEY columns.

The statistics for a given database are used by the RaimaDB SQL query optimizer to estimate the potential costs associated with each possible access method that is considered by the optimizer for a given query. By setting a larger HISTOSIZE and ensuring that a representative number of rows are included in the SAMPLING, the optimizer will have good statistics to guide its access method choices. However, a larger HISTOSIZE will require more memory for storing the optimizer stats. In general, the amount of memory needed for a database's statistics is based on the HISTOSIZE and the total number of columns in the database.