An Optimizer Generator Based on C++

OPTL is a language for specifying database query optimizers that captures a large portion of the optimizer specification information in a declarative manner. It extends C++ with a number of tree manipulation constructs and with a rule language for specifying query transformations. OPTL is based on a model of query rewriting where query expressions carry annotations that are propagated during query transformation and planning. This framework is reminiscent of inherited and synthesized attributes for attribute grammars, and it is expressive of a wide range of information: logical and physical properties, both desired and delivered, cost estimates, optimization contexts, and control strategies. OPTGEN is a C++ preprocessor that maps OPTL specification into executable code (C++ code). In contrast to most optimizer specification architectures, OPTGEN is a complete tool that produces real code from declarative specifications that include a large range of information. OPTL can capture a wide range of optimization frameworks, in a declarative and easily extensible form. It is well-suited for experimentation and fast prototyping.

The query optimization task can be seen as a mapping from logical algebraic terms that represent database queries into physical plans that represent scripts for evaluating the queries, such that the produced plans are optimal with respect to some cost model. This mapping can be captured in the form of a rule-based term-rewriting system. The optimization process involves searching the space of equivalent program forms, generated mainly by the syntactic transformations captured as rewrite rules, and by the alternative access paths for accessing the same database objects, such as by the indices for retrieving data from relational tables.

Rule-based term-rewriting systems are characterized by their declarative structure. They support a partial separation of control from the semantics of the rewrite rules. This separation and the independence between rules are properties that make the task of the optimizer specification more manageable. Rule-based systems are powerful tools for systematic description of query optimizers, supporting easy creation, extension, and maintenance of large systems. They facilitate error-free optimizer specifications as well as experimentation and fast prototyping.

For example, consider the following rule expressed in OPTL:

join(`x,`y,and(`p1,`p2))
       = intersect(join(`x,`y,`p1),join(`x,`y,`p2));

The first line gives the rule head, which is a term pattern with pattern variables x, y, p1, and p2, while the second line gives the rule body, which is a term construction. The logical operation join(x,y,p) has three parameters: the outer relation x, the inner relation y, and the join predicate p. This rule indicates that a join whose predicate is a conjunction of two predicates p1 and p2 is transformed into an intersection of two joins: the first with predicate p1 and the second with p2. When a logical expression matches the head of the rule, all pattern variables are bound to the subterms of this logical expression. These subterms are then optimized by the same rewriting process. The rule body constructs a new term by replacing the variables with the new optimized terms associated with the variables.

A rule describes a pattern of a local syntactic transformation on an expression tree. Most rule-based term-rewriting systems in addition support a partial form of context sensitivity in the form of rule guards: that is, predicates that test semantic properties of the term being transformed. Experience with query optimizers and other term-rewriting systems suggests the need for more sophisticated context control. For example, query optimizers for relational databases that support sort-merge joins need to enforce sort order on some subplans. In addition, there are often groups of rules that should be applied in a strict or partial order. In general, there might exist a transformation script that describes a strategy for the successive application of atomic transformations. Incorporating such information into the search strategy could result in a system whose control is not completely isolated from its rewrite rules, thus lessening the benefits of rule-based systems. The alternative is to incorporate this knowledge about rule ordering as part of the rewrite context, switching context after each rule application. This approach implies that rules should not consist of pure syntactic transformations only, but include semantic transformations to the context as well.

Attribute grammars suggest a method of solving this problem: they provide a declarative way of expressing both syntactic and semantic transformations. Attribute grammars have been introduced as a method to describe context-sensitive semantics on top of a context-free syntactic base. They have been used as a tool for formally specifying programming languages as well as for automatic construction of their compilers. They introduce two types of attributes: inherited and synthesized. Semantic transformation at each grammar production is indicated by specifying how these attributes are propagated up and down a parse tree. OPTL is based on attribute grammars, but there are some important differences. Even though a rewrite rule resembles a grammar production rule, terms in our language are not parse trees and attributes are propagated not only through rewrites but also inside of terms. By using term-rewriting systems with attributes we gain many advantages. The attribute grammar framework is adequate to capture other optimizer specification frameworks, such as Volcano, in a more uniform and concise way. In fact, attributes are generalizations of logical and physical properties used in that and other optimizer frameworks. In addition, specifications are clearer because of the functional evaluation of attributes. Semantic and syntactic definitions are separated, yet they are both integrated into the same patterns of terms manipulated at each rule: each rule now specifies how terms are transformed and how attributes are propagated. The alternative is to maintain the rewrite context globally, performing destructive updates when a different context is needed between rewrites. This approach often leads to obscure specifications with potential for errors, especially when the rewriting system is specified by rules whose interleaving evaluation is unpredictable.

As an example of how attributes are propagated in our specification language consider the following rule taken from the domain of relational query optimization:


{ order = `expected_order; }
   join( `x <- { order = `expected_order; },
         `y <- { order = order(); },
         `pred )
   = nested_loop(`x,`y,`pred)
     { order = ^x.order; };

This rule indicates that the algebraic operation join(x,y,pred) is transformed into the physical algorithm nested_loop(x,y,pred). The name order in the first three lines of the rule is an inherited attribute that represents the required (expected) sort order of the stream of tuples generated by a term. If a term is a join from which we expect any order expected_order, then the expected sort order for the outer relation x is also expected_order, while the expected order for the inner relation y is empty. That is, this rule propagates the expected order only to the outer relation of the join, because the order of the inner relation does not affect the sort order of the join result. All inherited attributes are passed from the term being rewritten to its subterms as is, except in the case when these subterms are annotated, as it is the case for the first and the second parameters of the join. Thus, the order binding of the outer relation could be omitted because the default value of the propagated order is expected_order. The name order at the last line of the rule is a synthesized attribute. It indicates the real order of the produced plan. When the rule completes its execution, the synthesized order of the produced plan is set to the real order of the outer input, which is the value of the synthesized attribute order of the variable x (indicated by ^x.order).

Each optimizer specification in OPTL consists of a number of rule-bases. Currently, OPTL allows any number of logical modules (where each one contains rules to transform algebraic terms to algebraic terms) and only one physical module (that contains rules to transform algebraic terms to plans). These modules are composed in a prespecified way: they are executed in sequence, where each module applies to the results of the previous module. There are many additional features in OPTL that are omitted here. They will be described as needed when they used in the rest of the paper.

Next page: OPTL Reference Guide

Last modified: 2/19/97 by Leonidas Fegaras