With the availability of V5R3 i5/OS and DB2 UDB for iSeries, the SQL Query Engine now supports a powerful strategy for minimizing query input/output (I/O) and maximizing query performance. This new strategy appears as magic, given the Query Optimizer's ability to rewrite the query using generated local selection predicates where none were specified. This article explores the technique and benefits of look-ahead predicate generation (LPG).
In the world of query optimization, minimizing or even eliminating the reading and processing of unnecessary data is the key to good performance because I/O operations are relatively slow operations. The job of the query optimizer is to choose the appropriate methods and build a strategy that will access and process the rows as quickly and efficiently as possible. With a high number of strategies available, the optimizer will normally be able to build a query plan that meets the user response time requirements.
Within a query, the local selection predicates are used to specify which rows are to be selected for processing. One technique for eliminating rows from further processing requires reading the data from the table(s) and testing values. Another technique for eliminating rows is accomplished without actually reading the data from the table(s) but instead relies upon mathematic principles and other database objects to avoid testing rows in the table(s). For example, given a query with local selection that identifies one row out of a one-billion-row table, the database engine can read and test every row (one billion tests), or it can use an index to identify only the one matching row, eliminating all the other rows without reading and testing them. While the result will be the same, the difference in query performance between these two strategies will be significant.
Another query scenario that can cause a lot of reading (and poor performance) is joining of tables. Specifying a join usually causes the database engine to select rows from one table and perform a read operation to find any matching rows in another table. By definition, the join can reduce the result set, but only by testing for the existence of the row. The potential for reading and ultimately rejecting many rows makes this type of query problematic. Again, if the query optimizer can employ specific strategies to eliminate rows before the join, query performance can be increased and use of database server resources can be reduced.
DB2 UDB for iSeries supports many methods and employs many strategies to substantially reduce the reading of rows and the processing of data. One such strategy is called look-ahead predicate generation (LPG). This powerful and elegant feature can significantly reduce query times.
A simple join scenario will help explain why the LPG support can have such a positive affect on a query's performance.
A Couple of Methods
Assume two tables: one relatively small (SmallTableA) and one relatively large (LargeTableB). A join query is issued referencing both tables. Local selection is defined against SmallTableA, but not LargeTableB. The selectivity of the local selection predicate is very high (i.e., it identifies two rows in SmallTableA).
FROM SmallTableA A,
LargeTableB B
WHERE A.Join_Col = B.Join_Col
AND A.Col1 IN (112358, 132134)
With no indexes defined on LargeTableB, you have various table access and join processing combinations:
One strategy is to join LargeTableB to SmallTableA. Select all rows via a full table scan of LargeTableB and join every row to SmallTableA via an index or a hash table. Figure 1 illustrates this join.
Figure 1: Here, LargeTableB is joined to SmallTableA. (Click images to enlarge.)
Given no local selection, every row selected from LargeTableB will be used to try the join to SmallTableA. This combination will produce a large number of reads.
Another strategy is to join SmallTableA to LargeTableB. Select all rows via full table scan of LargeTableB to create a temporary indexed list or temporary hash table, as shown in Figure 2.
Figure 2: Here, SmallTableA is joined to LargeTableB.
Given no local selection to reduce the rows, the temporary indexed list or temporary hash table on LargeTableB will have to represent all the rows from LargeTableB, and both temporary data structures will take a relatively long time to populate. These structures will be relatively large as well.
A Couple of Better Methods
With DB2's LPG and query rewrite techniques, the join scenarios can be enhanced considerably.
If LargeTableB has some local selection defined and this local selection is somewhat selective (i.e., narrowing down the rows to be selected and processed), then the database engine can use methods that will reduce the number of rows accessed. This is where LPG comes in! The query optimizer can "look ahead" to generate local selection from other tables in the query and then transfer that local selection to another table. When this local selection is available, methods that take advantage of indexing technology can be employed to speed up the query request. (Click here for more information about indexing strategies.)
Using the example above, the local selection specified for SmallTableA will be used to identify not only the rows in SmallTableA but also the corresponding join column values. For example, the SmallTableA row containing A.Col1 = 112358 also contains A.Join_Col = Value1 (i.e., the corresponding join column value).
These distinct join column values derived from SmallTableA are used to generate local selection on LargeTableB. Now that LargeTableB has local selection, additional techniques can be applied to minimize or eliminate reading large sets of rows.
Here's a representation of the rewritten query:
FROM SmallTableA A,
LargeTableB B
WHERE A.Join_Col = B.Join_Col
AND A.Col1 IN (112358, 132134)
AND B.Join_Col IN (Value1, Value2)
The new local selection allows the optimizer to consider additional join strategies. If the (generated) local selection on LargeTableB significantly reduces the number of rows accessed, LargeTableB may be placed in the first (primary) join position and accessed via an index.
Another strategy has LargeTableB placed in the second join position. If the join on LargeTableB is implemented via a temporary hash table, the amount of data placed in the hash will also be reduced by the (generated) local selection. Populating the temporary hash table can be much faster if an index is created for the (generated) local selection. Figure 3 illustrates this strategy.
Figure 3: Here, LargeTableB is placed in the second join position.
The real magic of LPG comes into play when there is more than one table joined to the larger table.
Assume four tables: three relatively small (SmallTableA, SmallTableB, SmallTableC) and one relatively large (LargeTableD). A join query is issued referencing both small and large tables. Local selection is defined against the small tables, but not the larger LargeTableD. The selectivity of the local selection predicates is very high (i.e., it identifies few rows in SmallTableA, SmallTableB, and SmallTableC).
FROM SmallTableA A,
SmallTableB B,
SmallTableC C,
LargeTableD D
WHERE A.Join_Col = D.Join_Col1
AND B.Join_Col = D.Join_Col2
AND C.Join_Col = D.Join_Col3
AND A.Col1 IN (112358, 132134)
AND B.Col6= 'ABC'
AND C.Col4= 2005
AND C.Col5= 'January'
These are the various table access and join processing combinations for the first join pair:
- SmallTableA joined to LargeTableD
- SmallTableB joined to LargeTableD
- SmallTableC joined to LargeTableD
Select all rows via full table scan of LargeTableB to create a temporary indexed list or a temporary hash table. For all rows joined to LargeTableD, join to other two SmallTables, as shown in Figure 4.
Figure 4: Here, all rows joined to LargeTableD join to other two SmallTables.
Now, consider these table access and join processing combinations:
- LargeTableD joined to SmallTableA
- LargeTableD joined to SmallTableB
- LargeTableD joined to SmallTableC
Select all rows via full table scan of LargeTableD and join every row to SmallTableA via an index or a hash table. For all rows from LargeTableD, join to the other three SmallTables, as shown in Figure 5.
Figure 5: Here, all rows from LargeTableD join to the other three SmallTables.
Using the four-table query example above, the local selection specified for SmallTableA, SmallTableB, and SmallTableC will be used not only to identify the rows in the respective tables, but also to identify the corresponding join column values. These distinct join column values (derived from SmallTableA, SmallTableB, and SmallTableC) are used to generate local selection on LargeTableD using LPG. See Figure 6.
Figure 6: The distinct join column values use LPG to generate local selection on LargeTableD.
The rewritten query would be represented like this:
FROM SmallTableA A,
SmallTableB B,
SmallTableC C,
LargeTableD D
WHERE A.Join_Col = D.Join_Col1
AND B.Join_Col = D.Join_Col2
AND C.Join_Col = D.Join_Col3
AND A.Col1 IN (112358, 132134)
AND B.Col6= 'ABC'
AND C.Col4= 2005
AND C.Col5= 'January'
AND D.Join_Col1 IN (Value1, Value2)
AND D.Join_Col2 IN (Value1, Value2, ... Value n)
AND D.Join_Col3 IN (Value1, Value2, ... Value n)
Now that LargeTableD has local selection defined, an appropriate set of indexes can be created and applied to minimize (or eliminate) reading large sets of rows.
Yet another DB2 UDB for iSeries feature can be employed to handle the newly generated local selection--namely, index ANDing. This is the ability of the query optimizer and database engine to use more than one index to specifically identify and access rows within a given table. Using single-column-key indexes defined for each join column in LargeTableD, the database engine can merge the list of rows identified by each respective index and use the final list to access the table. The intersection (ANDing) of all three local selection predicates should result in a much narrower set of rows to be accessed and processed in LargeTableD. Using the indexes allows the database engine to avoid reading a large set of rows, resulting in a more efficient and faster query. See Figure 7.
Figure 7: Index ANDing allows the database engine to avoid reading a large set of rows.
The optimizer can also use LPG with common join requests that contain a one-to-one relationship. In this case, LPG is used to recursively look ahead and generate local selection from the neighboring table "downstream." This has the tremendous benefit of allowing the optimizer much more latitude in setting the join order. Furthermore, this strategy reduces the risk of any one particular join order delivering poor performance.
Assume four tables all roughly the same size: TableA, TableB, TableC, and TableD. A join query is issued referencing all four tables with a one-to-one relationship (A to B, B to C, C to D). Local selection is defined against only one table, TableA.
FROM TableA A,
TableB B,
TableC C,
TableD D
WHERE A.Join_Col1 = B.Join_Col1
AND B.Join_Col2 = C.Join_Col2
AND C.Join_Col3 = D.Join_Col3
AND A.Col5 = 'ABC'
Determining the best-performing query plan can be problematic, depending on the selectivity of the local selection predicates defined for TableA and the relative position of TableA in the join order. If all the tables have some local selection defined, the optimizer has many more choices for the query join order and consequently more opportunities to provide multiple high-performance query plans.
Using the four-table query example above, the local selection specified for TableA will be used not only to identify the rows in TableA, but also to identify the corresponding join column values. These distinct join column values (derived from TableA) are used to generate local selection on TableB.
The generated local selection specified for TableB will be used to identify the rows in TableB and identify the corresponding join column values. These distinct join column values (derived from TableB) are used to generate local selection on TableC.
The generated local selection specified for TableC will be used to identify the rows in TableD and identify the corresponding join column values. These distinct join column values (derived from TableC) are used to generate local selection on TableD.
Now the query has been rewritten and optimized with local selection on all four tables (Figure 8).
Figure 8: The query is optimized with local selection on all four tables.
Here's a representation of the rewritten query:
FROM TableA A,
TableB B,
TableC C,
TableD D
WHERE A.Join_Col1 = B.Join_Col1
AND B.Join_Col2 = C.Join_Col2
AND C.Join_Col3 = D.Join_Col3
AND A.Col5 = 'ABC'
AND B.Join_Col1 IN (Value1, Value2, ... Value n)
AND C.Join_Col2 IN (Value1, Value2, ... Value n)
AND D.Join_Col3 IN (Value1, Value2, ... Value n)
Being aware of and planning for LPG will allow the creation of indexes to effectively support join queries. Without this knowledge, and the corresponding indexes, better query performance will be squandered. The strategy for creating indexes for the generated local selection predicates is identical to creating indexes for user-supplied local predicates. Identifying the join columns is a clue to identifying which columns to create indexes on.
Query Magic
Like a good magician, the query optimizer uses techniques that are hidden from the audience. Only the final result is obvious. Given that query rewrite and LPG occurs out of sight, it will be helpful to get some feedback from the optimizer when this strategy is employed. One simple technique is to observe the estimated selectivity of each table in the query when that table has no user-supplied local selection predicates. Without LPG, the estimated number of rows selected should be equal to the total number of rows in the table. If the estimated number of rows selected is less than the total, then some selection is being applied.
Another more accurate technique is to rely on iSeries Navigator - Visual Explain to render the query graph. Once the query graph is displayed, any nodes that are under the influence of LPG can be highlighted (in green). To identify the use of LPG, select the "Highlight LPG" option on the View menu within the Visual Explain window.
Figure 9: iSeries Navigator - Visual Explain renders the query graph.
The key to fully utilizing an optimization capability is in integrating that capability into the normal flow of query optimization--in effect, making that capability pervasive. Traditionally, sophisticated optimization techniques rely on interrogation of the query or query environment to recognize when and where to apply the specific technique. If this identification is successful, then the technique can be successful; otherwise, the technique is useless or even detrimental.
While the DB2 UDB for iSeries LPG may seem like a specialized and rarely needed technique, it really is a pervasive and powerful tool within the optimizer's bag of tricks. The simple join examples illustrated here are but one area in which this capability can be applied. Other areas include queries against star schema and snowflake schema data models, as well as SQL requests with subqueries, common table expressions, or derived tables.
Rob Bestgen is a senior technical staff member and query optimizer design leader on the IBM DB2 for iSeries team in Rochester, Minnesota. Rob can be reached at
Mike Cain is a senior technical staff member and leader of the DB2 for iSeries Center of Competency in Rochester, Minnesota. Mike can be reached at
LATEST COMMENTS
MC Press Online