27
Mon, Jan
4 New Articles

Making the Best Use of V6 SQL OLAP Functions

SQL
Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Getting the most benefit out DB2 for i requires keeping the concept of "set at time" in mind and making use of all the latest features and functions.

By Mike Cain

In Dan Cruikshank's August 2008 article "Working with Data Sets," Dan illuminated the power of SQL "set at a time" processing, and he used some of the new DB2 for i 6.1 OLAP functions to do it. In this article, I'll expand on this concept and demonstrate some additional uses for OLAP functions, as well as reiterate the reasons that SQL can be so powerful for the data-centric programmer. We'll also take a look at some of the performance considerations for more-complex SQL requests.

 

First, a review of the SQL OLAP functions available as described in the DB2 for i SQL Reference publication:

•·                RANK or DENSE_RANK specifies that the ordinal rank of a row within the window is computed. Rows that are not distinct with respect to the ordering within their window are assigned the same rank. The results of ranking may be defined with or without gaps in the numbers resulting from duplicate values.

RANK specifies that the rank of a row is defined as 1 plus the number of rows that strictly precede the row. Thus, if two or more rows are not distinct with respect to the ordering, there will be one or more gaps in the sequential rank numbering.

DENSE_RANK specifies that the rank of a row is defined as 1 plus the number of preceding rows that are distinct with respect to the ordering. Therefore, there will be no gaps in the sequential rank numbering.

 

•·                ROW_NUMBER specifies that a sequential row number is computed for the row within the window defined by the ordering, starting with 1 for the first row. If the ORDER BY clause is not specified in the window, the row numbers are assigned to the rows in arbitrary order, as returned by the subselect (not according to any ORDER BY clause in the SELECT statement).

 

•·                GROUPING SETS allows multiple grouping clauses to be specified in a single statement. This can be thought of as the union of two or more groups of rows into a single result set.

 

•·                ROLLUP is an extension to the GROUP BY clause that produces a result set containing subtotal rows in addition to the ″regular″ grouped rows. Subtotal rows are ″super aggregate″ rows that contain further aggregates whose values are derived by applying the same column functions that were used to obtain the grouped rows.

 

•·                CUBE is an extension to the GROUP BY clause that produces a result set that contains all the rows of a ROLLUP aggregation and, in addition, contains ″cross-tabulation″ rows. Cross-tabulation rows are additional ″super aggregate″ rows that are not part of an aggregation with subtotals.

 

To set up our simple example, let's assume we have three tables that describe students, classes enrolled in, and activities signed up for (Figure 1). The classes and activities are related to students by the studentid column.

 

create table students (studentid int, studentname char(20));

create table classes (studentid int, class char(20), fee decimal(7,2));

create table activities (studentid int, activity char(20), fee decimal(7,2));

102208CainFigure1.GIF

Figure 1: Our example shows students and their classes and activities.

 

Our job is to produce a report that shows all students along with their respective classes and activities, the fees associated with each class and activity, the total class and activity fees for each student, and a final total of all fees represented in the set (Figure 2).

 

102208CainFigure2.GIF
Figure 2: All students' class and activity costs are totaled. (Click images to enlarge.)

 

Looking at the report, we see that the various activities and classes for each student are not related, yet we need to fully populate each line of the report with information. At first blush, this is not a trivial task, especially if we want to minimize the number of passes through the data. Enter SQL and OLAP functions.

 

To understand how to accomplish this with SQL, we must recognize the work that needs to be done based on the given data model.

 

For each student, we must list the classes and the fees associated with these classes, calculate the total fees, and then order by student. The following SQL statement represents this request:

 

select            s.studentname,

            c.class,

            sum(fee) as total_class_fee

from        students s,

            classes c

where       s.studentid = c.studentid

group by    s.studentname,

            c.class

order by    s.studentname;

For each student, we must also list the activities and the fees associated with these activities, calculate the total fees, and then order by student. The following SQL statement represents this request:

 

select            s.studentname,

            a.activity,

            sum(fee) as total_activity_fee

from        students s,

            activities a

where       s.studentid = a.studentid

group by    s.studentname,

            a.activity

order by    s.studentname;

 

If we were to union the two results together, we would most certainly provide all the results in one set, but the report will contains holes or gaps. In other words, the rows representing classes and activities for each student will be listed vertically, not horizontally. We need to somehow get the class and activity information condensed into as few rows as possible. We need to "drill across" each set of results to find all classes and activities for each student (Figure 3). In effect, we are pivoting either the class information or the activity information up so that it appears alongside the student information horizontally.

 

102208CainFigure3.GIF

Figure 3: Drill across the result sets.

 

The big question is, how do we accomplish this with SQL against our simple relational data model? Part of the answer entails using a left outer join. Specifically, using the distinct set of students on the left and joining to their respective class and activity information on the right. By using a left outer join, the database engine will return the student information even if there is no corresponding class or activity for that student.

 

But how do we know if a student is associated with any activities or classes? And how do we determine the number of lines each student will need to represent their condensed list of classes and activities? Enter ROW_NUMBER OVER. With the ROW_NUMBER function, we can have DB2 assign a number to each intermediate result and use it later. Furthermore, if we use the OVER clause, PARTITION BY clause, and ORDER BY clause, we are able to assign distinct numbers to each result, by student, and have these results ordered by class or activity (Figure 4). We will include grouping criteria in case the student is signed up for the same class or activity more than once. The enhanced SQL statements would like this:

 

select      ROW_NUMBER() OVER (PARTITION BY s.studentname

ORDER BY c.class) AS row_num,

            s.studentname,

            c.class,

            sum(fee) as total_class_fee

from        students s,

            classes c

where       s.studentid = c.studentid

group by    s.studentname,

            c.class

order by    s.studentname,

            row_num;

select      ROW_NUMBER() OVER (PARTITION BY s.studentname

ORDER BY a.activity) AS row_num,

            s.studentname,

            a.activity,

            sum(fee) as total_activity_fee

from        students s,

            activities a

where       s.studentid = a.studentid

group by    s.studentname,

            a.activity

order by    s.studentname,

            row_num;

102208CainFigure4.GIF

Figure 4: The results are ordered and numbered.

Notice that each student now has his respective classes and/or activities ordered and numbered. Keep in mind that the row number is actually part of the result set.

 

To obtain the list of students who have classes, activities, or both, we can run the following statement against the results from classes and activities. Recall that UNION removes duplicates for the result set. Also notice that we want to order the results by studentname and the row number assigned by DB2 (Figure 5).

 

select      studentname,

            row_num

from        intermediate results from the previous query - Figure 4

union

select      studentname,

            row_num

from        intermediate results from the previous query - Figure 4

order by    studentname,

            row_num;

 


102208CainFigure5.GIF

Figure 5: These are the results of merging the lists together via UNION.

 

The result set tells us not only the students who have either classes or activities, but also how many lines it will take to show the information. As the next step in the process, this list of distinct studentnames and row numbers will be used to drill across the classes and activities via a left outer join.

 

select              t3.studentname,

t3.row_num,

t1.activity,

sum(t1.total_activity_fee),

t2.class,

sum(t2.total_class_fee)

from                intermediate results from the previous query - Figure 5

left outer join intermediate results from the previous query - Figure 4

on (t3.studentname = t1.studentname

and t3.row_num = t1.row_num)

left outer join intermediate results from the previous query - Figure 4

on (t3.studentname = t2.studentname

and t3.row_num = t2.row_num)

order by          t3.studentname,

                        t3.row_num

t1.activity,

t2.class;

The results of the left outer joins might look like this (Figure 6):

 

102208CainFigure6.GIF

Figure 6: Here are the results of a left outer join.

 

Now that we have each student's information condensed into the fewest number of lines required, we can turn our attention to providing class and activity fee totals by student and ordering the results (Figure 7). For this, we will use the new OLAP feature grouping sets. In our report, we have three sets of grouping criteria to sum up: fees for each line of the report, fees for each student, and grand totals for the report--namely (studentname, row_num, activity, class), (studentname), and (). The SQL used to drill across and aggregate the data looks like this:

 

select            t3.studentname,

t1.activity,

sum(t1.total_activity_fee),

t2.class,

sum(t2.total_class_fee)

from        intermediate results from the previous query - Figure 5

left outer join intermediate results from the previous query - Figure 4

on (t3.studentname = t1.studentname

and t3.row_num = t1.row_num)

left outer join intermediate results from the previous query - Figure 4

on (t3.studentname = t2.studentname

and t3.row_num = t2.row_num)

group by grouping sets ((t3.studentname, t1.activity, t2.class),

(t3.studentname),

())

order by    t3.studentname,

t1.activity,

t2.class;

102208CainFigure7.GIF 

Figure 7: Now, we have class and activity fee totals by student.

 

Now that we have the various pieces and parts to arrive at our final data set, let's put it all together. While we certainly could use multiple SQL statements and actual intermediate tables to do this, let's use the power of DB2 to do all of this work with one SQL statement. To do this, we will take advantage of common table expressions. Recall that a common table expression permits us to define a result table (with a table identifier) that can be specified as a table name in any FROM clause of the full select that follows. This will allow us to define and build all of our various intermediate results and then perform the left outer join and grouping--in one SQL request. Doing this also has the advantage of allowing the DB2 for i optimizer and database engine to perform all the work in microcode, eliminating the need to materialize true temporary tables. It is also possible that there will be additional benefit from overlapping I/O and data reuse. We should also add the OPTIMIZE FOR ALL ROWS clause because our application will fetch and consume all of the query results without waiting. Here is the single SQL statement to do all the work I have described previously:

 

with t1 as

(

select ROW_NUMBER() OVER (PARTITION BY s.studentname ORDER BY a.activity) AS row_num,

s.studentname,

a.activity,

sum(fee) as total_activity_fee

from students s, activities a

where s.studentid = a.studentid

group by s.studentname, a.activity

),

t2 as

(

select ROW_NUMBER() OVER (PARTITION BY s.studentname ORDER BY c.class) AS row_num,

s.studentname,

c.class,

sum(fee) as total_class_fee

from students s, classes c

where s.studentid = c.studentid

group by s.studentname, c.class

),

t3 as

(

select studentname, row_num

from t1

union

select studentname, row_num

from t2

)

select t3.studentname, t1.activity, sum(t1.total_activity_fee) as Total_Activity_Fee, t2.class,

sum(t2.total_class_fee) as Total_Class_Fee

from t3

left outer join t1 on (t3.studentname = t1.studentname AND t3.row_num = t1.row_num)

left outer join t2 on (t3.studentname = t2.studentname AND t3.row_num = t2.row_num)

group by grouping sets (

(t3.studentname, t3.row_num, t1.activity, t2.class),

(t3.studentname),

()

)

order by t3.studentname, t1.activity, t2.class

optimize for all rows;

Now that we have our SQL statement built and tested, let's turn our attention to performance. Using the advanced System i Navigator tools such as Visual Explain, we can see the "out of the box" plan for this query (Figure 8). We can also make use of the index advice provided automatically by DB2 for i.

 

102208CainFigure8.GIF

Figure 8: This is the plan for our query.

 

Focusing on the top, rightmost nodes, we can see that the plan involves full scans on the tables: students, activities, and classes (Figure 9). This makes sense, given there are no local selection predicates for any of the tables. The inner join order is students, then activities and students, and then classes. To accomplish the joins, hash tables are built and probed. This makes sense, given there are no indexes available to facilitate joins by key.

 

102208CainFigure9.GIF

Figure 9: We accomplish the joins by using hash tables.

 

The results of the inner joins are sorted and placed in temporary lists. These sorted lists will be used to support the row-number processing. This is represented by the ranking nodes. The intermediate results are then combined via the UNION (Figure 10).

 

102208CainFigure10.GIF

Figure 10: The intermediate results are combined via the UNION.

 

The results of the UNION operation are then left outer joined on studentname and row_num to the earlier intermediate results (Figure 11). These results are sorted and placed in a temporary list with duplicates removed.

 

102208CainFigure11.GIF

Figure 11: The UNION results are left outer joined on studentname and row_num.


The final, single, sorted list will be used to support the three levels of aggregations specified by the grouping sets. All three totals are calculated with one pass of the data coming from the previous join results (Figure 12). The aggregated results are sorted based on the order by criteria and delivered to the user.

 

102208CainFigure12.GIF

Figure 12: The totals are calculated.

 

To tune up our query, we can take advantage of the DB2 for i index advisor. The indexes being advised can be found in the table QSYS2.SYSIXADV by using the Navigator. In our case, we are invoking the advisor from Visual Explain and can see that three indexes are being suggested:

 

102208CainFigure13.GIF
Figure 13: The index advisor suggests three indexes.

 

To support the joins between students and their classes and activities, we would provide indexes on the respective join columns, such as this:

 

create index students_ix1 on students (studentid);

create index activities_ix1 on activities (studentid);

create index classes_ix1 on classes (studentid);

These indexes will give the DB2 for i query optimizer more information about the data and provide another option for joining the tables--namely, nested loop join by key. Ultimately, the optimizer will choose what it calculated as the fastest plan for your query, against your data, on your system.

 

If for our report we wanted to select only one student, by name, we would modify the query to have a local selection predicate. We should also provide additional indexes to support the local selection-plus-join conditions. Our new SQL request might look like this:

 

with t1 as

(

select ROW_NUMBER() OVER (PARTITION BY s.studentname ORDER BY a.activity) AS row_num,

s.studentname,

a.activity,

sum(fee) as total_activity_fee

from students s, activities a

where s.studentid = a.studentid

group by s.studentname, a.activity

),

t2 as

(

select ROW_NUMBER() OVER (PARTITION BY s.studentname ORDER BY c.class) AS row_num,

s.studentname,

c.class,

sum(fee) as total_class_fee

from students s, classes c

where s.studentid = c.studentid

group by s.studentname, c.class

),

t3 as

(

select studentname, row_num

from t1

union

select studentname, row_num

from t2

)

select t3.studentname, t1.activity, sum(t1.total_activity_fee) as Total_Activity_Fee, t2.class,

sum(t2.total_class_fee) as Total_Class_Fee

from t3

left outer join t1 on (t3.studentname = t1.studentname AND t3.row_num = t1.row_num)

left outer join t2 on (t3.studentname = t2.studentname AND t3.row_num = t2.row_num)

where t3.studentname = 'Mike Cain'

group by grouping sets (

(t3.studentname, t3.row_num, t1.activity, t2.class),

(t3.studentname),

()

)

order by t3.studentname, t1.activity, t2.class

optimize for all rows;

And to avoid full table scans, our supporting indexes would be these:

create index students_ix2 on students (studentname, studentid);

create index activities_ix2 on activities (studentname, studentid);

create index classes_ix2 on classes (studentname, studentid);

For this scenario, it would be perfectly acceptable to have all six indexes in place. These indexes will be available to handle the request for all students, as well as a particular set of students, as specified in the WHERE clause.

 

Furthermore, if this report would be expected to query, process, and return a much larger set of rows, then the optional DB2 Symmetric Multiprocessing feature could be employed to provide parallel processing. This might entail parallel table scans to read the base tables and populate the hash tables, as well as run the joins in parallel. If multiple CPUs and a supporting I/O subsystem were in place, SMP could possibly provide a significant performance increase and lower the response time.

 

As you can see, SQL is a very powerful and robust language. Getting the most benefit out DB2 for i requires keeping the concept of "set at time" in mind and making use of all the latest features and functions. You can use this drill-across technique to handle situations where you need to place related or unrelated figures next to each other horizontally, such as drilling across and combining measures from two or more fact tables by subject.

 

For more information and details regarding the DB2 for i OLAP functions, be sure to review the 6.1 DB2 for i SQL Reference.

BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$

Book Reviews

Resource Center

  • SB Profound WC 5536 Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application. You can find Part 1 here. In Part 2 of our free Node.js Webinar Series, Brian May teaches you the different tooling options available for writing code, debugging, and using Git for version control. Brian will briefly discuss the different tools available, and demonstrate his preferred setup for Node development on IBM i or any platform. Attend this webinar to learn:

  • SB Profound WP 5539More than ever, there is a demand for IT to deliver innovation. Your IBM i has been an essential part of your business operations for years. However, your organization may struggle to maintain the current system and implement new projects. The thousands of customers we've worked with and surveyed state that expectations regarding the digital footprint and vision of the company are not aligned with the current IT environment.

  • SB HelpSystems ROBOT Generic IBM announced the E1080 servers using the latest Power10 processor in September 2021. The most powerful processor from IBM to date, Power10 is designed to handle the demands of doing business in today’s high-tech atmosphere, including running cloud applications, supporting big data, and managing AI workloads. But what does Power10 mean for your data center? In this recorded webinar, IBMers Dan Sundt and Dylan Boday join IBM Power Champion Tom Huntington for a discussion on why Power10 technology is the right strategic investment if you run IBM i, AIX, or Linux. In this action-packed hour, Tom will share trends from the IBM i and AIX user communities while Dan and Dylan dive into the tech specs for key hardware, including:

  • Magic MarkTRY the one package that solves all your document design and printing challenges on all your platforms. Produce bar code labels, electronic forms, ad hoc reports, and RFID tags – without programming! MarkMagic is the only document design and print solution that combines report writing, WYSIWYG label and forms design, and conditional printing in one integrated product. Make sure your data survives when catastrophe hits. Request your trial now!  Request Now.

  • SB HelpSystems ROBOT GenericForms of ransomware has been around for over 30 years, and with more and more organizations suffering attacks each year, it continues to endure. What has made ransomware such a durable threat and what is the best way to combat it? In order to prevent ransomware, organizations must first understand how it works.

  • SB HelpSystems ROBOT GenericIT security is a top priority for businesses around the world, but most IBM i pros don’t know where to begin—and most cybersecurity experts don’t know IBM i. In this session, Robin Tatam explores the business impact of lax IBM i security, the top vulnerabilities putting IBM i at risk, and the steps you can take to protect your organization. If you’re looking to avoid unexpected downtime or corrupted data, you don’t want to miss this session.

  • SB HelpSystems ROBOT GenericCan you trust all of your users all of the time? A typical end user receives 16 malicious emails each month, but only 17 percent of these phishing campaigns are reported to IT. Once an attack is underway, most organizations won’t discover the breach until six months later. A staggering amount of damage can occur in that time. Despite these risks, 93 percent of organizations are leaving their IBM i systems vulnerable to cybercrime. In this on-demand webinar, IBM i security experts Robin Tatam and Sandi Moore will reveal:

  • FORTRA Disaster protection is vital to every business. Yet, it often consists of patched together procedures that are prone to error. From automatic backups to data encryption to media management, Robot automates the routine (yet often complex) tasks of iSeries backup and recovery, saving you time and money and making the process safer and more reliable. Automate your backups with the Robot Backup and Recovery Solution. Key features include:

  • FORTRAManaging messages on your IBM i can be more than a full-time job if you have to do it manually. Messages need a response and resources must be monitored—often over multiple systems and across platforms. How can you be sure you won’t miss important system events? Automate your message center with the Robot Message Management Solution. Key features include:

  • FORTRAThe thought of printing, distributing, and storing iSeries reports manually may reduce you to tears. Paper and labor costs associated with report generation can spiral out of control. Mountains of paper threaten to swamp your files. Robot automates report bursting, distribution, bundling, and archiving, and offers secure, selective online report viewing. Manage your reports with the Robot Report Management Solution. Key features include:

  • FORTRAFor over 30 years, Robot has been a leader in systems management for IBM i. With batch job creation and scheduling at its core, the Robot Job Scheduling Solution reduces the opportunity for human error and helps you maintain service levels, automating even the biggest, most complex runbooks. Manage your job schedule with the Robot Job Scheduling Solution. Key features include:

  • LANSA Business users want new applications now. Market and regulatory pressures require faster application updates and delivery into production. Your IBM i developers may be approaching retirement, and you see no sure way to fill their positions with experienced developers. In addition, you may be caught between maintaining your existing applications and the uncertainty of moving to something new.

  • LANSAWhen it comes to creating your business applications, there are hundreds of coding platforms and programming languages to choose from. These options range from very complex traditional programming languages to Low-Code platforms where sometimes no traditional coding experience is needed. Download our whitepaper, The Power of Writing Code in a Low-Code Solution, and:

  • LANSASupply Chain is becoming increasingly complex and unpredictable. From raw materials for manufacturing to food supply chains, the journey from source to production to delivery to consumers is marred with inefficiencies, manual processes, shortages, recalls, counterfeits, and scandals. In this webinar, we discuss how:

  • The MC Resource Centers bring you the widest selection of white papers, trial software, and on-demand webcasts for you to choose from. >> Review the list of White Papers, Trial Software or On-Demand Webcast at the MC Press Resource Center. >> Add the items to yru Cart and complet he checkout process and submit

  • Profound Logic Have you been wondering about Node.js? Our free Node.js Webinar Series takes you from total beginner to creating a fully-functional IBM i Node.js business application.

  • SB Profound WC 5536Join us for this hour-long webcast that will explore:

  • Fortra IT managers hoping to find new IBM i talent are discovering that the pool of experienced RPG programmers and operators or administrators with intimate knowledge of the operating system and the applications that run on it is small. This begs the question: How will you manage the platform that supports such a big part of your business? This guide offers strategies and software suggestions to help you plan IT staffing and resources and smooth the transition after your AS/400 talent retires. Read on to learn: