
The
“Top Ten” Problem
by Craig S. Mullins
A
common problem faced by application developers is the requirement to retrieve a
limited number of qualifying rows from the database – for example, to select
the ten largest deposits or the five smallest balances. The first reaction of
most programmers is to simply use the WHERE clause to eliminate non-qualifying
rows. But this is simplistic, and often is not sufficient to produce the results
desired in an optimal manner.
For
example, what if the program requires that only
the top ten results be returned? This can be a difficult request to
formulate using SQL alone. Consider, for example, an application that needs to
retrieve only the top ten highest paid employees from the EMP table in the DB2
sample database. You could simply issue a SQL request that retrieves all of the
books in order by price, but only use the first ten retrieved. That is easy, for
example:
SELECT
EMPNO, LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP
ORDER
BY SALARY DESC;
But
that does not satisfy the requirements of retrieving only the top ten. It
merely returns every row with the results sorted into descending sequence. So
the results would still be all employees in the table, but in the correct order
so you can view the “top ten” salaries easily. When using this approach be
sure to specify the ORDER BY clause with the DESC key word. This sorts the
results into descending order, instead of the default, which is ascending.
Without the DESC key word, the top of the list would contain the lowest paid
employees and the “top ten” would be at the very end of the results set.
For
many users this approach of ordering the desired results to the top may be
sufficient. But it is not a “complete” solution that meets the
specifications of returning only the “top ten” using only SQL. The ideal
solution should return only the top ten employees with the highest salaries. There are
several DB2 solutions that can be used to produce this result. Of course, you
could implement the SQL in a cursor and programmatically return only the first
ten rows. But this would require host language programming skills and an
application program would need to be run each time that the results are
required.
DB2 Version 7 provides an easy way to limit the results
of a SELECT statement using a new clause – the FETCH FIRST n ROWS clause. When the FETCH FIRST n ROWS clause is specified, DB2 will limit the number of rows that
are fetched and returned by a SELECT statement. This Version 7 approach requires
SQL only and is quite simple and efficient. The FETCH FIRST n
ROWS ONLY clause is appended right to the end of the SELECT statement. It is
used as follows:
SELECT
EMPNO,
LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP
ORDER
BY SALARY DESC
FETCH
FIRST 10 ROWS ONLY;
Of
course, the value can be any number – not just 10. For example, to retrieve
only the top 4 salaries you would code:
SELECT
EMPNO,
LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP
ORDER
BY SALARY DESC
FETCH
FIRST 4 ROWS ONLY;
This
is the simplest and probably the most elegant solution for limiting the number
of rows returned by a DB2 query. This clause is different from the OPTIMIZE FOR n
ROWS clause that has been available for several releases of DB2 now. The
FETCH FIRST n ROWS ONLY clause will limit the number of rows returned to the
specified number, n. Remember that the
OPTIMIZE FOR n ROWS clause does not
impact the number of rows returned, but is used only by the optimizer for
optimizing SQL.
But,
the FETCH FIRST n ROWS ONLY clause
requires you to be running at least DB2 Version 7 and you might not be running
at that level. In that case, another approach is required. One approach is to
use the COUNT function and some “tricky” SQL. The following SQL will also
return the top ten employees by salary:
SELECT
EMPNO,
LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP A
WHERE
10 > (SELECT COUNT(*)
FROM
DSN8710.EMP B
WHERE
A.SALARY < B.SALARY
AND B.SALARY
IS NOT NULL)
ORDER
BY SALARY DESC;
Let's
break this query down into components to understand how it works. This SQL
statement uses a correlated subquery. A correlated subquery is an inner query
that must be evaluated for each row of the outer query. The query uses
aliases to identify the table references. The alias “A” identifies the table
in the outer query, so that in the subquery, the A.SALARY identifies
that column as belonging to the outer query's table. The alias “B” is used
for the subquery table (though it is not required).
So,
for each row of the outer query, the subquery counts the number of rows with a
larger score than that of the outer row under consideration. If there are fewer
than 10 rows with a larger score, then the outer row satisfies the outer WHERE
clause – in other words, it belongs in the top ten.
The ORDER BY clause is required to sort the results in
the right order. If it is removed from the query, the results will still contain
the top ten, but they may be in no particular order. Additionally, this query
works for all DB2 versions and platforms (mainframe, Unix, and NT) and it is
portable from DB2 to other database servers, such as SQL Server and Oracle. And,
of course, you can change the constant 10 to any number you wish, thereby
retrieving the top 20, or top 5, as deemed necessary by the needs of your
application.
That
does not mean to suggest that this query is without problems – indeed, it can
be quite inefficient. This particular SQL statement uses a correlated subquery.
The number of rows counted will grow exponentially as the number of rows in the
table grows. It can be quite inefficient when the number of rows grows to as
little as a thousand. For very small amounts of data though, this query usually
performs quite well.
But there is another difference between this query and
the previous one – and that is the way that “ties” are handled. A tie
occurs when more than one row contains the same value. This query may return
more than 10 rows if there are multiple rows with the same value for price
within the top ten. The previous query that used the FETCH FIRST n ROWS ONLY clause always will limit the number of rows returned to n,
even if there are other rows with the same value for price as the number n
row in the results set. The needs of your application will dictate whether ties
are to be ignored or included in the result set. If ties should not be included
in the results set, do not use the last SQL formulation because it will include
ties.
One final approach to the “top ten” problem is the
brute force method. This method relies on systematically identifying the largest
value yet to be returned. It works kind of like peeling back the layers of an
onion. The following example uses the brute force method to return the top ten
salaries from the EMP table:
SELECT
EMPNO, LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP
WHERE
SALARY = (SELECT MAX(SALARY)
FROM DSN8710.EMP)
UNION
ALL
SELECT
EMPNO, LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP
WHERE
SALARY =
(SELECT
MAX(SALARY) FROM DSN8710.EMP
WHERE SALARY < (SELECT MAX(SALARY)
FROM DSN8710.EMP))
UNION
ALL
SELECT
EMPNO, LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP
WHERE
SALARY =
(SELECT
MAX(SALARY) FROM DSN8710.EMP
WHERE SALARY < SELECT MAX(SALARY)
FROM DSN8710.EMP
WHERE
SALARY <
(SELECT MAX(SALARY)
FROM DSN8710.EMP))
UNION
ALL
.
. .
ORDER
BY SALARY DESC;
You
get the picture. The first query in the UNION ALL statement simply retrieves the
largest salary. The second query retrieves the largest salary that is less than
the first salary retrieved. The third query retrieves the largest salary that is
less than the second salary retrieved… and so on until you have retrieved the n
largest salaries. Be sure to include the ORDER BY clause to return the rows
in descending order by salary.
Actually,
this method of obtaining the top ten values is a little bit different than the
other methods we have discussed, too. It actually returns the top ten distinct
values – no duplicate salary values will be returned when using the brute
force method. When multiple salary values fall within the requested range the
results will show only one of the employees that qualify. This can be confusing
because the query may return different employees each time it is run depending
on the access path chosen.
Additionally,
the brute force method is not generally recommended because it can quickly
become quite cumbersome to code; and making changes to such an unwieldy SQL
statement can be very confusing. Furthermore, it is likely that the brute force
method will not perform optimally due to all of the MAX functions and subselects
needed in the multiple UNION statements.
Want
to Return the Bottom Ten?
Any of these queries can be modified to return the bottom
ten instead of the top ten. For the first query, simply remove the DESC from the
ORDER BY clause. This will cause the rows to be sorting into ascending sequence,
which is the default. Then the FETCH FIRST n ROWS ONLY will cause the bottom ten results to be returned.
For
the middle query, using standard SQL alone, simply reverse the less than sign
(>) to a greater than sign (<) in the subquery, and remove the DESC from
the ORDER BY clause, as follows:
SELECT
EMPNO, LASTNAME, FIRSTNME, SALARY
FROM
DSN8710.EMP A
WHERE
10 > (SELECT COUNT(*)
FROM DSN8710.EMP B
WHERE A.SALARY > B.SALARY
AND B.SALARY IS NOT NULL)
ORDER
BY SALARY;
And with the brute force method, you simply can deploy
the same method but using the MIN function and greater than predicates in place
of the MAX function and less than predicates.
Bottom Line
The
“top ten” request is a common application requirement. Any application that
needs to return an ordered subset of a given table can take advantage of one of
these “top ten” queries. Consider using SQL to return only the results you
need instead of writing an application program that reads query results to limit
the results set. “SQL only” solutions can be easier to use than bulky
application programs. But be aware that the “SQL only” approach, depending
on the method deployed, also may be less efficient.
From DB2 Update (Xephon) May 2002.
© 2002 Craig S. Mullins, All rights reserved.
Home.

|