I’m pretty sure you’ve gotten it wrong in numerous ways, so far. And you probably won’t get it right any time soon. So why waste your precious time on SQL tweaking, when you could be implementing business logic?
Let me explain…
It hasn’t been until the recent SQL:2008 standard that what MySQL users know as
LIMIT .. OFFSET was standardised into the following simple statement:
SELECT * FROM BOOK OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY
Yes. So many keywords.
SQL is indeed a very verbose language. Personally, we really like the conciseness of MySQL’s / PostgreSQL’s
LIMIT .. OFFSET clause, which is why we chose that for the jOOQ DSL API.
SELECT * FROM BOOK LIMIT 1 OFFSET 2
Now, when you’re a SQL framework vendor, or when you’re rolling your own, in-house SQL abstraction, you might think about standardising this neat little clause. Here’s a couple of flavours from databases that natively support offset pagination:
-- MySQL, H2, HSQLDB, Postgres, and SQLite SELECT * FROM BOOK LIMIT 1 OFFSET 2 -- CUBRID supports a MySQL variant of the -- LIMIT .. OFFSET clause SELECT * FROM BOOK LIMIT 2, 1 -- Derby, SQL Server 2012, Oracle 12, SQL:2008 SELECT * FROM BOOK OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY -- Ingres. Eek, almost the standard. Almost! SELECT * FROM BOOK OFFSET 2 FETCH FIRST 1 ROWS ONLY -- Firebird SELECT * FROM BOOK ROWS 2 TO 3 -- Sybase SQL Anywhere SELECT TOP 1 ROWS START AT 3 * FROM BOOK -- DB2 (without OFFSET) SELECT * FROM BOOK FETCH FIRST 1 ROWS ONLY -- Sybase ASE, SQL Server 2008 (without OFFSET) SELECT TOP 1 * FROM BOOK
So far, so good. These can all be handled. Some databases put offsets before limits, others put limits before offsets, and the T-SQL family puts the whole
TOP clause before the
SELECT list. This is easy to emulate. Now what about:
- Oracle 11g and less
- SQL Server 2008 and less
- DB2 with OFFSET
When you google for this, you will find millions of ways to emulate
OFFSET .. FETCH in those older databases. The optimal solutions always involve:
- Using doubly-nested derived tables with
ROWNUMfiltering in Oracle
- Using single-nested derived tabels with
ROW_NUMBER()filtering in SQL Server and DB2
So you’re emulating it.
Do you think you will get it right?
Let us go through a couple of issues that you may not have thought about.
First off, Oracle. Oracle obviously wanted to create a maximum vendor-lockin, which is only exceeded by Apple’s recent introduction of Swift. This is why
ROWNUM solutions perform best, even better than SQL:2003 standard window function based solutions. Don’t believe it? Read this very interesting article on Oracle offset pagination performance.
So, the optimal solution in Oracle is:
-- PostgreSQL syntax: SELECT ID, TITLE FROM BOOK LIMIT 1 OFFSET 2 -- Oracle equivalent: SELECT * FROM ( SELECT b.*, ROWNUM rn FROM ( SELECT ID, TITLE FROM BOOK ) b WHERE ROWNUM <= 3 -- (1 + 2) ) WHERE rn > 2
So that’s really the equivalent?
Of course not. You’re selecting an additional column, the
rn column. You might just not care in most cases, but what if you wanted to make a limited subquery to be used with an
-- PostgreSQL syntax: SELECT * FROM BOOK WHERE AUTHOR_ID IN ( SELECT ID FROM AUTHOR LIMIT 1 OFFSET 2 ) -- Oracle equivalent: SELECT * FROM BOOK WHERE AUTHOR_ID IN ( SELECT * -- Ouch. These are two columns! FROM ( SELECT b.*, ROWNUM rn FROM ( SELECT ID FROM AUTHOR ) b WHERE ROWNUM <= 3 ) WHERE rn > 2 )
So, as you can see, you’ll have to do some more sophisticated SQL transformation. If you’re manually emulating
LIMIT .. OFFSET, then you might just patch the
ID column into the subquery:
SELECT * FROM BOOK WHERE AUTHOR_ID IN ( SELECT ID -- better FROM ( SELECT b.ID, ROWNUM rn -- better FROM ( SELECT ID FROM AUTHOR ) b WHERE ROWNUM <= 3 ) WHERE rn > 2 )
So, that’s more like it, right? But since you’re not writing this manually every time, you’re about to start creating your own nifty in-house SQL framework covering the 2-3 use cases that you’ve encountered so far, right?
You can do it. So you’ll regex-search-replace column names automagically to produce the above.
So now, it is correct?
Of course not! Because you can have ambiguous column names in top-level
SELECTs, but not in nested selects. What if you want to do this:
-- PostgreSQL syntax: -- Perfectly valid repetition of two ID columns SELECT BOOK.ID, AUTHOR.ID FROM BOOK JOIN AUTHOR ON BOOK.AUTHOR_ID = AUTHOR.ID LIMIT 1 OFFSET 2 -- Oracle equivalent: SELECT * FROM ( SELECT b.*, ROWNUM rn FROM ( -- Ouch! ORA-00918: column ambiguously defined SELECT BOOK.ID, AUTHOR.ID FROM BOOK JOIN AUTHOR ON BOOK.AUTHOR_ID = AUTHOR.ID ) b WHERE ROWNUM <= 3 ) WHERE rn > 2
Nope. And the trick of manually patching ID columns from the previous example doesn’t work, because you have multiple
ID instances. And renaming the columns to random values is nasty, because the user of your home-grown in-house database framework wants to receive well-defined column names. I.e.
So, the solution is to rename the columns twice. Once in each derived table:
-- Oracle equivalent: -- Rename synthetic column names back to original SELECT c1 ID, c2 ID FROM ( SELECT b.c1, b.c2, ROWNUM rn FROM ( -- synthetic column names here SELECT BOOK.ID c1, AUTHOR.ID c2 FROM BOOK JOIN AUTHOR ON BOOK.AUTHOR_ID = AUTHOR.ID ) b WHERE ROWNUM <= 3 ) WHERE rn > 2
But now, we’re done?
Of course not! What if you doubly nest such a query? Will you think about doubly renaming
ID columns to synthetic names, and back? … Let’s leave it here and talk about something entirely different:
Does the same thing work for SQL Server 2008?
Of course not! In SQL Server 2008, the most popular approach is to use window functions. Namely,
ROW_NUMBER(). So, let’s consider:
-- PostgreSQL syntax: SELECT ID, TITLE FROM BOOK LIMIT 1 OFFSET 2 -- SQL Server equivalent: SELECT b.* FROM ( SELECT ID, TITLE, ROW_NUMBER() OVER (ORDER BY ID) rn FROM BOOK ) b WHERE rn > 2 AND rn <= 3
So that’s it, right?
Of course not!
OK, we’ve already had this issue. We should not select
*, because that would generate too many columns in the case that we’re using this as a subquery for an
IN predicate. So let’s consider the correct solution with synthetic column names:
-- SQL Server equivalent: SELECT b.c1 ID, b.c2 TITLE FROM ( SELECT ID c1, TITLE c2, ROW_NUMBER() OVER (ORDER BY ID) rn FROM BOOK ) b WHERE rn > 2 AND rn <= 3
But now we got it, right?
Make an educated guess: Nope!
What happens, if you add an
ORDER BY clause to the original query?
-- PostgreSQL syntax: SELECT ID, TITLE FROM BOOK ORDER BY SOME_COLUMN LIMIT 1 OFFSET 2 -- Naive SQL Server equivalent: SELECT b.c1 ID, b.c2 TITLE FROM ( SELECT ID c1, TITLE c2, ROW_NUMBER() OVER (ORDER BY ID) rn FROM BOOK ORDER BY SOME_COLUMN ) b WHERE rn > 2 AND rn <= 3
Now, that doesn’t work in SQL Server. Subqueries are not allowed to have an
ORDER BY clause, unless they also have a
TOP clause (or an
OFFSET .. FETCH clause in SQL Server 2012).
OK, we can probably tweak this using
TOP 100 PERCENT to make SQL Server happy.
-- Better SQL Server equivalent: SELECT b.c1 ID, b.c2 TITLE FROM ( SELECT TOP 100 PERCENT ID c1, TITLE c2, ROW_NUMBER() OVER (ORDER BY ID) rn FROM BOOK ORDER BY SOME_COLUMN ) b WHERE rn > 2 AND rn <= 3
Now, that’s correct SQL according to SQL Server, although you do not have a guarantee that the ordering of the derived table will survive after query execution. It may well be that the ordering is changed again by some influence.
If you wanted to order by
SOME_COLUMN in the outer query, you’d have to again transform the SQL statement to add another synthetic column:
-- Better SQL Server equivalent: SELECT b.c1 ID, b.c2 TITLE FROM ( SELECT TOP 100 PERCENT ID c1, TITLE c2, SOME_COLUMN c99, ROW_NUMBER() OVER (ORDER BY ID) rn FROM BOOK ) b WHERE rn > 2 AND rn <= 3 ORDER BY b.c99
That does start getting a bit nasty. And let’s guess whether:
This is the correct solution!
Of course not! What if the original query had
DISTINCT in it?
-- PostgreSQL syntax: SELECT DISTINCT AUTHOR_ID FROM BOOK LIMIT 1 OFFSET 2 -- Naive SQL Server equivalent: SELECT b.c1 AUTHOR_ID FROM ( SELECT DISTINCT AUTHOR_ID c1, ROW_NUMBER() OVER (ORDER BY AUTHOR_ID) rn FROM BOOK ) b WHERE rn > 2 AND rn <= 3
Now, what happens if an author has written several books? Yes, the
DISTINCT keyword should remove such duplicates, and effectively, the PostgreSQL query will correctly remove duplicates first, and then apply
ROW_NUMBER() predicate always generates distinct row numbers before
DISTINCT can remove them again. In other words,
DISTINCT has no effect.
-- Better SQL Server equivalent: SELECT b.c1 AUTHOR_ID FROM ( SELECT DISTINCT AUTHOR_ID c1, DENSE_RANK() OVER (ORDER BY AUTHOR_ID) rn FROM BOOK ) b WHERE rn > 2 AND rn <= 3
Read more about this trick here:
Watch out that the
ORDER BY clause must contain all columns from the
SELECT field list. Obviously, this will limit the acceptable columns in the
SELECT DISTINCT field list to columns that are allowed in a window function’s
ORDER BY clause (e.g. no other window functions).
We could of course try to fix that as well using common table expressions, or we consider
Yet another issue??
Yes, of course!
Do you even know what the column(s) in the window function’s
ORDER BY clause should be? Have you just picked any column, at random? What if that column doesn’t have an index on it, will your window function still perform?
The answer is easy when your original
SELECT statement also has an
ORDER BY clause, then you should probably take that one (plus all the columns from the
SELECT DISTINCT clause if applicable).
But what if you don’t have any
ORDER BY clause?
Yet another trick! Use a “constant” variable:
-- Better SQL Server equivalent: SELECT b.c1 AUTHOR_ID FROM ( SELECT AUTHOR_ID c1, ROW_NUMBER() OVER (ORDER BY @@version) rn FROM BOOK ) b WHERE rn > 2 AND rn <= 3
Yes, you need to use a variable, because constants are not allowed in those
ORDER BY clauses, in SQL Server. Painful, I know.
Are we done yet!?!?
Probably not! But we have probably covered around 99% of the common and edge cases. We can sleep nicely, now.
Note that all of these SQL transformations are implemented in jOOQ. jOOQ is the only SQL abstraction framework that takes SQL seriously (with all its warts and caveats), standardising over all of this madness.
As mentioned in the beginning, with jOOQ, you just write:
// Don't worry about general emulation select().from(BOOK).limit(1).offset(2); // Don't worry about duplicate column names // in subselects select(BOOK.ID, AUTHOR.ID) .from(BOOK) .join(AUTHOR) .on(BOOK.AUTHOR_ID.eq(AUTHOR.ID)) .limit(1).offset(2); // Don't worry about invalid IN predicates select() .from(BOOK) .where(BOOK.AUTHOR_ID).in( select(AUTHOR.ID) .from(AUTHOR) .limit(1).offset(2) ); // Don't worry about the ROW_NUMBER() vs. // DENSE_RANK() distinction selectDistinct(AUTHOR_ID) .from(BOOK).limit(1).offset(2);
With jOOQ, you can just write your Oracle SQL or Transact SQL as if it were as awesome as PostgreSQL! … without jumping the SQL ship entirely, and moving on to JPA.
Now, of course, if you have been reading our blog, or our partner blog SQL Performance Explained, you should know by now that
OFFSET pagination is often a bad choice in the first place. You should know that keyset pagination almost always outperforms
|Reference:||Stop Trying to Emulate SQL OFFSET Pagination with Your In-House DB Framework! from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog.|
Get ready to program in a whole new way!
Functional Programming in Java will help you quickly get on top of the new, essential Java 8 language features and the functional style that will change and improve your code. This short, targeted book will help you make the paradigm shift from the old imperative way to a less error-prone, more elegant, and concise coding style that’s also a breeze to parallelize. You’ll explore the syntax and semantics of lambda expressions, method and constructor references, and functional interfaces. You’ll design and write applications better using the new standards in Java 8 and the JDK.