Software Development

ID Lists Aren’t the Best Solution for the N+1 Problem

In their eternal attempts to circumvent the N+1 problem, Hibernate users often resort to IN predicates with ID lists. In this post, we’ll see how those users might just be replacing a horrible thing with a bad one, which is better but not yet good. Here’s why:

The N+1 Problem

The N+1 problem is a well understood issue, documented in various blog posts. The previously linked article shows the following set of queries to explain the nature of this problem:
 
 

SELECT id, name FROM albums
SELECT id, name FROM songs WHERE album_id = 1
SELECT id, name FROM songs WHERE album_id = 2
SELECT id, name FROM songs WHERE album_id = 3
SELECT id, name FROM songs WHERE album_id = 4
SELECT id, name FROM songs WHERE album_id = 5

This set of queries is often produced by ORMs such as Hibernate, when entities are configured to be lazy fetched.

The article also tackles the problem by replacing the second set of N=5 queries by a single query with an IN predicate:

SELECT id, title, filename FROM songs
WHERE album_id IN (1, 2, 3, 4, 5)

This will reduce the number of queries from N+1 to 1+1, which is certainly faster.

But let’s look at things from the SQL side

Is such an IN predicate with an ID List a good solution? It is certainly viable for very small lists. But when your list grows, consider these things:

IN list size

Not all databases support arbitrary lengths of IN lists. In particular the following limitations exist:

  • Oracle IN predicate: 1000 elements
  • Ingres: 1024 total bind values
  • SQLite: 999 total bind values
  • Sybase ASE: 2000 total bind values
  • SQL Server 2008 R2: 2100 total bind values

This is quite annoying as developers have to probably learn the above the hard way. If you’re using jOOQ, you can “safely” ignore the above constraints as jOOQ will rewrite your query such that:

  • Large Oracle IN predicates are split into several OR-connected IN predicates, or AND-connected NOT IN predicates
  • Large amounts of bind values are detected at SQL rendering time and replaced by inline values

Variable binding speed

There’s quite a bit of work involved with variable binding in some JDBC drivers. Essentially, some database protocols will need to transfer many values one-by-one to the database. This doesn’t happen when you inline bind values, as the only thing transferred is a single SQL string. Having too many bind values (I’m talking about 10k or more), is certainly not a good idea.

Cursor cache misses

Sophisticated databases such as Oracle maintain cursor caches, which can be leveraged for cursor sharing. This means that subsequent executions of identical SQL statements will profit from expensive execution plan calculations having been done already, along with cursor statistics being collected. Think about it this way:

-- This is the first time Oracle encounters this 
-- query. The DB has to parse the query and 
-- calculate an execution plan, which can be quite 
-- expensive if you have lots of JOINs
SELECT id, name FROM songs 
WHERE album_id IN (?, ?)

-- This is the second time Oracle encounters this 
-- same query. The DB can now re-use the previous
-- execution plan as it is likely to be optimal 
-- again
SELECT id, name FROM songs 
WHERE album_id IN (?, ?)

-- This is not the same query as the previous ones
-- A new execution plan has to be calculated
SELECT id, name FROM songs 
WHERE album_id IN (?, ?, ?)

As you can quickly see, the above example shows that an ID list in an IN predicate is a moving target, which is likely to remove the usefulness of bind values entirely, as each query is prone to produce new cursors and new execution plans in your database. You might as well have inlined your bind values, which would have even helped you prevent bind value peeking issues.

So what’s better than ID lists?

There are a number of things that are better. Note that not all of them may be suitable for your concrete problem, and not all of them will always outperform ID lists. Use common sense and maybe a load and/or performance test to be sure, which is the best query in your situation.

Explicit “eager” fetching, using JOINs

Sometimes, it would just be easier to denormalise the data in the database. Instead of fetching songs one by one, just fetch them along with the albums:

SELECT
  a.id a_id, 
  a.name a_name,
  s.id s_id,
  s.name s_name
FROM albums a
JOIN songs s ON s.album_id = a.id

This will transfer more data over the wire (repeating album information) in exchange for executing only a single query (reducing N+1 to 1). This is only good for slight denormalisations. If you JOIN dozens of 1:N relationships, you might not be happy with this solution.

Semi-joining the original query

If you can access the original query’s SQL code, just semi-join it when fetching songs! It’s simple:

SELECT id, name FROM songs 
WHERE album_id IN (
  SELECT id FROM albums
)

-- Or using EXISTS
SELECT s.id, s.name FROM songs s
WHERE EXISTS (
  SELECT 1 FROM albums a
  WHERE a.id = s.album_id
)

This will require some SQL transformation. Again, using a typesafe query builder / SQL builder to compose queries, such as jOOQ, JaQu or Criteria API, you may be able to implement such SQL transformation / SQL composition more easily.

Note that this is probably the fastest solution that you can choose, at least in sophisticated databases with powerful query optimisers.

Using arrays for ID lists

If you really cannot query your songs without an ID list, at least, use a single array as a bind variable as such (Oracle dialect):

SELECT id, name FROM songs 
WHERE album_id IN (
  SELECT * FROM TABLE(?)
)

The above syntax is Oracle-specific. Check out this Stack Overflow question for other alternatives. Note that Oracle’s VARRAY and TABLE types are strongly typed, i.e. you will have to have such a type, first:

CREATE TYPE numbers AS TABLE OF NUMBER(38);

Alternatively, you can use one of these “built-in” table types:

  • ORA_MINING_NUMBER_NT
  • ORA_MINING_VARCHAR2_NT

Creating discrete-sized IN lists

If your database doesn’t support arrays, and you need to rely on ID lists, there is one last option that you may have to avoid too many cursor cache misses and hard parses. Create discrete-sized IN lists, filling up the bind values to the next discrete length. Let’s assume lengths 2, 3, 5, 8, 13. This is best explained by example:

-- Of course, this only makes sense with bind values
-- Inlining is done for the purpose of the example
-- only

-- Two IDs   fill up to 2
album_id IN (1, 2)

-- Three IDs fill up to 3
album_id IN (1, 2, 3)

-- Four IDs  fill up to 5
album_id IN (1, 2, 3, 4, 4)

-- Five IDs  fill up to 5
album_id IN (1, 2, 3, 4, 5)

-- Six IDs   fill up to 8
album_id IN (1, 2, 3, 4, 5, 6, 6, 6)

There is no rule of thumb at what steps your IN list sizes should increase, so you might want to actually measure this.

Note!: You may use NULL to fill up IN lists of an IN predicate, but not of a NOT IN predicate. To learn more about this, read this blog post about NULL and NOT IN predicates.

TL;DR: Get back in control of your SQL

As soon as a decent amount of data is involved with your data processing, common ORM models may not be sufficient anymore, as it is very hard to tune such ORMs. You may need to resort to SQL and explicitly express your SQL statements in the most optimal way for your problem domain.
 

Lukas Eder

Lukas is a Java and SQL enthusiast developer. He created the Data Geekery GmbH. He is the creator of jOOQ, a comprehensive SQL library for Java, and he is blogging mostly about these three topics: Java, SQL and jOOQ.
Subscribe
Notify of
guest

This site uses Akismet to reduce spam. Learn how your comment data is processed.

0 Comments
Inline Feedbacks
View all comments
Back to top button