We’ve all done pagination – it’s sensible for both lists in the UI and for processing data in batches. “SELECT (columns) FROM table LIMIT X, Y”. Possibly “SELECT (columns) FROM table ORDER BY some_column LIMIT X, Y”.
What’s the caveat? Especially when working with an ORM and when processing batches, where order doesn’t matter, you are likely to omit the ORDER BY clause. You would expect the results to be ordered by the primary key. And they are in most cases (though it is not guaranteed), but when you EXPLAIN the query, you can see that the index is not used for the query – a full table scan is performed to fetch the results, which is slow. Note that if we omit the LIMIT clause, again a full table scan is used, but this is something you’d rarely do – i.e. a query with no WHERE clause and
So, rule #1: always include an ORDER BY clause in order to use the index.
For lists in the UI you’d usually order by some column – date modified, name, etc. That also performs a full table scan, unless you have an index on that column.
So, rule #2: always have an index on the column you ORDER BY
Kind of obvious things, but they may be silent performance issues lurking in your project. And now onto something not so obvious, that surprised me. When using LIMIT X, Y, MySQL scans the whole table (up until X+Y), regardless of the index. Note that even if the “type” (in the result of EXPLAIN) is not ALL (full table scan), but “index”, MySQL may still scan most of the rows – check the “rows” column. But it can also be misleading (in case of LIMIT it shows that rows = everything, but stops when it fills the result with the desired number of rows). So if you have LIMIT 100000, 50 MySQL will scan 100050 rows. This makes the query slow (the query time reflects that – the larger the offset, the longer the query). This is due to MySQL not maintaining an index on the row number. It cannot use the primary key, because even if it is auto_increment, it has gaps. Luckily, in huge tables you normally have a WHERE clause which forces the use of an index and reduces the number of rows to scan. Also, in a UI you will rarely find yourself querying for the 100000th record. But anyway it is something to consider, especially in the case of batch processing jobs which need to go through the whole table. You do that in batches, because 1. ORMs may store everything in memory and just kill your application 2. the transaction will become really long.
So, rule #3: when performing batch operations on tables with a lot of rows, do not use LIMIT. Use WHERE id > X and id < Y (or BETWEEN)
This does not guarantee that each batch will be the same size, but that doesn’t matter.
The rules above are “rule of thumbs”, rather than “must dos”, but it’s good to have them in mind. The bottom-line here is that you should analyze your performance, find specifically slow queries, then EXPLAIN them and see how they can be optimized and what would be the impact of the optimization.
P.S. Here’s a presentation on the low level details of MySQL engines regarding indexes.
This guide will introduce you to the world of Software Architecture!
This 162 page guide will cover topics within the field of software architecture including: software architecture as a solution balancing the concerns of different stakeholders, quality assurance, methods to describe and evaluate architectures, the influence of architecture on reuse, and the life cycle of a system and its architecture. This guide concludes with a comparison between the professions of software architect and software engineer.