It’s ubiquitous on data driven web sites: the result set pager. We’ve all used them whether we built them from scratch or used one provided by the framework.
Pagers are by nature performance suckers because we’re asking the database to re-run the same query for each “page”, slicing off just one set of contiguous rows for each page. If your result set is 10,000 rows long but you’re only paging through them 10 rows at a time, that’s potentially 1,000 database requests to view the entire set.
But it’s worse than that because in order to provide those nifty pager controls, like those in the image above, the software has to know how many rows are in the larger result set so it can do the math to populate the navigation for those page numbers. In other words, using the above example the software needs to know that there is a Page 14 to jump to.
A little background first. Internally, garden variety pagers are pretty much the same. They request a fixed number of rows to display per page, like 10. That becomes your LIMIT filter in the database query:
SELECT * FROM people LIMIT 10;
To create the page navigation you need to do some math to generate the OFFSET. For instance, using a page size of 10, the query for a Page 3 display would look like this:
SELECT * FROM people OFFSET 20 LIMIT 10;
That tells the database to send us 10 rows of results starting from the 20th row of the result set.
But how do we know there is a 20th row in the result set so we can pre-construct that “3″ link above? We need to know how many rows are in the larger (outer) result set. But COUNT(*) is useless because it only gives us the count of the filtered result set: 10 rows.
What the pager typically does is run the same query TWICE for each page, once without LIMIT/OFFSET (and any SORT) to get the count of the larger result set so it can construct those pager controls and once again with them to get the bounded result set for the page display. Now that 10,000 row result set in the above example would require 2,000 database requests. That’s expensive!
If you’re using PostgreSQL and at least version 8.4, you’re in luck. 8.4 is when PostgreSQL introduced the Window functions. Using the OVER() clause on an aggregate function like COUNT(*) will return the number of rows in the larger query before applying LIMIT and OFFSET:
SELECT *, COUNT(*) OVER() AS full_count FROM people OFFSET 20 LIMIT 10;
Now all you’ve got to do is modify your pager to extract the full_count column from the first row of the result set, i.e.
Then you can eliminate that count query, probably saving at least 40% of your overhead on each page.
There are additional things you can do to reduce pager overhead on the database. For instance, you could cache the entire result set in amemcached server or in APC shared memory and just page through that. Of course, you would need to develop a strategy to expire a/k/a “blow the cache on” those results when you’re finished with them. I’ll post more about that later. You can also adopt the “flowing wall” approach to presenting database content similar to that used on Facebook and Pinterest which eliminates the pager entirely but at the expense of being able to click to specific pages.
Here’s a non-hydrating pager I built for Symfony 1 which employs this PG Window technique. The requirement is that the source query contain this column in the result set:
COUNT(*) OVER() AS full_query_count