Pagination is hard when your content rankings can change quickly, and even harder when those rankings differ per-user. (Let's treat infinite scroll as a type of pagination where the links are invisible.) There are two hard problems: newly-added content at the top, and reranked content.
Let's forget about newly-added content, and accept that you'll have to refresh page 1 to see it. Let's also pretend we're doing pure ORDER BY position; if you're ordering by something else, you might have to use window functions. Our page_size is 3; page isn't a real column, but is the calculated page_num based on page_size. Here are our posts:
+----+----------+-----------+------+
| id | position | animal | page |
+----+----------+-----------+------+
| 1 | 1 | Alpacas | 1 |
| 2 | 2 | Bats | 1 |
| 3 | 3 | Cows | 1 |
| 4 | 4 | Dogs | 2 |
| 5 | 5 | Elephants | 2 |
| 6 | 6 | Foxes | 2 |
+----+----------+-----------+------+
After we fetch page 1, and before we fetch page 2, Cows gets demoted from #3 to #4, while Dogs get promoted from #4 to #3. How do we handle it?
Offset/limit approach
This is the typical naive approach; in Rails, it's how will_paginate and Kaminari work. If I want to fetch page 2, I'll do
SELECT * FROM posts
ORDER BY posts.position
OFFSET ((:page_num - 1) * :page_size)
LIMIT :page_size;
which gets rows 3-6. I'll never see Dogs, and I'll see Cows twice.
Last seen ID approach
Reddit improves on this. Instead of calculating the first row based on page size, the client tracks the ID of the last item you've seen, like a bookmark. When you hit "next", they start looking from that bookmark onward:
SELECT * from posts
JOIN posts AS last
ON posts.id = last.id
AND last.
WHERE posts.id > :last_seen_id
ORDER BY posts.id;
On page 1 I saw Alpacas, Bats and Cows. Dogs then got promoted above Cows. When I fetch page 2, I fetch page_size rows starting with "the one after Cows", so I see Elephants and Foxes. There are no duplicates, but I still never got to see Dogs.
Server side state
HackerNews solves this with server-side continuations; they store the entire result set for you (or at least several pages in advance?), and the "More" link references that continuation. When I fetch page 2, I ask for "page 2 assuming that page 1 consisted of Alpacas, Bats and Cows". This means my page 2 is guaranteed to consist of dogs, elephants, and foxes; there are no duplicates, there are no missing rows, and all I lose is the knowledge that dogs and cows switched places. The downside is that I have to store a lot of state on the server; on HN, that's stored in RAM, and in reality those continuations often expire before you can press the "More" button, forcing you to go all the way back to page 1 to find a valid link.
Are these the only three possible approaches? If so, is there an intuitive proof that these ARE the only possibilities? If not, are there computer-science concepts that would give me Google juice to read about this? Are there ways to approximate the continuation approach without storing the entire result set? About all that comes to mind, long term, is event-streaming/point-in-time systems, where "the result set as of the moment I fetched page 1" is forever derivable. Short of that... suggestions? FWIW, we're on PostgreSQL at the moment.
We're going with the server-side state approach for now, caching the entire result on the first query so we always return a consistent list. This will work as long as our query already returns all rows; eventually we'll need to use a nearest-neighbor approach and that wont work.
But I think there's a fourth possibility, which scales very well, as long as:
- You don't need a guarantee of no duplicates, only a high likelihood
- You're okay with missing some content during scrolls, as long as you avoid duplicates
The solution is a variant of the "last seen ID" solution: Have the client keep not one, but 5 or 10 or 20 bookmarks - few enough that you can store them efficiently. The query ends up looking like:
SELECT * FROM posts
WHERE id > :bookmark_1
AND id > :bookmark_2
...
ORDER BY id
As the number of bookmarks grows, the odds rapidly diminish that you are (a) starting at some point past all n bookmarks but (b) seeing duplicate content anyway because they were all reranked.
If there are holes, or better answers in the future, I'll happily unaccept this answer.


