
Say, for instance, that I have a Cloud environment and a Client environment, and I want to sync a large amount of data from the cloud to the client. Let's say I have a DB table in the cloud named Files, and I want an identical table to exist in the client environment.

Now let's assume a few things:

  1. The Files table is very big.
  2. The data in each row of Files can be updated at any time, and each row has a last-update column.
  3. I want to fetch the deltas and make sure the data is identical in both environments.

My solution:

  1. I do a full sync first, returning all the entries to the client.
  2. I keep the LastSync time in the client environment and keep syncing deltas from that LastSync time.
  3. I do both the full sync and the delta syncs using paging: the client fires a first request to get the Count of results for the delta, then as many further requests as the Page Size of each request dictates (a sketch of this loop follows the queries below).

For example, the count:

SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime

The page fetching:

SELECT col1, col2..
FROM files 
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT 
OFFSET @OFFSET
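
To make the flow concrete, here is a rough Python sketch of that paged delta sync. run_query and upsert_into_client_db are hypothetical helpers (one runs SQL against the cloud DB and returns rows, the other writes rows into the client DB), and the page size of 1000 is just an example:

import datetime

PAGE_SIZE = 1000  # example page size

def sync_deltas(last_sync_time, run_query, upsert_into_client_db):
    """One delta-sync pass: count the changed rows, then page through them."""
    sync_started_at = datetime.datetime.now(datetime.timezone.utc)

    # 1) Count how many rows changed since the last sync.
    count = run_query(
        "SELECT COUNT(*) FROM files WHERE last_update > %s",
        (last_sync_time,),
    )[0][0]

    # 2) Fetch the delta page by page, ordered by id.
    for offset in range(0, count, PAGE_SIZE):
        rows = run_query(
            "SELECT col1, col2 FROM files"
            " WHERE last_update > %s"
            " ORDER BY id LIMIT %s OFFSET %s",
            (last_sync_time, PAGE_SIZE, offset),
        )
        upsert_into_client_db(rows)

    # 3) One possible choice for the next LastSync time; rows updated
    #    between the COUNT and the last page are exactly the gap
    #    described in "My problem" below.
    return sync_started_at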

My problem:

What if, for example, the first fetch (the Count fetch) takes some time (a few minutes, say), and in that time more entries are updated and now fall within the last-update window?

For example:

  • The Count fetch returned 100 entries for a last-update window of 1000 seconds.
  • 1 entry was updated while fetching the Count.
  • Now the same 1000-second last-update window gives 101 entries.
  • The page fetches, ordered by id, will only get 100 of the 101 entries.
  • 1 entry is missed and never synced to the client.

I have tried two other options:

  • Syncing with a from-to date limit on last-update.
  • Ordering by last-update instead of the id column.

I see issues with both options.

omriman12

3 Answers

  • Do not use OFFSET and LIMIT; it goes from OK to slow to slower. Instead, keep track of "where you left off" with last_update so that it can be more efficient. More Discussion

  • Since there can be dups of the datetime, be flexible about how many rows to do at once.

  • Run this continually. Don't use cron except as a 'keep-alive'.

  • There is no need for the initial copy; this code does it for you.

  • It is vital to have INDEX(last_update).

Here is the code:

-- Initialize.  Note: This subtract is consistent with the later compare. 
SELECT @left_off := MIN(last_update) - INTERVAL 1 DAY
    FROM tbl;

Loop:

    -- Get the ending timestamp:
    SELECT @cutoff := last_update FROM tbl
         WHERE last_update > @left_off
         ORDER BY last_update
         LIMIT 1  OFFSET 100;   -- assuming you decide to do 100 at a time
    -- if no result, sleep for a while, then restart

    -- Get all the rows through that timestamp
    -- This might be more than 100 rows
    SELECT * FROM tbl
        WHERE last_update > @left_off
          AND last_update <= @cutoff
        ORDER BY last_update;
    -- and transfer them

    -- prep for next iteration
    SET @left_off := @cutoff;

Goto Loop

SELECT @cutoff will be fast -- it is a brief scan of 100 consecutive rows in the index.

SELECT * does the heavy lifting, and takes time proportional to the number of rows -- no extra overhead for OFFSET. 100 rows should take about 1 second for the read (assuming spinning disk, non-cached data).

Instead of initially getting COUNT(*), I would start by getting MAX(last_update), since the rest of the code is based on last_update. This query is "instantaneous" since it only has to probe the end of the index. But I claim you don't even need that!

A possible bug: If rows in the 'source' can be deleted, how do you recognize that?

Rick James
  • I do all the queries by `last_update` as I wrote (indexed, of course), but think of the first request: in my case there can be millions of rows in that first request (since the client is fetching data by a minimum `last_update`), so it needs to sync the entire table, and that's why I must use paging. – omriman12 Jun 21 '18 at 07:51
  • So the question is about the first sync (or heavy delta syncs): you must protect it with paging, otherwise I will run into timeouts. – omriman12 Jun 21 '18 at 07:52
  • @omriman12 - You have millions of rows with _exactly the same_ `last_update`? If so, we need to consider switching to `id`. If not, my "page" is limited to 100 + (max number of identical `last_update` values); starting with the _first_ page. – Rick James Jun 21 '18 at 17:30
  • I see what you mean. To bring it back to my client-server example: my client will keep the `last_update` (which is the `left_off`) as I did, simply fetch one batch from the server (100 rows, for example), and the server will return the new `left_off` date to the client, and so on. Yes, this approach can fix my problem. – omriman12 Jun 22 '18 at 07:34
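
To make the exchange in the comments concrete, here is a rough Python sketch of the loop from this answer, phrased as the client keeping only left_off and advancing it to the cutoff returned with each batch. query and apply_rows are hypothetical helpers (query runs SQL against the source and returns rows, apply_rows writes them on the client); the batch size and table name are taken from the answer:

import time

BATCH = 100          # target rows per round trip, as in the answer
SLEEP_SECONDS = 60   # example wait when the source is caught up

def next_cutoff(query, left_off):
    """Find the last_update value that closes the next batch, or None."""
    rows = query(
        "SELECT last_update FROM tbl WHERE last_update > %s"
        " ORDER BY last_update LIMIT 1 OFFSET %s",
        (left_off, BATCH),
    )
    return rows[0][0] if rows else None

def fetch_batch(query, left_off, cutoff):
    """All rows strictly after left_off, up to and including cutoff."""
    return query(
        "SELECT * FROM tbl WHERE last_update > %s AND last_update <= %s"
        " ORDER BY last_update",
        (left_off, cutoff),
    )

def sync_forever(query, apply_rows, left_off):
    """The client keeps only left_off; each batch advances it to the cutoff."""
    while True:
        cutoff = next_cutoff(query, left_off)
        if cutoff is None:
            time.sleep(SLEEP_SECONDS)  # caught up; wait for more changes
            continue
        apply_rows(fetch_batch(query, left_off, cutoff))
        left_off = cutoff              # "where we left off"

Because each batch is closed by an inclusive cutoff timestamp rather than an OFFSET over a moving result set, rows updated during the transfer simply land in a later batch instead of being skipped.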

Depending on the size of the data, and whether it is public or shared among multiple clients, it might be helpful to split things up. For example, create complete daily 'delta' datasets and cache them. That way the database does not need to be queried over and over again for the data every client needs on its first load.

  1. Try to reduce access to the big data table as much as possible (cache off-site if the data does not change at all).
  2. Offload and cache frequently queried data, so you reduce the number of SQL queries.
  3. Creating an index on last_update and id should help speed up fetching the delta rows live from the database.

Possible solution:

  1. The database creates complete delta sets at regular intervals (hourly, for example) whenever new items are there.

  2. The client gets the "daily delta/hourly delta" on its first fetch, from the cache.

  3. The client fetches all items since the last delta (the "latest items") directly from the database, as in the sketch below.
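
A rough Python sketch of that split, assuming hypothetical helpers (fetch_cached_delta returns a pre-built daily/hourly delta set from the cache, query_db runs SQL against the database, apply_rows writes rows on the client):

def initial_sync(fetch_cached_delta, query_db, apply_rows):
    """First load: bulk data from the pre-built cache, only the tail live."""
    # 1) Bulk load from a pre-built daily/hourly delta set in the cache.
    cached = fetch_cached_delta()  # e.g. {"built_at": ..., "rows": [...]}
    apply_rows(cached["rows"])

    # 2) Only rows newer than the cached set hit the database directly.
    tail = query_db(
        "SELECT col1, col2 FROM files"
        " WHERE last_update > %s ORDER BY last_update",
        (cached["built_at"],),
    )
    apply_rows(tail)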


Georg.Duees

Your approach involves a lot of workarounds; you are heading in the wrong direction.

Start thinking about database replication; it will abstract away all those workarounds and give you tools to solve this kind of problem.

An excellent article about MySQL Server's recently added Group Replication: https://www.digitalocean.com/community/tutorials/how-to-configure-mysql-group-replication-on-ubuntu-16-04