
I have an application working with a significant amount of data (100 GB+) stored in ESENT. The table schema is: 12-byte JET_bitColumnFixed keys and JET_coltypLongBinary values with a typical size of around 2 KiB. The page size is set to 32 KiB. I don't alter the default 1024-byte size threshold for external long values, so I think these values are mostly stored externally.

I am interested in improving the cold cache seek and retrieve performance, because the operations happen in batches and the keys are known in advance. As far as I understand, the JetPrereadKeys() API is designed to improve performance in such cases, but as it turns out, I don't see any changes in the actual behavior with or without this call.

More details follow:

  • In my case JetPrereadKeys() always reports an adequate number of pre-read keys, equal to the number of keys I have submitted when calling the API. The submitted keys are appropriately sorted, as stated in the documentation.

  • I tried both synchronous and asynchronous approaches, where the asynchronous approach is: send the pre-read call to a thread pool, while continuing to seek and retrieve data on the current thread.

  • I tried both available caching modes of ESENT, either where it uses MMAP or a dedicated page cache, by trying all available combinations of the JET_paramEnableViewCache and JET_paramEnableFileCache parameters.

  • I cannot, with a small exception, see any difference in the logged I/O operations with and without the pre-read. That is, I would expect this operation to result in a (preferably, asynchronous) fetch of the necessary internal nodes of a B-Tree. But the only thing I see is an occasional synchronous small read coming up from the stack of the JetPrereadKeys() itself. The size of the read is small, in the sense that I don't think that it could possibly prefetch all the required information.

  • If I debug the Windows Search service, I can break on various calls to JetPrereadKeys(). So there is at least one real-world example where this API is being called, presumably for a reason.

  • All my experiments were performed after a machine restart, to ensure that the database page cache is empty.


Questions:

  1. What is the expected behavior of JetPrereadKeys() in the described case?

  2. Should I expect to see a different I/O pattern and better performance if I use this API? Should I expect to see a synchronous or an asynchronous pre-read of the data?

  3. Is there another approach that I could try to improve the I/O performance by somehow hinting ESENT about an upcoming batch?

M. Williams

1 Answer


The JetPrereadKeys() API does sync reads down to the parent-of-leaf level, and then enqueues async IOs for all the leaf pages needed for the desired keys / records ... I think that answers #2. If the main record tree of your table (note the burst Long Values / LVs are stored in a separate tree) is shallow or entirely cached, then JetPrereadKeys() may not help. However, if the primary tree on the table is large and deep, then this API can help dramatically ... it just depends upon the shape and spread of the data you are retrieving. You can tell some basics about your table by dumping space, looking at the depth of the trees, and getting a sense of the "Data" pages; might I suggest:

esentutl /ms Your.Db /v /fName,Depth,Internal,Data

This lists the name of each table, its depth, how many internal pages, and how many leaf-level data pages it has. Separate lines will be listed for the main record tree (under the table name), and then for the LV tree as "[Long Values]" below it.

Also note that this preread of keys does not extend to the burst LVs ... so there again, if you immediately read a burst LV column, you'll stall behind synchronous IO, unfortunately.

The default mode is for ESE to allocate and control its own database buffer / page cache exclusively. JET_paramEnableFileCache is primarily meant for (usually smaller) client processes that quit (or at least JetTerm / JetDetach their DB) and restart a lot, where ESE's private buffer cache would be lost on every quit ... with JET_paramEnableFileCache set, the data may still be warm in the OS file cache if the process quit recently. It is not recommended for large DBs, though, because it causes data to be double cached: once in the ESE buffer cache and once in the NTFS / ReFS file cache. JET_paramEnableViewCache enhances the previous param and ameliorates this double caching somewhat ... but it can only save memory / avoid double buffering on clean / unmodified page buffers. For big DBs, leave both of these params off / false. Also, if you do not use these params, it is easier to test cold perf ... just copy a big (100 MB, maybe 1 or 2 GB) file around a couple of times on your HD after your app quits (to clear the HD cache), and your data will be cold. ;-)
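As a configuration sketch of the recommendation above: both params default to off, but setting them explicitly documents the intent. This is Windows-only code (esent.h / esent.lib assumed); the instance handle and error handling are elided:

```c
/* Windows-only sketch: keep both caching params off for a large DB,
 * so ESE uses only its own private buffer cache. */
#include <windows.h>
#include <esent.h>

void configure_caching(JET_INSTANCE inst)
{
    /* 0 = disabled, which is also the default for both parameters. */
    JetSetSystemParameter(&inst, JET_sesidNil, JET_paramEnableFileCache, 0, NULL);
    JetSetSystemParameter(&inst, JET_sesidNil, JET_paramEnableViewCache, 0, NULL);
}
```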

So now that we have mentioned caching ... one final thing, which I think is probably your actual problem (if it isn't the "shape of your data" I mentioned above): open perfmon and find the "Database" and/or "Database ==> Instances" perf objects (these are for ESENT). See how big your cache size is [either "Database Cache Size" or "Database Cache Size (MB)"] and how big your available pool is ["Database Cache % Available"] ... you'll of course have to take this percentage and do the math against your database cache size to get an idea. BUT if this is low, it could be your problem, because JetPrereadKeys will only use already-available buffers, so you have to have a healthy / big enough available pool. Either increase JET_paramCacheSizeMin to be larger, or set JET_paramStartFlushThreshold / JET_paramStopFlushThreshold to keep your available pool at a larger % of the total cache size. Note that they are set as a proportion of JET_paramCacheSizeMax, so setting:

JET_paramCacheSizeMin        =    500
JET_paramCacheSizeMax        = 100000
JET_paramStartFlushThreshold =   1000
JET_paramStopFlushThreshold  =   2000

would mean your start and stop thresholds are 1% and 2%, respectively, of your current cache size, whatever it happens to be. So if the cache is at 500 buffers (the min), 5 and 10 would be your start / stop thresholds, i.e. the range your available pool would stay in; if the cache later grew to 10000 buffers, then your available pool would range between 100 and 200 buffers. Anyways, you want these numbers to give a good enough range that you have plenty of buffers for all the leaf pages JetPrereadKeys may want.

I didn't explain every term in this answer, because you look pretty advanced above - talking B-tree internal nodes and such ... but if something isn't clear, just ask and I'll clear it up.

Thanks,

Brett Shirley [MSFT]

Extensible Storage Engine Developer

This posting is provided "AS IS" with no warranties, and confers no rights.

P.S. - One last thing you may enjoy playing around with: JetGetThreadStats / JET_THREADSTATS. It tells you some of the internal operations we do under the API. You basically read the values before and after a JET API call, and subtract them to get the # of operations for that JET API. You will see cPagePreread in there ... this is a good way to see whether JetPrereadKeys is dispatching the async IOs that should help perf. Note that particular counter was unfortunately broken in an older OS, but I don't remember when it was broken and fixed ... win7 to win8, win8 to win8.1. If you are on Win10, then no problem, it was definitely fixed by then. ;-) And cPageRead is sync page reads (which may go up for the internal nodes) ... I think you'll find these very instructive for the costs of various JET APIs.