Spring Data Repository - Paging large data sets (EclipseLink)

Question

I am using Spring Data with EclipseLink JPA to do server side pagination on a database result set. I have everything working and I get the expected paged results, but I noticed performance suffering on large data sets (several million rows). It is taking about 5 minutes to return a page of 20 results. Perhaps this is to be expected, but what concerned me was the query output.

My log output:

SELECT COUNT(filename) FROM document
SELECT filename, datecaptured, din, docdate, docid, doctype, drawer, foldernumber, format, pagenumber, tempfilename, userid FROM document ORDER BY din ASC

I would understand that in order to page, Spring would need to know the max row count, so the first query makes sense.

The second query is pulling the entire database when I specifically only asked for 20 results with a 0 offset (page).

Does Spring/EclipseLink/JPA in fact grab the entire data set and then only return the subset paged request?

If that is the case, how should I modify my repository class to be more efficient?

My test case:

@Test
public void getPagedDocumentsTest() throws IOException {
    Page<Document> requestedPage = documentRepository.findAll(new PageRequest(0, 20, Sort.Direction.ASC, "din"));

    Assert.assertNotNull("Page is null", requestedPage);
    Assert.assertNotNull("Page is empty", requestedPage.getContent());

    List<Document> documents = requestedPage.getContent();

    LOG.info("{}", documents);
    LOG.info("{}", documents.size());
}

My repository class:

import org.springframework.data.jpa.repository.JpaSpecificationExecutor;
import org.springframework.data.repository.PagingAndSortingRepository;
import org.springframework.stereotype.Repository;

import com.example.data.model.Document;

@Repository
public interface DocumentRepository extends PagingAndSortingRepository<Document, String> {

}

Edit - per @Chris's suggestion

Tried adding the platform to my properties, but it didn't make a difference:

eclipselink.weaving=static
eclipselink.allow-zero-id=true
eclipselink.target-database=SQLServer
eclipselink.logging.level=FINE

Also tried adding it to my configuration (I'm using Java Config):

@Bean
public LocalContainerEntityManagerFactoryBean entityManager() {
    LocalContainerEntityManagerFactoryBean factory = new LocalContainerEntityManagerFactoryBean();
    factory.setPersistenceUnitName("ExampleUnit");
    factory.setPackagesToScan("com.example.data.model");

    EclipseLinkJpaVendorAdapter eclipseLinkVendorAdapter = new EclipseLinkJpaVendorAdapter();
    eclipseLinkVendorAdapter.setDatabase(Database.SQL_SERVER);
    eclipseLinkVendorAdapter.setDatabasePlatform("SQLServer");
    factory.setJpaVendorAdapter(eclipseLinkVendorAdapter);

    factory.setDataSource(dataSource());
    factory.setJpaProperties(jpaProperties());
    factory.setLoadTimeWeaver(new InstrumentationLoadTimeWeaver());

    return factory;
}

Looks like the platform is set correctly.

[EL Config]: connection: 2015-08-06 12:04:05.691--ServerSession(686533955)--Connection(1896042043)--Thread(Thread[main,5,main])--connecting(DatabaseLogin(
    platform=>SQLServerPlatform
    user name=> ""
    connector=>JNDIConnector datasource name=>null
))

But neither helped. The SQL query output remained the same as well.

Edit

Found a related question with a similar answer from @Chris:

EclipseLink generated SQL doesn't include pagination

Try specifying the target database platform EclipseLink is to use in the persistence properties to ensure that if the max and first results are being set, it will use database filtering (limit/rownum built into the SQL) rather than JDBC filtering. — Chris, Aug 06 '15 at 15:33
Thanks for the suggestion @Chris, but adding the platform didn't fix the problem; see my edits. — CoffeePasta, Aug 06 '15 at 17:10
You should create an index on `din`. That way the (sorted) index will be used to find the page corresponding database rows,instead of doing a full table scan followed by a sort. — Ori Dar, Aug 06 '15 at 17:23
I added the index but it didn't resolve the issue; execution time remains around 5 minutes. — CoffeePasta, Aug 06 '15 at 17:48

Apokralipsa · Answer 1 · 2015-08-15T12:40:50.623

One thing you should consider is whether you actually need to know the number of pages / total number of elements. If you are returning a page from a result set that has milions of elements, chances are your users will not be interested in looking through all those pages either way :). Maybe your front end shows the data in a an infinite scroll that just needs to know, if there are any more pages, instead of number of pages.

If any of those cases apply to you, you should consider returning a Slice instead of a Page, as in:

public Slice<MyClass> findByMyField(..);

This way, instead of doing the expensive Count, Spring Data will just ask for one more element than you originally wanted. If that element is present, the Slice will return true from the hasNext method.

Where I work we recently used Slices for several large data sets and with the right indexes (and after clearing the database cache :) we have seen some really significant gains.

Nice suggestion @Apokralipsa. I did notice a sizable increase in performance by using Slice instead; as you mentioned, the `count` was taking a considerable amount of time. I will definitely keep this in mind if I can get away with not having to page my results. — CoffeePasta, Aug 07 '15 at 18:31
You can page your results using Slices. You can declare a `Pageable` parameter just as when using Page. You just don't know how much pages are there ahead of time. — Apokralipsa, Aug 07 '15 at 18:34

score 0 · Accepted Answer · answered Aug 07 '15 at 13:28

EclipseLink 2.5 source that I checked I believe has support for database level filtering built into the following database platform classes:

DB2Platform
DerbyPlatform
FirebirdPlatform
H2Platform
HANAPlatform
HSQLPlatform
MySQLPlatform
OraclePlatform
PostgreSQLPlatform
SymfowarePlatform

Each of these override the printSQLSelectStatement method to take advantage of their respective database features to allow filtering in the SQL itself. Other platforms will need to use JDBC filtering, which depend on the driver to restrict rows - they may be able to optimize queries, but it is driver specific and I believe it is why your query takes longer then you desire.

I don't know SQLServer well enough to say what equivalent functionality it has that can be used within the SQL, but if you find it, you would need to create a SQLServerPlatform subclass, override the printSQLSelectStatement method as is done in the above classes, and then specify that platform class be used instead. Please also file a bug/feature to have it included in EclipseLink.

Other options are described here: http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination

Spring Data Repository - Paging large data sets (EclipseLink)

2 Answers2