
I have a sample project using Spring Boot with spring-data-jpa and a PostgreSQL database with one table.

I'm trying to INSERT 10,000 records into the table in a loop and measure the execution time, with the EntityManager flush() call either enabled or disabled for every 100 records.

The expected result is that the execution time with flush() enabled is much less than with it disabled, but I actually get the opposite result.

UserService.java

package sample.data;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class UserService {
    @Autowired
    UserRepository userRepository;

    public User save(User user) {
        return userRepository.save(user);
    }
}

UserRepository.java

package sample.data;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface UserRepository extends JpaRepository<User, Long> { }

Application.java

package sample;

import org.springframework.data.jpa.repository.config.EnableJpaRepositories;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.transaction.annotation.Transactional;

import sample.data.User;
import sample.data.UserService;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@SpringBootApplication
@EnableJpaRepositories(considerNestedRepositories = true)
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Autowired
    private UserService userService;

    @PersistenceContext
    EntityManager entityManager;

    @Bean
    public CommandLineRunner addUsers() {
        return new CommandLineRunner() {
            @Transactional
            public void run(String... args) throws Exception {
                long incoming = System.currentTimeMillis();
                for (int i = 1; i <= 10000; i++) {
                    userService.save(new User("name_" + i));

                    if (i % 100 == 0) {
                        entityManager.flush();
                        entityManager.clear();
                    }
                }
                entityManager.close();
                System.out.println("Time: " + (System.currentTimeMillis() - incoming));
            }
        };
    }
}

5 Answers


Make sure you enable JDBC batching in your persistence provider configuration. If you're using Hibernate, add this to your Spring properties:

# 20 is a starting point; tune to your workload
spring.jpa.properties.hibernate.jdbc.batch_size=20

Without batching enabled, I suspect the performance regression comes from the overhead of clearing the persistence context every 100 entities, but I'm not sure about that (you'd have to measure).
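A sketch of related Hibernate settings worth enabling alongside the batch size (property names are from Hibernate's documented configuration; note that Hibernate silently disables insert batching for entities that use an IDENTITY id generator):

```properties
spring.jpa.properties.hibernate.jdbc.batch_size=20
# Group statements by entity type so consecutive inserts can actually be batched
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
```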

UPDATE:

Actually, enabling or disabling JDBC batching does not change the fact that calling flush() periodically will not be faster than not calling it. A manual flush() does not control how the flushing is done (via batched statements or unitary inserts); it controls when pending changes are written to the database.

So what you're comparing is the following:

  1. With flush() every 100 objects: you insert 100 instances into the database on each flush, and you do this 10,000 / 100 = 100 times.
  2. Without flush(): you collect all 10,000 objects in the persistence context in memory and perform 10,000 inserts when the transaction commits.

JDBC batching, on the other hand, affects how the flushing occurs, but the same number of statements is issued with flush() as without it.
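The two scenarios above can be sketched in plain Java (the counts mirror the question's loop; the class and method names are purely illustrative):

```java
public class FlushCount {
    // Number of manual flush() calls needed for `total` inserts
    // when flushing every `every` entities.
    static int manualFlushes(int total, int every) {
        int flushes = 0;
        for (int i = 1; i <= total; i++) {
            if (i % every == 0) {
                flushes++; // each flush writes `every` pending inserts
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        // 100 flushes of 100 rows each, vs. one flush of 10,000 rows
        // at commit: either way, 10,000 INSERT statements reach the DB.
        System.out.println(manualFlushes(10_000, 100)); // prints 100
    }
}
```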

The benefit of flushing and clearing periodically inside the loop is to avoid a possible OutOfMemoryError caused by the persistence context holding too many objects.

M A

Writing a microbenchmark is hard, as Aleksey Shipilev illustrates at length in his "JMH vs Caliper: reference thread" post. Your case is not exactly a microbenchmark, but:

  1. 10,000 repetitions is too few to let the JVM warm up and JIT-compile the code under the default settings. Warm up the JVM before measuring code performance.

  2. Use System.nanoTime(), not System.currentTimeMillis(), for measuring elapsed time. System.currentTimeMillis() follows wall-clock time, which can be adjusted (e.g. by NTP) and skew your results; System.nanoTime() is a monotonic clock intended for measuring elapsed intervals.

  3. You most likely want to measure on the database side as well to pinpoint the bottleneck. Without knowing the bottleneck, it's hard to identify the root cause: for example, your database might be on the other side of the Atlantic Ocean, in which case the network round-trip cost will overshadow the cost of the INSERT statements themselves.

  4. Is your benchmark sufficiently isolated? If the database is shared by other users and connections besides your benchmark, its performance will vary.

Find the bottleneck in the current setup, make an assumption on how to verify it, change the benchmark to match the assumption and then measure again to confirm. That's the only way to figure it out.
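A minimal sketch of the timing advice in point 2, with a warm-up phase per point 1 (the class and helper names are illustrative):

```java
public class Timing {
    // Measure the elapsed wall time of a task in nanoseconds using the
    // monotonic System.nanoTime() clock, which is safe against
    // wall-clock adjustments, unlike System.currentTimeMillis().
    static long elapsedNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Warm-up runs first, so the JIT has a chance to compile the hot path.
        for (int i = 0; i < 5; i++) {
            elapsedNanos(() -> { /* workload under test */ });
        }
        long nanos = elapsedNanos(() -> { /* workload under test */ });
        System.out.println("Time: " + nanos / 1_000_000 + " ms");
    }
}
```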

Karol Dowbecki

Can you please explain why you believe:

Expected result is that execution time with enabled flush() method is much less than with disabled one

It seems to me that this is a fundamentally faulty assumption. There is no strong reason to believe that performing this trivial operation 10,000 times will be faster with periodic flushes than without them.

As long as all of the records fit into memory, I would expect the version without intermediate flushes to be faster. What suggests that performing network I/O to reach the database 100 times should be faster than doing it once at the end?

Ben M

The most expensive part of persisting an entity is writing to the database. The time JPA spends persisting the entity is trivial in comparison, since it is a pure in-memory operation: it's I/O compared to memory.

Writing to the database also carries a significant fixed overhead per round trip, which means the number of times you write to the database affects execution time. When you invoke EntityManager#flush, you instruct Hibernate to write all pending changes to the database immediately.

So you are comparing an execution with 100 database writes to one with a single database write. Due to the I/O overhead, the former will be significantly slower.

Tobb

Two aspects not mentioned by the other answers: besides flushing, you need to clear the Hibernate session. Without clearing, it will keep growing, and its memory consumption can itself become a performance penalty.

One more thing: when persisting entities, make sure your ID generator uses a hi/lo or block-allocating sequence strategy. If IDs are fetched one at a time (1, 2, 3, 4, 5, ...), each insert needs an extra round trip just to obtain the next ID.
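A plain-Java sketch of why block allocation matters (the numbers and names are illustrative; in JPA this is typically configured via @SequenceGenerator's allocationSize):

```java
public class IdAllocation {
    // Database round trips needed to obtain `count` ids when the
    // generator fetches ids in blocks of `allocationSize`.
    static int roundTrips(int count, int allocationSize) {
        // One trip per exhausted block (ceiling division).
        return (count + allocationSize - 1) / allocationSize;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(10_000, 1));  // one id per trip: 10000 trips
        System.out.println(roundTrips(10_000, 50)); // hi/lo-style blocks: 200 trips
    }
}
```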

Alexander Petrov