23

I'm trying to write a method that will return a Hibernate object based on a unique but non-primary key. If the entity already exists in the database I want to return it, but if it doesn't I want to create a new instance and save it before returning.

UPDATE: Let me clarify that the application I'm writing this for is basically a batch processor of input files. The system needs to read a file line by line and insert records into the db. The file format is basically a denormalized view of several tables in our schema, so what I have to do is parse out the parent record and either insert it into the db so I can get a new synthetic key, or, if it already exists, select it. Then I can add additional associated records in other tables that have foreign keys back to that record.

The reason this gets tricky is that each file needs to be either totally imported or not imported at all, i.e. all inserts and updates done for a given file should be a part of one transaction. This is easy enough if there's only one process that's doing all the imports, but I'd like to break this up across multiple servers if possible. Because of these constraints I need to be able to stay inside one transaction, but handle the exceptions where a record already exists.

The mapped class for the parent records looks like this:

@Entity
public class Foo {
    @Id
    @GeneratedValue(strategy = IDENTITY)
    private int id;
    @Column(unique = true)
    private String name;
    ...
}

My initial attempt at writing this method is as follows:

public Foo findOrCreate(String name) {
    Foo foo = new Foo();
    foo.setName(name);
    try {
        session.save(foo);
    } catch (ConstraintViolationException e) {
        foo = (Foo) session.createCriteria(Foo.class).add(eq("name", name)).uniqueResult();
    }
    return foo;
}

The problem is that when the name I'm looking for already exists, an org.hibernate.AssertionFailure exception is thrown by the call to uniqueResult(). The full stack trace is below:

org.hibernate.AssertionFailure: null id in com.searchdex.linktracer.domain.LinkingPage entry (don't flush the Session after an exception occurs)
    at org.hibernate.event.def.DefaultFlushEntityEventListener.checkId(DefaultFlushEntityEventListener.java:82) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.event.def.DefaultFlushEntityEventListener.getValues(DefaultFlushEntityEventListener.java:190) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.event.def.DefaultFlushEntityEventListener.onFlushEntity(DefaultFlushEntityEventListener.java:147) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.event.def.AbstractFlushingEventListener.flushEntities(AbstractFlushingEventListener.java:219) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.event.def.AbstractFlushingEventListener.flushEverythingToExecutions(AbstractFlushingEventListener.java:99) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.event.def.DefaultAutoFlushEventListener.onAutoFlush(DefaultAutoFlushEventListener.java:58) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.impl.SessionImpl.autoFlushIfRequired(SessionImpl.java:1185) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1709) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.impl.CriteriaImpl.list(CriteriaImpl.java:347) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]
    at org.hibernate.impl.CriteriaImpl.uniqueResult(CriteriaImpl.java:369) [hibernate-core-3.6.0.Final.jar:3.6.0.Final]

Does anyone know what is causing this exception to be thrown? Does Hibernate support a better way of accomplishing this?

Let me also preemptively explain why I'm inserting first and then selecting if and when that fails. This needs to work in a distributed environment, so I can't synchronize between the check for whether the record already exists and the insert. The easiest way to do this is to let the database handle the synchronization by checking for a constraint violation on every insert.

Mike Deck
  • What do you mean by "distributed environment"? Do you use a grid-based RDBMS? – vbence Apr 26 '11 at 22:22
  • @vbence by distributed I mean I have multiple clients on different machines running this code against a single centralized database. Two separate clients could attempt to insert the same record at the same time, in that case, the "first" one should win and be persisted and the other client should just return what has already been persisted. – Mike Deck Apr 27 '11 at 14:45
  • are you not really seeing the consequences of not breaking up the data correctly? – ThomasRS May 02 '11 at 00:38
  • @Thomas, I'm not sure what you mean by "breaking up the data correctly." – Mike Deck May 02 '11 at 14:52
  • Did you consider pre-processing your batch data so that there are no conflicts? Or simply process the conflicting parts first, in a single thread? – ThomasRS May 02 '11 at 19:21

9 Answers

13

I had a similar batch processing requirement, with processes running on multiple JVMs. The approach I took for this was as follows. It is very much like jtahlborn's suggestion. However, as vbence pointed out, if you use a NESTED transaction, when you get the constraint violation exception, your session is invalidated. Instead, I use REQUIRES_NEW, which suspends the current transaction and creates a new, independent transaction. If the new transaction rolls back it will not affect the original transaction.

I am using Spring's TransactionTemplate but I'm sure you could easily translate it if you do not want a dependency on Spring.

public T findOrCreate(final T t) throws InvalidRecordException {
   // 1) look for the record
   T found = findUnique(t);
   if (found != null)
     return found;
   // 2) if not found, start a new, independent transaction
   TransactionTemplate tt = new TransactionTemplate((PlatformTransactionManager)
                                            transactionManager);
   tt.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRES_NEW);
   try {
     found = (T) tt.execute(new TransactionCallback<T>() {
        public T doInTransaction(TransactionStatus status) {
           try {
              // 3) store the record in this new transaction
              return store(t);
           } catch (ConstraintViolationException e) {
              // another thread or process created this already, possibly
              // between 1) and 2)
              status.setRollbackOnly();
              return null;
           }
        }
     });
     // 4) if we failed to create the record in the second transaction, found will
     // still be null; however, this would happen only if another process
     // created the record. let's see what they made for us!
     if (found == null)
        found = findUnique(t);
   } catch (...) {
     // handle exceptions
   }
   return found;
}
Lawrence McAlpin
  • I'm surprised this works. I was under the impression that hibernate does not support nested transactions. – Mike Deck Apr 28 '11 at 20:25
  • Hibernate delegates to the underlying transaction manager. Spring provides the support for nested and suspended transactions. An EJB server would also support that via JTA. There are also some standalone JTA providers like Atomikos that are an option if you want to avoid both an EJB app server as well as Spring. – Lawrence McAlpin Apr 28 '11 at 21:34
  • @LawrenceMcAlpin Nice job! In my case though, in step 4 when `found` is `null`, `findUnique()` is returning `null` even if the record actually exists in database (since created by another thread). I had to implement another `TransactionTemplate` to read the record. Do you understand why? – sp00m Oct 09 '14 at 19:30
  • Mike Deck: I really wonder how this could work for you. As observed by @sp00m a record created in a different transaction can not be visible in another concurrent transaction (as soon as their point-in-time snapshots are created). See the discussion below Vlad Mihalcea's answer. – bgraves Nov 06 '17 at 20:39
11

You need to use UPSERT or MERGE to achieve this goal.

However, Hibernate does not offer support for this construct, so you need to use jOOQ instead.

private PostDetailsRecord upsertPostDetails(
        DSLContext sql, Long id, String owner, Timestamp timestamp) {
    sql
    .insertInto(POST_DETAILS)
    .columns(POST_DETAILS.ID, POST_DETAILS.CREATED_BY, POST_DETAILS.CREATED_ON)
    .values(id, owner, timestamp)
    .onDuplicateKeyIgnore()
    .execute();

    return sql.selectFrom(POST_DETAILS)
    .where(field(POST_DETAILS.ID).eq(id))
    .fetchOne();
}

Calling this method on PostgreSQL:

PostDetailsRecord postDetailsRecord = upsertPostDetails(
    sql, 
    1L, 
    "Alice",
    Timestamp.from(LocalDateTime.now().toInstant(ZoneOffset.UTC))
);

Yields the following SQL statements:

INSERT INTO "post_details" ("id", "created_by", "created_on") 
VALUES (1, 'Alice',  CAST('2016-08-11 12:56:01.831' AS timestamp))
ON CONFLICT  DO NOTHING;
    
SELECT "public"."post_details"."id",
       "public"."post_details"."created_by",
       "public"."post_details"."created_on",
       "public"."post_details"."updated_by",
       "public"."post_details"."updated_on"
FROM "public"."post_details"
WHERE "public"."post_details"."id" = 1

On Oracle and SQL Server, jOOQ will use MERGE while on MySQL it will use ON DUPLICATE KEY.

The concurrency mechanism is ensured by the row-level locking mechanism employed when inserting, updating, or deleting a record, which you can view in the following diagram:

(diagram: row-level locking taken while inserting, updating, or deleting a record)

Code available on GitHub.

Vlad Mihalcea
  • In jOOQ, depending on your database support (e.g. PostgreSQL), you can append `returning()` to the `INSERT` statement, and it will get you that record in a single statement. – Lukas Eder Nov 03 '17 at 13:11
  • @LukasEder Cool! I didn't know about the `RETURNING` clause in PostgreSQL. – Vlad Mihalcea Nov 03 '17 at 13:22
  • What if there are two concurrent transactions which both try to create the same record and afterwards (still inside the transaction) do some stuff with the newly created record? No problem inside the succeeding transaction, but in the transaction which ignores the duplicate key, the record created in the other transaction is not visible/selectable, I think. Am I wrong? How should this be handled? – bgraves Nov 03 '17 at 16:48
  • No conflicts, I agree, but **inside** the second Tx the entry which was created in the first Tx is not visible/selectable and therefore cannot be used **inside** the second Tx. – bgraves Nov 03 '17 at 22:16
  • The insert precedes the select. If the row is locked by Tx1, Tx2 will wait. When Tx1 commits, Tx2 will resume. So, the record becomes visible to Tx2 due to row-level serializability. – Vlad Mihalcea Nov 03 '17 at 22:28
  • Hi @VladMihalcea, I created a small [repo](https://github.com/drahkrub/onDuplicateKeyUpdate). At least when using mysql 5.5.x the record created in Tx1 is not visible in Tx2 **such that it can be selected** (in Tx2). Or am I doing something wrong? – bgraves Nov 05 '17 at 22:07
  • I've done something terribly wrong, preparing to explain it... ;-) – bgraves Mar 15 '19 at 20:49
8

Two solutions come to mind:

That's what TABLE LOCKS are for

Hibernate does not support table locks, but this is the situation where they come in handy. Fortunately, you can use native SQL through Session.createSQLQuery(). For example (on MySQL):

// no access to the table for any other clients
session.createSQLQuery("LOCK TABLES foo WRITE").executeUpdate();

// safe zone
Foo foo = (Foo) session.createCriteria(Foo.class).add(eq("name", name)).uniqueResult();
if (foo == null) {
    foo = new Foo();
    foo.setName(name);
    session.save(foo);
}

// releasing locks
session.createSQLQuery("UNLOCK TABLES").executeUpdate();

This way when a session (client connection) gets the lock, all the other connections are blocked until the operation ends and the locks are released. Read operations are also blocked for other connections, so needless to say use this only in case of atomic operations.

What about Hibernate's locks?

Hibernate uses row-level locking. We cannot use it directly, because we cannot lock non-existent rows. But we can create a dummy table with a single record, map it to the ORM, then use SELECT ... FOR UPDATE style locks on that object to synchronize our clients. Basically we only need to be sure that no other clients (running the same software, with the same conventions) will do any conflicting operations while we are working.

// begin transaction
Transaction transaction = session.beginTransaction();

// blocks while any other client holds the lock
session.load("dummy", 1, LockOptions.UPGRADE);

// virtual safe zone
Foo foo = (Foo) session.createCriteria(Foo.class).add(eq("name", name)).uniqueResult();
if (foo == null) {
    foo = new Foo();
    foo.setName(name);
    session.save(foo);
}

// ends transaction (releasing locks)
transaction.commit();

Your database has to know the SELECT ... FOR UPDATE syntax (Hibernate is going to use it), and of course this only works if all your clients have the same convention (they need to lock the same dummy entity).

vbence
  • This is good advice and answers the question the way I originally worded it, but technically still won't solve my problem. I've updated the question to be clear about the root issue I'm facing. Basically I need to be able to import multiple records within a single transaction, so waiting until the whole transaction is committed to release the lock sort of defeats the purpose of using multiple client processes in the first place. I'm starting to think I'll need to use raw JDBC to do this if I really want to stay with this architecture. – Mike Deck Apr 28 '11 at 14:50
  • You can create temporary tables based on the `CREATE TABLE` of the original. A client can have the ownership and do its job peacefully. Another process or stored procedure can then copy the records to the live DB. Alternatively (if you are working with many tables) temporary databases can be created too. – vbence Apr 28 '11 at 17:50
  • Instead of using explicit table locks, which would depend on native SQL syntax and features, similar behaviour could be achieved by setting the JDBC transaction isolation level to TRANSACTION_SERIALIZABLE. The advantage would be that the tables to lock would then be determined by the DB itself, which avoids explicitly specifying the tables to lock. To do this, use Session.doWork and connection.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE) – Christian K. Dec 05 '14 at 14:00
2

The Hibernate documentation on transactions and exceptions states that all HibernateExceptions are unrecoverable and that the current transaction must be rolled back as soon as one is encountered. This explains why the code above does not work. Ultimately you should never catch a HibernateException without exiting the transaction and closing the session.

The only real way to accomplish this, it would seem, would be to manage the closing of the old session and the opening of a new one within the method itself. Implementing a findOrCreate method that can participate in an existing transaction and is safe in a distributed environment would seem to be impossible using Hibernate, based on what I have found.
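To make that concrete, here is a rough sketch of what such a self-contained variant might look like. It is only a sketch: it assumes a sessionFactory field (hypothetical) from which the method opens and closes its own Session, which is exactly why it cannot participate in an enclosing transaction.

// A minimal sketch, assuming org.hibernate imports (Session, Transaction,
// ConstraintViolationException, Restrictions) and an injected sessionFactory.
// Because it manages its own Session, it cannot take part in an enclosing
// transaction, which is the limitation described above.
public Foo findOrCreate(String name) {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    try {
        Foo foo = new Foo();
        foo.setName(name);
        session.save(foo);          // IDENTITY id generation forces an immediate insert
        tx.commit();
        return foo;
    } catch (ConstraintViolationException e) {
        tx.rollback();              // the failed session is unusable from here on
        session.close();
        session = sessionFactory.openSession();
        tx = session.beginTransaction();
        Foo existing = (Foo) session.createCriteria(Foo.class)
                .add(Restrictions.eq("name", name))
                .uniqueResult();
        tx.commit();
        return existing;
    } finally {
        if (session.isOpen()) {
            session.close();
        }
    }
}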

Mike Deck
2

A couple of people have mentioned different parts of the overall strategy. Assuming that you generally expect to find an existing object more often than you create a new one:

  • search for existing object by name. if found, return
  • start nested (separate) transaction
    • try to insert new object
    • commit nested transaction
  • catch any failure from nested transaction, if anything but constraint violation, re-throw
  • otherwise search for existing object by name and return it

Just to clarify, as pointed out in another answer, the "nested" transaction is actually a separate transaction (many databases don't even support true nested transactions).
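A rough, Spring-free sketch of that flow, assuming a sessionFactory field from which a second, independent Session can be opened for the "nested" insert (the findByName helper is hypothetical):

// Hypothetical sketch of the strategy above without Spring: the "nested"
// transaction is simply a second, independent Session and Transaction.
public Foo findOrCreate(String name) {
    Foo existing = findByName(session, name);      // 1) search in the outer session
    if (existing != null) {
        return existing;
    }
    Session tempSession = sessionFactory.openSession();
    Transaction tx = tempSession.beginTransaction();
    try {
        Foo foo = new Foo();
        foo.setName(name);
        tempSession.save(foo);                      // 2) try the insert in the separate transaction
        tx.commit();
    } catch (ConstraintViolationException e) {
        tx.rollback();                              // someone else won the race; fall through
    } catch (RuntimeException e) {
        tx.rollback();                              // anything else is a real failure
        throw e;
    } finally {
        tempSession.close();
    }
    return findByName(session, name);               // 3) re-read in the outer session
}

private Foo findByName(Session s, String name) {
    return (Foo) s.createCriteria(Foo.class)
            .add(Restrictions.eq("name", name))
            .uniqueResult();
}

As noted in the comments on the answer above, whether the outer session can actually see a row committed by another transaction depends on the isolation level in use.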

jtahlborn
  • It is still possible that a new record is inserted between the first two steps, so the problem remains basically the same as in the original post. If you get a constraint violation exception your session is invalidated. – vbence Apr 28 '11 at 08:30
  • @vbence - yes, i accounted for that. when you get a constraint violation you swallow that and load the new object in your outer session. this is the only way to do this in a distributed environment. I have done this very thing using hibernate, so i know it works. – jtahlborn Apr 28 '11 at 17:48
  • @vbence - you are using a nested transaction/session, which does not affect the outer one. – jtahlborn Apr 28 '11 at 18:21
  • It was not obvious based on your flowchart. – vbence Apr 29 '11 at 10:44
2

The solution is in fact really simple. First perform a select using your name value. If a result is found, return it. If not, create a new one. If the creation fails (with an exception), it is because another client added this very same value between your select and your insert statements; it is then logical that you get an exception. Catch it, roll back your transaction and run the same code again. Because the row now exists, the select statement will find it and you'll return your object.

You can find an explanation of the strategies for optimistic and pessimistic locking with Hibernate here: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/transactions.html
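As a sketch only, the retry loop described above could look something like this (hypothetical; each attempt uses a fresh Session via an assumed sessionFactory, since a constraint violation invalidates the current one, so this variant does not keep a whole file in a single transaction as the question requires):

// Sketch of the select-then-insert-with-retry approach. Each attempt gets a
// fresh Session/Transaction because a constraint violation invalidates the
// current one.
public Foo findOrCreate(String name) {
    while (true) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            Foo foo = (Foo) session.createCriteria(Foo.class)
                    .add(Restrictions.eq("name", name))
                    .uniqueResult();
            if (foo == null) {
                foo = new Foo();
                foo.setName(name);
                session.save(foo);
            }
            tx.commit();
            return foo;
        } catch (ConstraintViolationException e) {
            tx.rollback();    // lost the race; on the next pass the select will find the row
        } finally {
            session.close();
        }
    }
}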

Nicolas Bousquet
1

Well, here's one way to do it - but it's not appropriate for all situations.

  • In Foo, remove the "unique = true" attribute on name. Add a timestamp that gets updated on every insert.
  • In findOrCreate(), don't bother checking if the entity with the given name already exists - just insert a new one every time.
  • When looking up Foo instances by name, there may be 0 or more with a given name, so you just select the newest one.

The nice thing about this method is that it doesn't require any locking, so everything should run pretty fast. The downside is that your database will be littered with obsolete records, so you may have to do something somewhere else to deal with them. Also, if other tables refer to Foo by its id, then this will screw up those relations.
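For illustration, the lookup side of this approach might look like the following sketch, assuming Foo's name is no longer unique and that it gained a createdAt timestamp property as described (both are assumptions, not part of the original mapping):

// Hypothetical lookup for the "always insert, pick the newest" scheme.
public Foo findNewestByName(String name) {
    return (Foo) session.createCriteria(Foo.class)
            .add(Restrictions.eq("name", name))
            .addOrder(Order.desc("createdAt"))   // assumed timestamp property
            .setMaxResults(1)
            .uniqueResult();
}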

Mike Baranczak
0

Maybe you should change your strategy: first find the user with that name and create it only if it does not exist.

Iogui
  • I explained why I chose the strategy I did at the end of the question. Doing the check before the insert does not work in a distributed environment where you can't synchronize across the two operations. – Mike Deck Feb 16 '11 at 23:07
  • Maybe you don't have to synchronize. If you do the search first, you will only create the user if it does not exist, and if someone creates one with this name in the meantime, the database will handle it for you: when you save, you will get an exception, so you just do something about it. – Iogui Feb 16 '11 at 23:22
  • @Iogui, when you say "you will get an exception, so you just do something about it", what do you propose I do? At that point I would then need to run another select to retrieve that entry out of the db, except that after the exception is thrown the session is dead and I'm required to roll the transaction back. In this scenario I'm back to square one solving the exact same problem I asked about in the first place. – Mike Deck Mar 17 '11 at 22:10
0

I would try the following strategy:

A. Start a main transaction (at time 1)
B. Start a sub-transaction (at time 2)

Now, any object created after time 1 will not be visible in the main transaction. So you do the following:

C. Create new race-condition object, commit sub-transaction
D. Handle conflict by starting a new sub-transaction (at time 3) and getting the object from a query (the sub-transaction from point B is now out-of-scope).

From C or D, only return the object's primary key, and then use EntityManager.getReference(..) to obtain the object you will be using in the main transaction. Alternatively, start the main transaction after D; it is not totally clear to me how many race conditions you will have within your main transaction, but the above should allow for n repetitions of B-C-D in a 'large' transaction.

Note that you might want to do multi-threading (one thread per CPU) and then you can probably reduce this issue considerably by using a shared static cache for these kind of conflicts - and point 2 can be kept 'optimistic', i.e. not doing a .find(..) first.

Edit: For a new transaction, you need an EJB interface method call annotated with transaction type REQUIRES_NEW.

Edit: Double check that the getReference(..) works as I think it does.
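For the REQUIRES_NEW part, a minimal EJB sketch (the bean and method names are hypothetical, and it assumes Foo exposes a getId() getter; it is shown only to illustrate the annotation):

// Hypothetical stateless bean: calling createFoo() through the EJB proxy
// suspends the caller's transaction, runs the insert in a brand new one,
// and returns only the generated primary key.
@Stateless
public class FooCreator {

    @PersistenceContext
    private EntityManager em;

    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public int createFoo(String name) {
        Foo foo = new Foo();
        foo.setName(name);
        em.persist(foo);   // IDENTITY generation assigns the id at insert time
        return foo.getId();
    }
}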

ThomasRS