
I'm trying to insert some data into a MariaDB database. I have two tables, and I need to insert rows (using a batch insert) into the first table and then use the IDs of the newly inserted rows to perform a second batch insert into the second table.

I'm doing so in Scala using Alpakka Slick. For the purpose of this question, let's call the main table `tests` and the second one `dependent`.

At the moment, my algorithm is as follows:

  1. Insert the rows into tests
  2. Fetch the ID of the first row in the batch using SELECT LAST_INSERT_ID();
  3. Knowing the ID of the first row and the number of rows in the batch, compute by hand the other IDs and use them for the insertion in the second table
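Step 3 in isolation boils down to something like this (a minimal sketch; `chunkIds` is a hypothetical helper, and it assumes consecutive auto-increment values, i.e. an auto-increment step of 1):

```scala
// Derive the IDs of a whole batch from the ID of its first row,
// assuming the generated values are consecutive.
def chunkIds(firstId: Long, chunkSize: Int): Seq[Long] =
  firstId until (firstId + chunkSize)
```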

This works pretty well with only one connection at a time. However, I'm trying to simulate a scenario where multiple writers insert simultaneously. To do that, I'm using Scala parallel collections and Akka Stream Sources as follows:

// three sources of 10 random Strings each
val sources = Seq.fill(3)(Source(Seq.fill(10)(Random.alphanumeric.take(3).mkString))).zipWithIndex
val parallelSources: ParSeq[(Source[String, NotUsed], Int)] = sources.par

parallelSources.map { case (source, i) =>
  source
    .grouped(ChunkSize) // performs batch inserts of a given size
    .via(insert(i))
    .zipWithIndex
    .runWith(Sink.foreach { case (_, chunkIndex) => println(s"Chunk $chunkIndex of source $i  done") })
}

I'm adding an index to each Source just to use it as a prefix in the data I write to the DB.

Here's the code of the insert Flow I've written so far:

def insert(srcIndex: Int): Flow[Seq[String], Unit, NotUsed] = {
  implicit val insertSession: SlickSession = slickSession
  system.registerOnTermination(insertSession.close()) // by-name argument; passing a Function0 here would never actually close the session

  Flow[Seq[String]]
    .via(Slick.flowWithPassThrough { chunk =>
      (for {
        // insert data into `tests`
        _ <- InsTests ++= chunk.map(v => TestProj(s"source$srcIndex-$v"))
        // fetch last insert ID and connection ID
        queryResult <- sql"SELECT CONNECTION_ID(), LAST_INSERT_ID();".as[(Long, Long)].headOption
        _ <- queryResult match {
          case Some((connId, firstIdInChunk)) =>
            println(s"Source $srcIndex, last insert ID $firstIdInChunk, connection $connId")
            // compute IDs by hand and write to `dependent`
            val depValues = Seq.fill(chunk.size)(s"source$srcIndex-${Random.alphanumeric.take(6).mkString}")
            val depRows =
              (firstIdInChunk until (firstIdInChunk + chunk.size)) // one ID per row actually inserted
                .zip(depValues)
                .map { case (id, value) => DependentProj(id, value) }

            InsDependent ++= depRows
          case None => DBIO.failed(new Exception("..."))
        }
      } yield ()).transactionally
    })
}

Where InsTests and InsDependent are Slick's TableQuery objects. slickSession creates a new session for each different insert and is defined as follows:

private def slickSession = {
  val db = Database.forURL(
    url = "jdbc:mariadb://localhost:3306/test",
    user = "root",
    password = "password",
    executor = AsyncExecutor(
      name = "executor",
      minThreads = 20,
      maxThreads = 20,
      queueSize = 1000,
      maxConnections = 20
    )
  )
  val profile = slick.jdbc.MySQLProfile
  SlickSession.forDbAndProfile(db, profile)
}

The problem is that the last insert IDs returned by the second step of the algorithm overlap. Every run of this app would print something like:

Source 2, last insert ID 6, connection 66
Source 1, last insert ID 5, connection 68
Source 0, last insert ID 7, connection 67
Chunk 0 of source 0  done
Chunk 0 of source 2  done
Chunk 0 of source 1  done
Source 2, last insert ID 40, connection 70
Source 0, last insert ID 26, connection 69
Source 1, last insert ID 27, connection 71
Chunk 1 of source 2  done
Chunk 1 of source 1  done
Chunk 1 of source 0  done

Where it looks like the connection is a different one for each Source, but the IDs overlap (Source 0 sees 7, Source 1 sees 5, Source 2 sees 6; since each source then derives a whole range of IDs from its starting value, the ranges collide). It is correct that IDs start from 5, as I'm adding 4 dummy rows right after creating the tables (not shown in this question's code). As a result, I see multiple rows in `dependent` with the same `tests.id`, which shouldn't happen.

It's my understanding that last insert IDs refer to a single connection. How is it possible that three different connections see overlapping IDs, considering that the entire flow is wrapped in a transaction (via Slick's transactionally)?
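Note that, for a multi-row insert, LAST_INSERT_ID() is indeed per-connection and returns the ID generated for the first row of that statement, which is why the algorithm derives the remaining IDs by hand. A minimal illustration (assuming, for the sake of the example, a `value` column on `tests`):

```sql
-- LAST_INSERT_ID() is per-connection and, for a multi-row INSERT,
-- returns the ID generated for the FIRST row of that statement.
INSERT INTO tests (value) VALUES ('a'), ('b'), ('c');
SELECT LAST_INSERT_ID();  -- first of the three IDs just generated
```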

This happens with `innodb_autoinc_lock_mode=1`. As far as I've seen, it doesn't happen with `innodb_autoinc_lock_mode=0`, which makes sense, since in that mode InnoDB holds the table-level AUTO-INC lock on `tests` until the whole batch insert statement terminates.
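For completeness, the current mode can be checked with a plain query; it is a startup option, so changing it requires restarting the server with the corresponding setting (e.g. in my.cnf):

```sql
-- 0 = traditional, 1 = consecutive, 2 = interleaved
SELECT @@innodb_autoinc_lock_mode;
```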

UPDATE after Georg's answer: due to other constraints in the project, I'd like the solution to be compatible with MariaDB 10.4, which, as far as I understand, doesn't support INSERT ... RETURNING. Additionally, Slick's ++= operator has poor support for returning, as also reported here: I tested it on both MariaDB 10.4 and 10.5, and, according to the query logs, Slick executes individual INSERT INTO statements instead of a single batch one. In my case this is not acceptable, as I'm planning to write several chunks of rows in a streaming fashion.

While I also understand that assuming an auto-increment step of 1 is not ideal, we do have control over the production setup and do not use multi-master replication.


1 Answer


You cannot generate subsequent values based on LAST_INSERT_ID():

  1. There might be a concurrent transaction running at the same time that was rolled back, so there will be gaps in your auto-incremented IDs.

  2. Deriving the other IDs by incrementing the LAST_INSERT_ID() value will not work, since the step between consecutive auto-increment values is the session variable @@auto_increment_increment, which is not necessarily 1 (especially in multi-master replication).
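Point 2 can be made concrete with a small sketch (a hypothetical helper; in practice the `increment` argument would be read from SELECT @@auto_increment_increment):

```scala
// Successive auto-increment values are spaced by @@auto_increment_increment,
// which is not necessarily 1 (e.g. with multi-master replication).
def chunkIdsWithIncrement(firstId: Long, chunkSize: Int, increment: Long): Seq[Long] =
  (0 until chunkSize).map(i => firstId + i * increment)
```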

Instead, you should use RETURNING to get the IDs of the inserted rows:

MariaDB [test]> create table t1 (a int not null auto_increment primary key);
Query OK, 0 rows affected (0,022 sec)

MariaDB [test]> insert into t1 (a) values (1),(3),(NULL), (NULL) returning a;
+---+
| a |
+---+
| 1 |
| 3 |
| 4 |
| 5 |
+---+
4 rows in set (0,006 sec)
Georg Richter
  • Thanks for replying! What you said is very true. I went down that road in my tests, but that turned out to be a dead end. I updated the question with a few other details about this. It's interesting what you said about the transactions though. As far as I understand, `LAST_INSERT_ID()` should be tied to a given connection and it should not suffer from concurrency issues, especially with `innodb_autoinc_lock_mode` set to either `0` or `1`. As a matter of fact, the IDs look correct if I put a `WRITE` lock on `tests`, but, still, I don't understand why it wouldn't work without locks. – madipi Sep 09 '22 at 08:03