
I have a question about opening a transaction at the partition level. If I use the JDBC connector to write to a database (Postgres), will partition-specific writes at a worker node be transactionally safe? That is:

If a worker node goes down while writing the data, will the rows already written for that partition / worker node be rolled back?
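For context, here is a minimal sketch of the kind of write being asked about (the URL, credentials, `target_table`, and the Parquet source are all placeholders). Each partition is written by its own task over its own JDBC connection:

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("jdbc-write").getOrCreate()
val df = spark.read.parquet("/data/input") // example source

val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "secret")
props.setProperty("driver", "org.postgresql.Driver")

// Each task writes the rows of one partition over its own connection.
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://db-host:5432/mydb", "target_table", props)
```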

  • Application level -> no, node (worker / executor) level -> no, executor thread level -> yes. – 10465355 Feb 13 '20 at 13:27
  • @10465355saysReinstateMonica I cannot find the commit aspect on the JDBC side in the docs. I have assumed that there is always one commit per foreach... or partition? Can you point out where I can find that pls? – thebluephantom Feb 13 '20 at 13:39
  • [doc](https://github.com/apache/spark/blob/a834dba120e3569e44c5e4b9f8db9c6eef58161b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L596-L598) and [source](https://github.com/apache/spark/blob/a834dba120e3569e44c5e4b9f8db9c6eef58161b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L616-L742), though as you'll see, a lot depends on the nitty-gritty details of the configuration and what is supported for a particular provider. – 10465355 Feb 13 '20 at 14:00
  • I wouldn't depend on the transaction. Maybe for SaveMode.Overwrite – Salim Feb 13 '20 at 19:51

1 Answer


There is a transaction boundary per partition (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L588), so if a task dies mid-write, the uncommitted rows for that partition are rolled back.

But if a failure happens after the commit and before the task is marked as SUCCESS, for example a network issue or a timeout, the task can be retried and you might still get duplicate writes.
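In simplified form, the per-partition logic behind that link looks roughly like the sketch below (the actual source also negotiates isolation levels, batch sizes, and dialect-specific setters, so treat this as an outline rather than the real implementation):

```scala
import java.sql.{Connection, DriverManager}

// Simplified sketch of Spark's per-partition JDBC write: one
// transaction per partition, rolled back if the task fails mid-write.
def savePartition(url: String, insertSql: String, rows: Iterator[Seq[AnyRef]]): Unit = {
  val conn: Connection = DriverManager.getConnection(url)
  var committed = false
  try {
    conn.setAutoCommit(false) // open the partition-level transaction
    val stmt = conn.prepareStatement(insertSql)
    try {
      rows.foreach { row =>
        row.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v) }
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally {
      stmt.close()
    }
    conn.commit() // all-or-nothing for this partition
    committed = true
  } finally {
    // A crashed executor never reaches commit, so the database discards the
    // uncommitted rows; an explicit failure is rolled back here before retry.
    if (!committed) conn.rollback()
    conn.close()
  }
}
```

The risky window is between `conn.commit()` and the task reporting SUCCESS: if the executor dies or times out there, Spark retries the task and the partition is inserted a second time, which is why the write should be idempotent if duplicates matter.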

Sebastian Piu