
I have something like this:

# pseudo code
query = "INSERT INTO my_table (col1, col2) VALUES (%s, %s);"

while True:
    result = make_request_for_data(my_http_request)

    if result is None:  # breaking mechanism: no more data to fetch
        break

    for data in result:
        cursor.execute(query, data)

    connection.commit()  # should this be inside the while loop or outside?

cursor.close()
connection.close()

Are there performance benefits to one placement over the other, and why? I will do some timings, but I wanted to understand why one would be better than the other, if that is the case.

turnip
  • It'll be inefficient this way. A better approach would be to break your data set into batches: INSERT N records, then commit that batch. You need to understand what the database is doing for you: it creates a rollback segment that you commit each time. The batch shouldn't be one record, and it shouldn't be the whole set if the number of records is large (see the batching sketch after this comment thread). – duffymo Oct 05 '17 at 15:37
  • I suggest concatenating all the INSERT commands together inside the while loop and executing them against the database once; of all the operations involved, the round trips to the database will be the most expensive part. – Serjik Oct 05 '17 at 15:38
  • @Serjik to clarify, are you suggesting I concatenate my queries, such that the query string is multiple INSERT statements, and then run `cursor.execute` once? At each iteration of the loop `result` holds a maximum of 1000 records - is having 1000 INSERT statements a concern? – turnip Oct 05 '17 at 15:57
  • @Petar Then, with or without a transaction, you should keep a counter and commit to the database when it reaches a certain number (consider 100*N). That way you hit the database 100 times less often. – Serjik Oct 05 '17 at 16:11
  • @Petar Also, this question should be helpful to you: https://stackoverflow.com/questions/2741919/can-i-ask-postgresql-to-ignore-errors-within-a-transaction – Serjik Oct 05 '17 at 16:16
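
To make the batching suggestions above concrete, here is a minimal sketch, assuming a DB-API connection/cursor (e.g. psycopg2) and reusing the question's hypothetical make_request_for_data and my_http_request names:

# Sketch: commit every BATCH_SIZE rows rather than every row or only once.
BATCH_SIZE = 1000
query = "INSERT INTO my_table (col1, col2) VALUES (%s, %s);"

pending = 0
while True:
    result = make_request_for_data(my_http_request)
    if result is None:
        break

    for data in result:
        cursor.execute(query, data)
        pending += 1
        if pending >= BATCH_SIZE:
            connection.commit()  # one commit per batch, not per row
            pending = 0

connection.commit()  # flush any rows left over from the final batch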

1 Answer


That depends, and what it depends upon is considerably more important than performance. Ask yourself:

Is the entire loop a single atomic business operation, or is each iteration of the loop a single atomic business operation?

That is, let's say you're looping over 10 records and record #5 fails in some way. Should records 1-4 still be committed? If so, commit inside the loop. If not, commit outside the loop.

Changing where you commit the data does affect performance, but more significantly it affects the logic of the system being implemented.
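
As an illustration, using the question's own hypothetical names, the only textual difference is where the commit sits, but the transactional meaning changes entirely:

# Commit inside the loop: each iteration is its own transaction,
# so if batch #5 fails, batches #1-4 stay committed.
while True:
    result = make_request_for_data(my_http_request)
    if result is None:
        break
    for data in result:
        cursor.execute(query, data)
    connection.commit()

# Commit outside the loop: the whole run is one transaction,
# so a failure anywhere means nothing is committed.
while True:
    result = make_request_for_data(my_http_request)
    if result is None:
        break
    for data in result:
        cursor.execute(query, data)
connection.commit()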

David
  • `result` will hold a maximum of 1000 records on each iteration. If one fails, the others are still valid. So I guess each iteration is atomic. – turnip Oct 05 '17 at 15:41
  • @Petar: If each iteration is atomic, then it sounds like there's little alternative to committing inside the loop. Performance may suffer, but that's the cost of implementing the intended logic. Improving performance at that point would mean investigating a faster commit mechanism, such as committing to a message queue instead of the database and having a separate process read the queue and commit to the database. But that of course depends on what the end result of this process needs to be and why performance is a concern - whether this needs to return to a UI quickly, for example. – David Oct 05 '17 at 15:43
  • This isn't tied to a UI, so it does not have to be blazing fast. I am currently using it to just push some data into a test database. I suppose that if I want to improve performance from this point on, I should look at a bulk insert (?) – turnip Oct 05 '17 at 15:48
  • @Petar: A bulk insert would likely be faster, yes. It would also turn the entire operation into a single atomic operation, which, from that description, sounds like it would be an acceptable change. If so, it might be worth a look if performance is indeed a pain point (see the sketch after this thread). – David Oct 05 '17 at 15:50
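
For reference, a sketch of the bulk insert mentioned in the last two comments, assuming the driver is psycopg2 (an assumption; the question only shows DB-API-style %s placeholders):

from psycopg2.extras import execute_values  # assumes psycopg2

rows = [(1, 'a'), (2, 'b')]  # e.g. one batch of fetched records

# One multi-row INSERT statement instead of one statement per row.
execute_values(
    cursor,
    "INSERT INTO my_table (col1, col2) VALUES %s",
    rows,
)
connection.commit()  # the whole batch succeeds or fails together

The portable DB-API alternative is `cursor.executemany(query, rows)`, but psycopg2 implements it as a loop of single-row statements, so a genuine multi-row INSERT like the above is usually faster.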