I have a table called cached_projects in my PostgreSQL 9.0.x DB, which I primarily access via a Rails app. Its schema looks like this:
create_table "cached_projects", :force => true do |t|
t.string "name", :null => false
t.datetime "created_at", :null => false
t.datetime "updated_at", :null => false
t.boolean "localization", :default => false, :null => false
t.integer "base_project_id"
end
The table is populated by a Rake task which runs every N minutes and does the following:
- Queries all project records from a SOAP web service (a "project" in this context consists of just a name (string) and a couple of boolean and integer fields).
- Syncs the list of projects in the DB with the SOAP results, discarding any records no longer present in the SOAP results, and adding any new records found.
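For reference, the task is shaped roughly like this (fetch_projects_from_soap and sync_cached_projects are placeholder names standing in for my actual fetching and syncing code, and each parsed SOAP record ends up as a simple hash):

namespace :projects do
  desc "Sync cached_projects with the latest SOAP results"
  task :sync => :environment do
    # each element looks like { :name => "...", :localization => false, :base_project_id => 123 }
    remote_projects = fetch_projects_from_soap
    sync_cached_projects(remote_projects)   # one of the two approaches described below
  end
end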
What's the most efficient way to do this periodic sync? It's important that the rows in the DB match exactly the records given in the SOAP results, with no extras.
I can think of two solutions, but I'm unsure which will be faster (and performance is a concern, since there are tens of thousands of records and I'd like to sync as frequently as possible):
During each sync, DELETE all rows in cached_projects and INSERT a fresh row for each project found via the SOAP service. This will accomplish the goal of having exactly the same set of data, but how expensive will it be to delete ~50,000 rows each time in the DB, assuming the vast majority of them will be unchanged? Are there downsides to having this much "churn" in the DB?
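In code, that first idea would look roughly like this (a sketch only; CachedProject is the ActiveRecord model for this table, remote_projects stands in for the parsed SOAP results, and wrapping it in a transaction is my own assumption so the table never appears empty to other readers mid-sync):

CachedProject.transaction do
  CachedProject.delete_all                     # wipe every cached row
  remote_projects.each do |proj|
    CachedProject.create!(
      :name            => proj[:name],
      :localization    => proj[:localization],
      :base_project_id => proj[:base_project_id]
    )
  end
end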
During each sync, SELECT all rows in cached_projects, storing them in a temporary Hash variable. Loop over the SOAP records and, for each one, check whether it's already in the DB (using the temporary Hash), keeping a list of IDs of the DB records we find so they can remain in the DB. After comparing everything, INSERT rows for any new records and DELETE rows for records that should no longer be in the table. This is my current solution, and while it also exactly mirrors the data, comparing all the records within the Rake task is pretty compute-intensive, taking around 3-5 minutes per sync. (The very first sync, when the DB table is empty, is faster, but subsequent syncs must compare against all the rows, which is slower.)
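Simplified, my current implementation of that looks something like this (same stand-in names as above; I'm keying the hash on the project name purely for illustration):

existing = CachedProject.all.index_by(&:name)  # name => CachedProject record
keep_ids = []

remote_projects.each do |proj|
  if (record = existing[proj[:name]])
    keep_ids << record.id                      # already cached, leave it alone
  else
    keep_ids << CachedProject.create!(
      :name            => proj[:name],
      :localization    => proj[:localization],
      :base_project_id => proj[:base_project_id]
    ).id
  end
end

# anything the SOAP results no longer mention gets removed
if keep_ids.empty?
  CachedProject.delete_all
else
  CachedProject.delete_all(["id NOT IN (?)", keep_ids])
end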
I'm open to other suggestions, or improvements to either of these ideas. So far I've been looking at solutions within my application code (the Ruby-based rake task), but I'm also interested in ideas that push more logic into the DB itself, since I'm a lot less familiar with that side of things.