I have a table called cached_projects in my PostgreSQL 9.0.x DB, which I primarily access via a Rails app. Its schema looks like this:
create_table "cached_projects", :force => true do |t|
t.string "name", :null => false
t.datetime "created_at", :null => false
t.datetime "updated_at", :null => false
t.boolean "localization", :default => false, :null => false
t.integer "base_project_id"
end
The table is populated by a Rake task which runs every N minutes and does the following:
- Queries all project records from a SOAP web service (a "project" in this context consists of just a name (string) and a couple of boolean and integer fields).
- Syncs the list of projects in the DB with the SOAP results, discarding any records no longer present in the SOAP results, and adding any new records found.
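For reference, the task is shaped roughly like this (fetch_projects_from_soap and sync_cached_projects are placeholder names standing in for my actual fetching and syncing code, and each parsed SOAP record ends up as a simple hash):

namespace :projects do
  desc "Sync cached_projects with the latest SOAP results"
  task :sync => :environment do
    # each element looks like { :name => "...", :localization => false, :base_project_id => 123 }
    remote_projects = fetch_projects_from_soap
    sync_cached_projects(remote_projects)   # one of the two approaches described below
  end
end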
What's the most efficient way to do this periodic sync? It's important that the rows in the DB match exactly the records given in the SOAP results, with no extras.
I can think of two solutions, but I'm unsure which will be faster (and performance is a concern, since there are tens of thousands of records and I'd like to sync as frequently as possible):
During each sync, DELETE all rows in cached_projects and INSERT a fresh row for each project found via the SOAP service. This will accomplish the goal of having exactly the same set of data, but how expensive will it be to delete ~50,000 rows each time in the DB, assuming the vast majority of them will be unchanged? Are there downsides to having this much "churn" in the DB?
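In code, that first idea would look roughly like this (a sketch only; CachedProject is the ActiveRecord model for this table, remote_projects stands in for the parsed SOAP results, and wrapping it in a transaction is my own assumption so the table never appears empty to other readers mid-sync):

CachedProject.transaction do
  CachedProject.delete_all                     # wipe every cached row
  remote_projects.each do |proj|
    CachedProject.create!(
      :name            => proj[:name],
      :localization    => proj[:localization],
      :base_project_id => proj[:base_project_id]
    )
  end
end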
During each sync, SELECT all rows in cached_projects, storing them in a temporary Hash variable. Loop over the SOAP records and, for each one, check whether it's already in the DB (using the temporary Hash), keeping a list of IDs of the DB records we find so they can remain in the DB. After comparing everything, INSERT rows for any new records and DELETE rows for records that should no longer be in the table. This is my current solution, and while it also exactly mirrors the data, comparing all the records within the Rake task is pretty compute-intensive, taking around 3-5 minutes per sync. (The very first sync, when the DB table is empty, is faster, but subsequent syncs must compare against all the rows, which is slower.)
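Simplified, my current implementation of that looks something like this (same stand-in names as above; I'm keying the hash on the project name purely for illustration):

existing = CachedProject.all.index_by(&:name)  # name => CachedProject record
keep_ids = []

remote_projects.each do |proj|
  if (record = existing[proj[:name]])
    keep_ids << record.id                      # already cached, leave it alone
  else
    keep_ids << CachedProject.create!(
      :name            => proj[:name],
      :localization    => proj[:localization],
      :base_project_id => proj[:base_project_id]
    ).id
  end
end

# anything the SOAP results no longer mention gets removed
if keep_ids.empty?
  CachedProject.delete_all
else
  CachedProject.delete_all(["id NOT IN (?)", keep_ids])
end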
I'm open to other suggestions, or improvements to either of these ideas. So far I've been looking at solutions within my application code (the Ruby-based rake task), but I'm also interested in ideas that push more logic into the DB itself, since I'm a lot less familiar with that side of things.