
I have a large XML document (~10 MB) that contains data. I need to parse this XML and go through my database to update it. There are 3 cases:

  • New data is present in the XML since last time - insert it into the DB
  • An item has changed - update it in the DB
  • An item has been removed from the XML since last time - mark it as disabled in the DB

If data for an item hasn't changed since the last time I checked the XML, take no action. Each item in the XML has its own ID so I can retrieve it from the database easily.

Iterating through the items one by one like this is really slow, but I need the relational aspects of a database (the XML represents items that users can buy, so I can't generate new IDs for the items and just bulk load them).

What would be the best way of implementing this, since looping through the XML and looking up each item individually is really slow?

This is done using PHP with a MySQL database.

== edit ==

The basic loop of this process is as follows:

while ($data = get_next_bit_of_data_from_xml())
{
     // this next line is the slow part
     if ($object = get_object_from_database($data['id']))
     {
           // item already exists: update $object from $data
           // ...

           // persist
           $object->save();
     }
     else
     {
           // item is new: build a new object and persist it
           // ...
     }
}

The script is so slow because I can't use any kind of bulk operation like LOAD DATA INFILE: for each entry in the XML I need to check whether the object already exists.

So I'm really wondering if anyone can suggest another approach that will allow me to bulk load data while maintaining referential integrity with existing data.

Paul J

2 Answers


You can use INSERT ... ON DUPLICATE KEY UPDATE (see the MySQL manual). First update all records to be disabled, then run an INSERT ... ON DUPLICATE KEY UPDATE query for the data from the XML to either insert new rows or update existing rows with the new data, setting them back to enabled in the process.
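
A minimal sketch of that approach, assuming a hypothetical items table with a unique key on id and a disabled flag (table and column names are made up; adjust them to your schema):

-- 1. mark everything as disabled
UPDATE items SET disabled = 1;

-- 2. for each item parsed from the XML (rows can be batched into one
--    multi-row INSERT), insert it or refresh the existing row and re-enable it
INSERT INTO items (id, name, price, disabled)
VALUES (123, 'Example item', 9.99, 0)
ON DUPLICATE KEY UPDATE
    name = VALUES(name),
    price = VALUES(price),
    disabled = 0;

Anything still flagged as disabled afterwards is an item that no longer appears in the XML.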

KOGI
  • I'm dealing with the same issue as you are, but with CSV files. I did this: http://stackoverflow.com/questions/6274576/how-do-i-merge-two-tables-in-mysql-and-where-table-1-is-primary for now, BUT it's slow to update all the fields in a table if there are already 200k+ rows in it, so I'm working on making another table hold a boolean for whether each row needs to be deleted or not. – Brad Apr 09 '12 at 07:03

You could use INSERT DELAYED and UPDATE LOW_PRIORITY to speed up the process.
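
For illustration only, the keywords go straight into the statements you already run (table and column names here are made up); note that INSERT DELAYED only works with storage engines such as MyISAM, not InnoDB:

-- hand the insert off to the server and return immediately
INSERT DELAYED INTO items (id, name, price)
VALUES (123, 'Example item', 9.99);

-- let other clients' reads on the table go first
UPDATE LOW_PRIORITY items SET price = 9.99 WHERE id = 123;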

dynamic
  • `UPDATE LOW_PRIORITY` won't speed things up, but will slow them down for the updates: it means that other connections' `SELECT`s have higher precedence. Those other connections are the ones that get the speedup... – Wrikken Jun 08 '11 at 17:25
  • @Wrikken: I believe he selects all the data before. I don't think he makes 1 select for each node – dynamic Jun 08 '11 at 17:49
  • @yes123: you are missing the point: unless you are the _only_ user of a database, **using UPDATE LOW_PRIORITY slows you down**. – Wrikken Jun 08 '11 at 17:50
  • @Wrikken: actually, isn't UPDATE LOW_PRIORITY meant to speed up the process if you need a lot of updates? – dynamic Jun 08 '11 at 17:51
  • No, that is indeed what `INSERT DELAYED` does; what `UPDATE LOW_PRIORITY` says is _'look hey, database, if you have something better to do than my update, do that first, I can wait'_ – Wrikken Jun 08 '11 at 17:52
  • I still believe that can speed up the process; I may be wrong though. I didn't run any tests on it – dynamic Jun 08 '11 at 18:05
  • Believe what you want to believe. `LOW_PRIORITY` is there to speed up other applications/requests that don't want to wait for your `UPDATE`. I'll end by quoting from the manual: [`With the LOW_PRIORITY keyword, execution of the UPDATE is delayed until no other clients are reading from the table.`](http://dev.mysql.com/doc/refman/5.5/en/update.html), and you can decide whether or not to believe the actual manual. – Wrikken Jun 08 '11 at 18:18
  • So what happens to the current script that is doing the UPDATE LOW_PRIORITY? Will the query take forever if other scripts are reading the table? Or is control instantly returned to the script and the query queued up somewhere? – dynamic Jun 08 '11 at 20:29
  • It's a blocking operation, which means that the query will not be queued, but that your code hangs until the update has run. It will normally get its turn eventually, and when there are no other selects at any one point (and we're talking microseconds here) it will go ahead. So: on a database with not much to do, the performance loss is negligible. On a very busy server with tons of selects, it would take a lot longer than otherwise. `LOW_PRIORITY` is typically something one would use when a change is needed but it doesn't really matter if other scripts get old data for a few seconds. – Wrikken Jun 08 '11 at 23:39