I have written a feed aggregator before but am trying to optimize it a bit. In the past, using simplepie (php class) to parse the feeds, I have used the get_id() function for each feed item to return a hash (an md5 mix of link + title). I store this "id" as the "remote_id" in MySQL. However to ensure that I have no duplicates I've been doing a SELECT query for each feed item to ensure that the "remote_id" does not exist. This seems inefficient considering I am looking at 1000's of feeds.
Is it most efficient to just turn remote_id into a unique key and then let the database fail to write the new record on each pass? Any other way to engineer this that is better?