
My crawler crawls websites and extracts metadata from them. I will then run a script to sanitize the URLs and store them in Amazon RDS.

My problem is which datastore I should use to hold the crawled data while it awaits sanitization (deleting unwanted URLs). I don't want the crawler to write directly to Amazon RDS, which would slow it down.

Should I be using Amazon SimpleDB? Then I could read from SimpleDB, sanitize the URLs, and move them to Amazon RDS.
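
Roughly what I have in mind, as a minimal sketch (the domain name, table, sanitize rule, and credentials are placeholders, and I'm assuming boto for SimpleDB and MySQLdb for RDS):

```python
# Minimal sketch of the batch job: read raw URLs from SimpleDB, sanitize them,
# insert the good ones into RDS, and delete the raw items once handled.
# Domain name, table, sanitize rule, and credentials are placeholders.
import boto
import MySQLdb

def sanitize(url):
    # Placeholder rule: keep only http(s) URLs, drop everything else.
    return url if url.startswith(('http://', 'https://')) else None

sdb = boto.connect_sdb()                  # AWS credentials from the boto config / environment
domain = sdb.get_domain('crawled_urls')   # domain the crawler writes into

rds = MySQLdb.connect(host='my-rds-endpoint', user='app', passwd='secret', db='crawl')
cursor = rds.cursor()

for item in domain.select("select * from `crawled_urls` limit 100"):
    clean = sanitize(item['url'])
    if clean is not None:
        cursor.execute("INSERT INTO urls (url, title) VALUES (%s, %s)",
                       (clean, item.get('title', '')))
    domain.delete_attributes(item.name)   # remove the raw item once it has been handled

rds.commit()
```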

Sarvesh

1 Answer


You can always use a db, but the issue is disk access. Every time, you would do a disk read to fetch a bunch of URLs, sanitize them, and then write them to another db, which is another disk access. This process is OK if you aren't concerned about performance.

One solution is to use a data structure as simple as an in-memory list: store a bunch of URLs in it, and have a thread that wakes up when the list hits a threshold, cleans up the URLs, and then writes them to Amazon RDS.
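
Something along these lines, purely as a sketch (the threshold, the sanitize rule, and the RDS write are placeholders):

```python
# Sketch of the in-memory buffer: the crawler appends URLs to a shared list,
# and a worker thread wakes up when the list hits a threshold, cleans the URLs,
# and writes the clean batch to Amazon RDS.
import threading

THRESHOLD = 1000
buffer = []
lock = threading.Lock()
buffer_full = threading.Event()

def write_batch_to_rds(urls):
    # Placeholder: a single bulk INSERT into Amazon RDS would go here.
    print('writing %d clean URLs to RDS' % len(urls))

def url_found(url):
    # Called by the crawler for every URL it discovers.
    with lock:
        buffer.append(url)
        if len(buffer) >= THRESHOLD:
            buffer_full.set()

def sanitizer_worker():
    while True:
        buffer_full.wait()                # sleep until the buffer is full enough
        with lock:
            batch = buffer[:]
            del buffer[:]
            buffer_full.clear()
        clean = [u for u in batch if u.startswith('http')]   # placeholder sanitize rule
        write_batch_to_rds(clean)

worker = threading.Thread(target=sanitizer_worker)
worker.daemon = True
worker.start()
```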

ShadowFax
  • The problem is that I could store it in a list, but the sanitizer script could be an independent worker residing on another machine. I don't think disk access to SimpleDB would be an issue; the major concern is RDS, since that is also frontend-facing. I was planning to sanitize the data periodically in batches. Does that sound good? – Sarvesh Jul 12 '11 at 18:35
  • When the sanitizer script reads URLs from SimpleDB, how is it going to keep track of how many URLs it has already read, and how are you planning to delete the dirty URLs you have already read? – ShadowFax Jul 12 '11 at 18:52
  • I would probably use a list to fetch the URLs from SimpleDB, clean them, and move only the valid ones into RDS. I would update a flag in SimpleDB for each record to mark whether it is valid and has been moved into RDS (roughly as in the sketch after these comments). – Sarvesh Jul 12 '11 at 19:03
  • This looks OK if you aren't concerned about performance, and about the space for the extra column. – ShadowFax Jul 12 '11 at 19:08
  • Yeah, I am not concerned about performance for the crawled data. I just want a few more stages in the pipeline, as it would help in debugging. – Sarvesh Jul 12 '11 at 20:07
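
A rough sketch of the flag-based bookkeeping discussed in the comments (domain and attribute names are made up, boto's SimpleDB interface is assumed, and the actual validation and RDS insert are elided):

```python
# Flag bookkeeping: fetch only items that have not been processed yet and mark
# each one after it has been looked at, so the sanitizer never reads the same
# URL twice. Domain and attribute names are made up.
import boto

def looks_valid(url):
    return url.startswith(('http://', 'https://'))      # stand-in validation rule

sdb = boto.connect_sdb()
domain = sdb.get_domain('crawled_urls')

for item in domain.select("select * from `crawled_urls` where processed is null limit 100"):
    status = 'valid' if looks_valid(item['url']) else 'rejected'
    # ... if valid, insert item['url'] into RDS here ...
    domain.put_attributes(item.name, {'processed': status})   # flag so it is never re-read
```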