
I'm starting to build out a project using MySQL, but I'm now wondering whether SimpleDB might be more appropriate. (My reason for considering SimpleDB over another NoSQL solution is that it's easy to use with EC2.)

I have a series of spiders scraping information on widgets using the Python framework Scrapy, with the Django ORM putting the results into a MySQL database. I'll be building a website that makes use of this data. I'm thinking that SimpleDB might be more appropriate because:

  1. Some of the sites have fields specific to them, so the schema may need to change whenever I come across these. SimpleDB allows for a lot more flexibility here.
  2. I'm going to be collecting info on around 5m widgets a year. My sense is that MySQL can handle this, but figuring out the indexes might be a hassle. My understanding is that SimpleDB should offer predictable performance at scale.
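
To illustrate point 1, here is a minimal sketch of the kind of schemaless records I have in mind. It uses plain Python dicts as a stand-in for SimpleDB items, not the SimpleDB API itself, and the item names and fields (`colour`, `warranty_months`, `weight_g`) are made up for illustration.

```python
# In-memory stand-in for schemaless items: each widget is a dict of
# attribute name -> string value, and items need not share attributes.
# All names and values here are hypothetical.
widgets = {
    "site_a:1001": {"name": "Widget A", "price": "19.99", "colour": "red"},
    "site_b:42": {"name": "Widget B", "price": "5.00", "warranty_months": "12"},
}


def with_attribute(items, attribute):
    """Return the names of the items that carry the given attribute."""
    return sorted(name for name, attrs in items.items() if attribute in attrs)


# A new site-specific field needs no schema migration: just store it.
widgets["site_c:7"] = {"name": "Widget C", "price": "7.50", "weight_g": "250"}

print(with_attribute(widgets, "price"))
print(with_attribute(widgets, "weight_g"))
```

In MySQL the same change would mean adding a nullable column or a separate key/value table for the site-specific fields.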

The cons I can see are that writing queries will be more complex, that I'll need to pre-aggregate more, and that I'm generally unfamiliar with NoSQL.
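
By pre-aggregating I mean something like the following rough sketch: each time a widget is stored, a counter is bumped for each of its attribute values, so a report can read the counters instead of scanning every record. The `store_widget` function and the in-memory `records` and `attribute_counts` stores are hypothetical stand-ins; in practice the counts would have to be persisted in the data store too.

```python
from collections import Counter

# Hypothetical in-memory stores; a real pipeline would persist both.
records = {}
attribute_counts = Counter()


def store_widget(widget_id, attributes):
    """Save a widget and update the per-attribute counters in the same step."""
    previous = records.get(widget_id, {})
    # Undo the counts contributed by an earlier version of this widget,
    # so re-scraping the same widget does not double count.
    for key, value in previous.items():
        attribute_counts[(key, value)] -= 1
    for key, value in attributes.items():
        attribute_counts[(key, value)] += 1
    records[widget_id] = dict(attributes)


store_widget("w1", {"colour": "red", "size": "large"})
store_widget("w2", {"colour": "red", "size": "small"})
store_widget("w1", {"colour": "blue", "size": "large"})  # re-scraped, changed

print(attribute_counts[("colour", "red")])
```

The catch is that every aggregate has to be anticipated up front; a report on a combination of attributes that was never counted still needs a full scan.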

Questions:

  1. Which option would you recommend?
  2. How would you approach integrating Python/Django with SimpleDB? Is django-nonrel worth looking at?
  3. Are there any other issues I'll likely encounter with SimpleDB?
  • Scraping and storing data is only 1/2 of the issue. What are you going to be doing with it? In most cases, what you _do_ with the data matters much more than how you get and store it, and that is what should drive decisions like this. – cdeszaq Oct 04 '11 at 13:39
  • Initially a series of reports which will need to use some aggregates, e.g. number of widgets with certain attributes. Later a website that will need to access the individual widget records as well as aggregates – alan Oct 04 '11 at 15:20
  • SimpleDB doesn't provide aggregate functions, which means you would have to pull all the data from the store and aggregate yourself (or pre-aggregate like you indicate, but that only goes so far) to generate the report. Depending on how big your data is, this may not be performant or cost-effective. – lreeder May 10 '13 at 14:37
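
The client-side aggregation described in the last comment can be sketched roughly as follows. SimpleDB returns query results in pages with a continuation token, so the client has to loop over every page and count locally. `fetch_page` below is a hypothetical in-memory stand-in for that call, not a real SimpleDB or boto function, and the data is invented.

```python
from collections import Counter

# Hypothetical data set standing in for the remote store.
ITEMS = [{"colour": c} for c in ["red", "blue", "red", "green", "red"]]
PAGE_SIZE = 2


def fetch_page(next_token=None):
    """Stand-in for a paginated select: returns (items, next_token)."""
    start = next_token or 0
    end = start + PAGE_SIZE
    token = end if end < len(ITEMS) else None
    return ITEMS[start:end], token


def count_by(attribute):
    """Pull every page and aggregate locally, since the store cannot."""
    counts = Counter()
    token = None
    while True:
        items, token = fetch_page(token)
        counts.update(item[attribute] for item in items if attribute in item)
        if token is None:
            return counts


print(count_by("colour"))
```

With millions of widgets, each report would walk the whole domain this way, which is where the performance and cost concern comes from.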

0 Answers