
We're currently using Postgres 9 on Amazon's EC2 and are very satisfied with the performance. Now we're looking at adding ~2TB of data to Postgres, which is larger than our EC2 small instance can hold.

I found S3QL and am considering using it to mount S3 as a filesystem and moving the Postgres data directory onto that mount. Has anyone had experience doing this? I'm mainly concerned with performance (frequent reads, less frequent writes). Any advice is welcome, thanks.
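
One way to gauge the performance question up front would be to compare random 8 kB reads (Postgres' page size) on local storage against a candidate S3QL mount, along the lines of the rough sketch below. The paths, mount point, and file size are placeholders, and the result will be optimistic unless the test file is larger than both RAM and S3QL's local cache:

```python
import os
import random
import time

BLOCK_SIZE = 8192           # Postgres' default page size
FILE_SIZE = 256 * 1024**2   # 256 MB test file; use something larger than RAM for a fair test
NUM_READS = 1000

def make_test_file(path):
    """Write FILE_SIZE bytes of incompressible data to path."""
    chunk = os.urandom(1024 * 1024)
    written = 0
    with open(path, "wb") as f:
        while written < FILE_SIZE:
            f.write(chunk)
            written += len(chunk)

def avg_random_read(path):
    """Average seconds per random BLOCK_SIZE-byte read from path."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = time.time()
        for _ in range(NUM_READS):
            f.seek(random.randrange(0, size - BLOCK_SIZE))
            f.read(BLOCK_SIZE)
    return (time.time() - start) / NUM_READS

# "/mnt/s3ql" is a placeholder for wherever the S3QL filesystem is mounted.
for label, directory in [("local disk", "/var/tmp"), ("s3ql mount", "/mnt/s3ql")]:
    path = os.path.join(directory, "readtest.bin")
    make_test_file(path)
    print("%-10s %.4f s per %d-byte random read" % (label, avg_random_read(path), BLOCK_SIZE))
    os.remove(path)
```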

Andrew Gaul
Growth Mindset
  • Very interesting. Sounds like cloud-NFS on steroids. The name, however, does not sound "accurate" -- why they made it look like "SQL", I don't know -- it sounds like a cloud filesystem and *not* a database. I would suspect trying to use databases over S3QL would be ... interesting, just as it is with NFS, etc. In any case, the entire structure of the data would have to be re-examined. –  Dec 14 '11 at 18:55
  • From the linked site: "*S3QL is a standard conforming, full featured **UNIX file system** that is conceptually indistinguishable from any local file system.*" –  Dec 14 '11 at 18:58
  • 1
    I know the PostgreSQL performance tuning manuals advise against NFS or "other remote file systems" due to performance issues and possible data corruption. I'd imagine the link between EC2 and S3 is pretty dang fast though, but 2TB is a huge amount of data to be serializing across the wire. I'd say do some serious testing on this before you commit.. – Mike Christensen Dec 14 '11 at 19:25

1 Answer


My advice is "don't do that". I don't know the context of your problem, but my guess is that a solution doesn't have to involve doing bulk data processing through PostgreSQL. The whole reason grid processing systems were invented was to solve the problem of analyzing large data sets. I think you should consider building a system that follows standard BI practices for extracting dimensional data. Then take that normalized data and, assuming it's still pretty large, load it into Hadoop/Pig. Do your analysis and aggregation there. Dump the resulting aggregate data into a file and load that into your PG database alongside the dimensions.
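
As a rough illustration of the last step in that pipeline, the aggregate file could be bulk-loaded with COPY rather than individual INSERTs. The table names, columns, file path, and connection details below are all hypothetical:

```python
import psycopg2

# Connection string and all table/column/file names below are made up.
conn = psycopg2.connect("dbname=analytics user=etl")
cur = conn.cursor()

# The aggregate fact table sits alongside the dimension tables already in Postgres.
cur.execute("""
    CREATE TABLE daily_sales_agg (
        day        date,
        product_id integer REFERENCES dim_product(id),
        region_id  integer REFERENCES dim_region(id),
        total      numeric
    )
""")

# COPY streams the whole file in one round trip, which is much faster than
# row-by-row INSERTs for a bulk load of this size.
with open("/data/pig_output/daily_sales_agg.tsv") as f:
    cur.copy_expert("COPY daily_sales_agg FROM STDIN WITH (FORMAT text)", f)

conn.commit()
cur.close()
conn.close()
```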

Andrew
  • Yes, we gave this a shot and the results weren't good: SUPER slow data access, and expensive to boot. In the end we were able to use Elasticsearch to index the data the way we wanted with a parent/child doc relationship. Works great for us. – Growth Mindset Jun 18 '12 at 22:09
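
For what it's worth, the parent/child indexing the commenter describes maps onto Elasticsearch's `_parent` mapping as it existed in pre-5.x versions. A rough sketch against the REST API, with index, type, and field names invented for illustration:

```python
import json
import requests

ES = "http://localhost:9200"   # placeholder cluster address

# Create an index where "event" documents declare "account" as their parent type.
mapping = {
    "mappings": {
        "account": {
            "properties": {"name": {"type": "string"}}
        },
        "event": {
            "_parent": {"type": "account"},
            "properties": {"payload": {"type": "string"}}
        }
    }
}
requests.put("%s/activity" % ES, data=json.dumps(mapping))

# Index a parent, then a child routed to that parent via the ?parent= parameter.
requests.put("%s/activity/account/42" % ES, data=json.dumps({"name": "Acme"}))
requests.put("%s/activity/event/1" % ES,
             params={"parent": "42"},
             data=json.dumps({"payload": "login"}))

# Children can then be queried relative to their parents with has_parent /
# has_child queries.
```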