
I have lat/lon coordinates in a partitioned MySQL table with 400 million rows. The table grows at about 2,000 records a minute, and old data is flushed every few weeks. I am exploring ways to do spatial analysis of this data as it comes in.

Most of the analysis requires finding whether a point is in a particular lat/lon polygon or which polygons contain that point.

I see the following ways of tackling the point in polygon (PIP) problem:

  1. Create a MySQL function that takes a point and a Geometry and returns a boolean. Simple, but I am not sure how Geometry can be used to perform operations on lat/lon coordinates, since Geometry assumes a flat surface rather than a sphere.

  2. Create a MySQL function that takes a point and the identifier of a custom data structure and returns a boolean. The polygon vertices can be stored in a table, and a function can compute PIP using spherical math. A large number of polygon vertices may lead to a huge table and slow queries.

  3. Leave the point data in MySQL and store the polygon data in PostGIS, then use the app server to run the PIP query in PostGIS, providing the point as a parameter (see the sketch after this list).

  4. Port the application from MySQL to PostgreSQL/PostGIS. This will require a lot of effort in rewriting queries and procedures. I can still do it, but how good is PostgreSQL at handling 400 million rows? A quick Google search for "mysql 1 billion rows" returns many results; the same query for Postgres returns no relevant results.
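
For what it's worth, here is a rough sketch of what the PostGIS side of option 3 could look like (it also addresses the spherical-math concern from option 1). The table, column, and :lon/:lat parameter names are invented for illustration; the geography type is what makes the calculations spherical rather than planar:

    -- Hypothetical polygon table; the geography type uses spherical math
    CREATE TABLE regions (
        id   serial PRIMARY KEY,
        name text,
        geog geography(Polygon, 4326)
    );

    -- GiST index so the bounding-box part of PIP can be answered by the index
    CREATE INDEX regions_geog_idx ON regions USING GIST (geog);

    -- Which polygons contain this point? :lon and :lat are placeholders
    -- bound by the app server.
    SELECT id, name
    FROM regions
    WHERE ST_Covers(geog, ST_SetSRID(ST_MakePoint(:lon, :lat), 4326)::geography);

ST_Covers is used here because the geography type supports it directly; with plain geometry columns, ST_Contains would be the usual choice.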

Would like to hear some thoughts & suggestions.

Dojo
  • I have personal experience running Postgres with 300M+ row tables -- no sweat. Skype uses Pg to track connections, users, accounting, etc. -- everything but the communication channel itself. That's many billions of records. – dbenhur Mar 14 '12 at 07:02
  • So how easy/difficult is it to get to 300M? How much tweaking/optimization did it need? I had read about Skype using Postgres, but big companies can throw resources at anything and get it to work. What I am looking for is input like yours. – Dojo Mar 14 '12 at 07:30
  • Our PostgreSQL database handles up to ~5,000 transactions per second and ~600 million records per month, and has done so for the last 2 years. The previous MySQL server couldn't handle this on the same hardware. – Frank Heikens Mar 14 '12 at 07:36
  • How hard or expensive is it to prototype all four of those options? I'd *guess* that #3 will turn out to have the best cost/performance profile. (Assuming it doesn't require rewriting the application.) – Mike Sherrill 'Cat Recall' Mar 14 '12 at 08:32
  • @Priyank it doesn't take much tweaking. With a default distribution build, Pg is capable of using large memory, disk, and processor resources to handle large data sets with high concurrency. There are a few default configuration settings sized for smaller systems (e.g. shared_buffers), but you shouldn't need to change more than a handful of config values to get good performance out of a substantial machine for your workload. You will want to do the same things you might with other DBs to manage large data: a partitioning strategy, backup and replication, making sure your storage system is up to it, etc. – dbenhur Mar 14 '12 at 16:02
  • I've managed to get a Postgres database with a little over 2.7 billion records in it, and I can get away with queries that tell me how many of those records are within a certain polygon using ST_Contains. In my experience, Postgres scales to billions far better than MySQL does, so I'd recommend option 4 from that standpoint. Just to ask: is this geospatial need purely a reporting need? Would leaving MySQL running for the app and creating a Postgres data warehouse that's loaded nightly from the MySQL database be possible? That gives you a reporting world independent of your prod environment. – Twelfth Mar 14 '12 at 17:08
  • OpenStreetMap uses Postgres; take a look at http://www.openstreetmap.org/ or http://wiki.openstreetmap.org – edze Mar 22 '12 at 10:50
  • I did some tests. The GIS capabilities are quite good, but performance on non-GIS operations is not very impressive. An ordered "select * from foo" against a partition with 500,000 records, of which 5,000 satisfy the where clause, takes about 740 ms in Postgres and just 350 ms in MySQL. I tried some other queries with similar results. And that is with the Postgres data on an SSD and the MySQL data on a regular HDD. – Dojo Mar 23 '12 at 20:24
  • To be fair, the MySQL table is partitioned by week and sub-partitioned into 10 partitions based on a field called stream_id, whereas the Postgres partitions are by day alone. So a Postgres query scans N stream-days of data whereas a MySQL query scans N/10*7 stream-days (where N is the number of streams). The Postgres load is therefore slightly higher, and maybe that has a non-linear effect on query performance. I will try some more things. Any tips are welcome. – Dojo Mar 23 '12 at 20:44
  • @Priyank: I talked to you last time about this post: http://stackoverflow.com/questions/11694794/vehicle-tracking-and-dispatching-system. I was away for a while. Can you read the last comment under that post so that we can talk further? My email address is in my profile, so you can write to it. Thanks. – Eddy Freeman Aug 23 '12 at 07:31
  • @Priyank The thing to keep in mind is that the size of your table isn't really the major concern. An index scan of a table with 100 billion records isn't that much more expensive than an index scan of a table with 100 million records. The question of performance ends up being "what are you doing? How many rows are you processing per run? How expensive is that processing?" The key to managing performance is to manage those questions. If you can't then go Postgres-XC with all the complexity that adds. – Chris Travers Sep 20 '12 at 01:48
  • Just to give you an example from a significantly smaller DB: I have a customer with 10-12M rows in one table, spanning maybe 5-10 years, who needs a query to find out which years have data represented. If you do a select distinct ... or group by, you have something that is not suited to OLTP, but if you write a recursive query that pulls the next lowest year until it runs out, it returns very fast (a sketch of this appears after these comments). So managing the number of rows processed cuts processing time there by more than a factor of 1000. – Chris Travers Sep 20 '12 at 01:51
  • Ok, 6 years after the original question: I ported the app to PostgreSQL (not now, long back) and find it much more powerful than MySQL. PostgreSQL is now my preferred database even when I don't need spatial extensions. Postgres has scaled very well, and the regular version updates inspire confidence in its future. – Dojo Aug 22 '18 at 12:09
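
A minimal sketch of the recursive-query trick Chris Travers describes in the comments above (the "loose index scan" pattern); the table foo, the column yr, and an index on yr are all hypothetical here:

    -- Emulates SELECT DISTINCT yr FROM foo with repeated index probes:
    -- each recursion step fetches the smallest year greater than the last.
    WITH RECURSIVE years AS (
        SELECT min(yr) AS yr FROM foo
        UNION ALL
        SELECT (SELECT min(yr) FROM foo WHERE foo.yr > years.yr)
        FROM years
        WHERE years.yr IS NOT NULL
    )
    SELECT yr FROM years WHERE yr IS NOT NULL;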

2 Answers


A few thoughts.

First, PostgreSQL and MySQL are completely different beasts when it comes to performance tuning. So if you go the porting route, be prepared to rethink your indexing strategies. Not only does PostgreSQL have far more flexible indexing than MySQL, but the table approaches are very different as well, which means the appropriate indexing strategies differ just as much as the tactics do. Unfortunately this means you can expect to struggle a bit. If I could give one piece of advice, I would suggest dropping all non-key indexes at first and then adding them back sparingly as needed.

The second point is that nobody here can give you much practical advice at this point, because we don't know the internals of your program. In PostgreSQL you are best off indexing only what you need, but you can index the output of functions (which is really helpful in cases like this) and you can index only part of a table.

I am more a PostgreSQL guy than a MySQL guy, so of course I think you should go with PostgreSQL. However, rather than telling you why and having you struggle at this scale, I will tell you a few things I would look at using if I were trying to do this:

  • Functional indexes
  • Write my own functions for indexes for related analysis
  • PostGIS is pretty amazing and very flexible
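
As a rough illustration of the first two bullets, here is a sketch with invented table and column names (stream_id is borrowed from the comments above). Note that the expression index assumes recorded_at is a timestamp without time zone, since date_trunc on timestamptz is not immutable:

    -- Functional (expression) index: index the output of an expression
    -- so queries filtering on that expression can use it.
    CREATE INDEX points_day_idx
        ON points (date_trunc('day', recorded_at));

    -- Partial index: index only the slice of the table a hot query touches.
    CREATE INDEX points_stream7_recent_idx
        ON points (recorded_at)
        WHERE stream_id = 7;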

In the end, switching db's at this volume is going to be a learning curve, and you need to be prepared for that. However, PostgreSQL can handle the volume just fine.

Chris Travers

The number of rows is quite irrelevant here. The question is how much of the point-in-polygon work can be done by the index.

The answer to that depends on how big the polygons are.

PostGIS is very fast at finding all points in the bounding box of a polygon. It then takes more effort to find out whether a point is actually inside the polygon.

If your polygons are small (small bounding boxes), the query will be efficient. If your polygons are big, or have a shape that makes the bounding box big, it will be less efficient.

If your polygons are more or less static, there are workarounds. You can divide your polygons into smaller polygons and recreate the index. Then the index will be more efficient.

If your polygons are actually multipolygons, the first step is to split the multipolygons into polygons with ST_Dump and build an index on the result.
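
A rough sketch of that split, assuming a hypothetical source table areas(id, geom) holding multipolygons in SRID 4326, with :lon and :lat as app-supplied placeholders:

    -- Split each multipolygon into its component polygons so every part
    -- gets its own, tighter bounding box in the index.
    CREATE TABLE area_parts AS
    SELECT id, (ST_Dump(geom)).geom AS geom
    FROM areas;

    CREATE INDEX area_parts_geom_idx ON area_parts USING GIST (geom);

    -- Point-in-polygon against the split parts
    SELECT DISTINCT id
    FROM area_parts
    WHERE ST_Contains(geom, ST_SetSRID(ST_MakePoint(:lon, :lat), 4326));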

HTH

Nicklas

Nicklas Avén
  • The individual points (~400 million) are to be stored in the database anyway; PIP is a separate problem. If you are referring to option 2, in that case it's a MySQL table that stores the polygon vertices, and a UDF runs a query on that table to determine the PIP result. – Dojo Mar 23 '12 at 20:15