
I am trying to store a large amount of daily weather data in a PostgreSQL database. This might not seem like much data, but there are roughly 95,000 stations with daily records going back as much as 100 years. That works out to billions of records: 95,000 * 365 * 100 = 3,467,500,000. Although this is an overestimate, it still seems impractical to me to store all the daily data in a single table with station IDs as a foreign key mapping to another table with the station information. What would be the best way to structure this data for querying the data series by station? Should I create a table for each station (which would result in 95,000 tables), or should I try something broader, like a table for each region? What are the advantages and disadvantages? Any help is greatly appreciated.

My data looks like this:

Stations
*ID
-longitude
-latitude
-elevation
-country
-state
-name
...

Weather
*Station ID
*Date
-Precipitation
-High Temp
-Low Temp
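
For reference, the single-table design under discussion might be sketched as follows (column names and types are illustrative; the question only lists a sample of fields):

```
CREATE TABLE stations (
    id        integer PRIMARY KEY,
    longitude double precision,
    latitude  double precision,
    elevation real,
    country   text,
    state     text,
    name      text
    -- ...
);

CREATE TABLE weather (
    station_id    integer NOT NULL REFERENCES stations (id),
    date          date    NOT NULL,
    precipitation real,
    high_temp     real,
    low_temp      real,
    PRIMARY KEY (station_id, date)
);
```

The composite primary key matches the two starred (key) columns in the sketch above.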
klib
  • Why not use table partitioning? The database takes care of creating & maintaining the 95,000 separate tables for you: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html – Code Different Jun 03 '15 at 13:36
  • 1
    Alas, in PostgreSQL there's no built-in partitioning, you have to basically roll your own or use external tools like pg_partman. It also doesn't scale well to many hundreds or thousands of tables. I strongly suspect the best option is keeping things simple with a few big tables. – Craig Ringer Jun 03 '15 at 13:46
  • Partitioning by date seems the most logical. At 34M rows/year, it could be per year, or per 5 or 10 years. – wildplasser Jun 03 '15 at 14:02
  • @wildplasser what's the point of partitioning by date if OP would like to query data by station? – Radek Postołowicz Jun 03 '15 at 14:31
  • You might have a point. But I suspect the station_ids will not be distributed uniformly, so the split could get ugly. (95K partitions is not an option; he'll have to aggregate anyway.) – wildplasser Jun 03 '15 at 14:37
  • The numbers are no sweat for an SQL platform, no table partitioning or other methods necessary. The code remains simple, you don't have to look in 95,000 places for one thing, and aggregation (by Station or Date or Precipitation/High/Low) is effortless. But PostgreNONsql will cack itself. – PerformanceDBA Jun 05 '15 at 19:43

1 Answer


That's not really enough information.

What are you optimizing for: query performance, disk usage, update speed?

  • What kinds of queries are you running?
  • Are you typically fetching all data for a station (seems unlikely)? A date range?
  • If you query by date, what's the typical resolution: day, month, year?
  • Are those all the fields in the 'weather' table, or is that just a sample?
  • Do you typically retrieve a single value, or many different ones?
  • Are you just retrieving those values, or doing aggregation/analytics in the database?
  • What is acceptable query performance for you?

Depending on your answers to those, it may make sense to "bunch up" your data (store more than one day per record; I'm assuming 'date' means a single day, or is it more granular?) to reduce the total number of rows. Postgres has a relatively high per-row overhead: in your estimate, just the row headers would take up ~75 GB.
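
The arithmetic behind that estimate, for the curious (assuming the 23-byte heap tuple header documented in PostgreSQL's storage page layout; column data and alignment padding come on top of this):

```python
# Rough arithmetic behind the ~75 GB figure above: PostgreSQL stores a
# 23-byte tuple header per heap row, before any actual column data.
rows = 95_000 * 365 * 100        # the question's upper-bound row count
header_bytes = 23                # per-row heap tuple header
overhead_gb = rows * header_bytes / 1024**3
print(f"{rows:,} rows -> {overhead_gb:.0f} GB of row headers alone")
# prints: 3,467,500,000 rows -> 74 GB of row headers alone
```

Bunching a month of readings into one row would cut that overhead by a factor of ~30, which is why the row count itself is worth questioning.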

Alternatively, you may want to investigate something like this: https://github.com/citusdata/cstore_fdw

The advantages of using more tables are smaller index sizes and (possibly) physical data locality. In the extreme case of one table per station_id (not practical in your case), you wouldn't need an index on station_id at all, and the query may end up being a simple seq scan over the data you need.
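
If you do keep everything in one big table, a composite index leading on station_id gives you the same per-station locality for range scans. A sketch, assuming the column names from the question (the station id value is just a placeholder):

```
-- One b-tree serves both "all data for a station" and
-- "a date range for a station" queries.
CREATE INDEX weather_station_date_idx
    ON weather (station_id, date);

-- A typical per-station range query this index supports:
SELECT date, precipitation, high_temp, low_temp
FROM   weather
WHERE  station_id = 12345
AND    date BETWEEN '1990-01-01' AND '1990-12-31';
```

Periodically CLUSTER-ing the table on that index would additionally give you the physical data locality mentioned above, at the cost of an exclusive lock while it runs.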

The disadvantages are that many db operations involve a linear scan over all tables (especially during planning) and more complexity in managing the database.

The typical advice is to keep the number of tables to a few hundred to maybe a few thousand. Unless, of course, you have an atypical case, and you've tested it, and it works for you.
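
(A note for later readers: the comments above predate PostgreSQL 10, which added declarative partitioning. A coarse range partition on date, one of the schemes suggested in the comments, can now be written roughly like this; the 9.x-era approach required inheritance-based setups or tools like pg_partman instead.)

```
CREATE TABLE weather (
    station_id    integer NOT NULL,
    date          date    NOT NULL,
    precipitation real,
    high_temp     real,
    low_temp      real
) PARTITION BY RANGE (date);

CREATE TABLE weather_1990s PARTITION OF weather
    FOR VALUES FROM ('1990-01-01') TO ('2000-01-01');
CREATE TABLE weather_2000s PARTITION OF weather
    FOR VALUES FROM ('2000-01-01') TO ('2010-01-01');
```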

Dmitri