
Problem statement: I have to develop a solution in which a single source populates a table. Once the table is populated, it is considered read-only, and then we run many read-only queries against it. Such read-only tables are generated by multiple simulation runs: each simulation populates its own independent table, so there are no cross-writes between tables. However, a read-only query may run against a single table or against multiple tables. In my environment I have plenty of machines to run the simulations, but I cannot dedicate them to a PostgreSQL compute farm the way a cloud solution would, so I cannot keep long-running PostgreSQL server processes on these machines, which is how the product is intended to be used.

My idea is as follows.

Stage 1: an ad-hoc server+client for populating a table. Start a server and client on a local machine, populate the table, then stop the server and client. The data-dir is hosted on a central file system (e.g. NFS).

Stage 2: an ad-hoc server+client for querying the now read-only table(s) from stage 1. Start a server and client on a local machine, run the read-only queries, then stop the server and client. In order to implement stage 2 I will do the following (a rough sketch follows the list):

  1. Create a new, empty ad-hoc data-dir.
  2. Create a soft link in the ad-hoc data-dir for each table file (including its index files) that is needed by the query, pointing at the file in its original data-dir. Note that files from multiple data-dirs can be linked into the ad-hoc data-dir.
  3. Update the PostgreSQL catalog tables in the ad-hoc data-dir according to the soft links above.
  4. To guarantee that there will be no modifications of the read-only table files, I will implement a table AM (access method) that registers ONLY the table AM callback functions relevant for running read-only queries. Since it is possible to run multiple queries on each table, there can be multiple instances of the client+server described in stage 2 running simultaneously.
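
To make this concrete, here is roughly what I have in mind for stage 1 and for steps 1–2 of stage 2, as a minimal Python sketch. The NFS path, port, table name and helper names are placeholders of mine, not existing tooling, and step 3 (the catalog updates) is deliberately omitted:

```python
#!/usr/bin/env python3
"""Rough sketch of the ad-hoc server lifecycle (stage 1 and stage 2, steps 1-2).

Assumptions not in the description above: PostgreSQL binaries are on PATH,
the central NFS mount is /nfs/simulations, the port is free, and sim_id is a
valid SQL identifier.  Step 3 (making the ad-hoc cluster's catalogs agree
with the linked files) is not shown; that is the hard, fragile part.
"""
import os
import subprocess

NFS_ROOT = "/nfs/simulations"   # assumed location of the central file system
PORT = "54321"                  # assumed free local port


def run_simulation_stage1(sim_id: str) -> None:
    """Stage 1: init a data-dir on NFS, start a throwaway server,
    populate one table, then stop the server."""
    datadir = os.path.join(NFS_ROOT, sim_id)
    subprocess.run(["initdb", "-D", datadir], check=True)
    subprocess.run(["pg_ctl", "-D", datadir, "-w",
                    "-o", f"-p {PORT}", "start"], check=True)
    try:
        subprocess.run(["psql", "-p", PORT, "-d", "postgres", "-c",
                        f"CREATE TABLE results_{sim_id} (run int, value float8)"],
                       check=True)
        # ... COPY / INSERT the simulation output into results_<sim_id> here ...
    finally:
        subprocess.run(["pg_ctl", "-D", datadir, "-m", "fast", "stop"],
                       check=True)


def build_adhoc_datadir_stage2(adhoc_dir: str, source_files: dict) -> None:
    """Stage 2, steps 1-2: create a fresh data-dir and soft-link the relation
    files of the read-only tables into it.  source_files maps an absolute path
    in a source cluster (base/<database oid>/<relfilenode>, plus index files
    and the _fsm/_vm forks) to the relative path it should have inside the
    ad-hoc cluster; computing those paths and fixing the catalogs (step 3)
    is not handled here."""
    subprocess.run(["initdb", "-D", adhoc_dir], check=True)
    for src, rel_target in source_files.items():
        target = os.path.join(adhoc_dir, rel_target)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.symlink(src, target)
```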

Any thoughts? Can it work? My concern is with the process described in stage 2, item 4: can I truly rely on the callback functions used by read-only queries not updating the read-only table files behind the scenes? Any other suggestions for developing and maintaining a sustainable solution?
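
A lighter-weight guard I have considered instead of a custom table AM for item 4 is simply starting the stage-2 ad-hoc server with default_transaction_read_only=on. That only rejects write statements at the SQL level and a session can still override it, so it is a guard rather than a guarantee about the files, but it needs no C code. A sketch, under the same placeholder assumptions as above:

```python
# Sketch: start the stage-2 ad-hoc server with writes disabled by default.
# default_transaction_read_only is a standard GUC, but it can be overridden
# per session, so it is a soft guard, not proof that the files stay untouched.
import subprocess

def start_adhoc_readonly(adhoc_dir: str, port: str) -> None:
    subprocess.run(
        ["pg_ctl", "-D", adhoc_dir, "-w",
         "-o", f"-p {port} -c default_transaction_read_only=on",
         "start"],
        check=True)
```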

Thanks, Maoz

  • What are the reasons to make it so complicated? – Frank Heikens Feb 17 '21 at 08:52
  • 1) I have a lot of data from millions of simulation runs. 2) I don't need to keep data from old simulation runs (e.g. older than one month) because it becomes obsolete. 3) There is a single writer (table population) at the beginning and afterwards only read-only queries, which can run from several machines simultaneously. 4) There are no dedicated machine(s) to run 24/7 postgres server(s). 5) We are unlikely to spend $$$ on a postgres compute farm and a database administrator; I have to use the available compute environment and find a technological solution for it, including patching/hacking postgres code. – Maoz Feb 17 '21 at 09:22

2 Answers


That is a crazy idea. You are thinking in terms of a file server.

The solution is to have one central database server that keeps running and contains the tables you want to query. Whoever needs data connects to the database server and runs a query.
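
For example, a minimal sketch of that setup, assuming one always-running instance reachable as db-central with a database simdb and the psycopg2 driver (the host, database, and table names are only placeholders): each simulation writes its own table once, and any machine can then run read-only queries against one or more of those tables.

```python
"""Sketch of the central-server approach: one long-running PostgreSQL
instance, one table per simulation run, readers connect from anywhere."""
import psycopg2

def write_simulation(sim_id: str, rows) -> None:
    """One simulation run populates its own table, then never writes again.
    sim_id is assumed to be a safe SQL identifier."""
    with psycopg2.connect(host="db-central", dbname="simdb") as conn:
        with conn.cursor() as cur:
            cur.execute(
                f"CREATE TABLE IF NOT EXISTS results_{sim_id} (run int, value float8)")
            cur.executemany(
                f"INSERT INTO results_{sim_id} VALUES (%s, %s)", rows)

def read_results(sql: str):
    """Any machine can run read-only queries, possibly joining several tables."""
    with psycopg2.connect(host="db-central", dbname="simdb") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
```

Dropping a simulation's table once it is obsolete is then a single DROP TABLE, with no file juggling.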

If you need an embedded solution, you'll have to use something other than PostgreSQL.

Laurenz Albe

Update the PostgreSQL catalog tables in the ad-hoc data-dir according to the soft links above

Please don't do this. If you mean you are actually going to update the catalog tables by hand, that is a very quick way to destroy your database.

Other than that, what Laurenz said. A single server should be able to do all of this; at a stretch, add some replication to scale out and to separate the reads from the write-based simulation. But it looks like you have designed a complex solution before knowing whether you need it and before testing the capacity/IO demands of a simpler one.