
Here is the thing.

I have a report database that Pentaho uses for generating reports. This DB runs on the same machine as pentaho-server (v7.1).

This report DB is being filled from about 90 other databases spread across the country, and their number is increasing.

Because data-integration is also a Java application, it started to require too much computing power and the Pentaho web app became too slow. So we moved the fetches to separate machines: the Java apps now run there and load data into the report DB on the web server.

BUT this change did not bring the expected results, even though it decreased the load average on the main machine significantly (from about 70 to about 12).

Postgres itself still consumes too much power (and is too slow), because there are constantly 20–30 processes on the other machine feeding the report DB with new data. There are about 90 fetch processes in total; they never all run at once, but there are also never fewer than 20 running at once.

I was expecting the new machine, where the fetches run, to have a high load average, while the web server would have a low load average whenever no report is being generated.

So my question is: how can I make the fetches use the computing power of the secondary machine while loading data into the primary machine?

(I was also thinking about writing my own script in Python that would do fewer DB operations during a fetch, but that wouldn't solve my problem, just buy me more time.)
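A minimal sketch of what such a Python loader could look like, assuming psycopg2 and bulk loading via `COPY` instead of row-by-row `INSERT`s (the `conn`, table, and column names here are made-up examples, not anything from my actual setup):

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize fetched rows into an in-memory CSV buffer for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    buf.seek(0)
    return buf


def copy_rows(conn, table, columns, rows):
    """Bulk-load rows into the report DB with a single COPY statement.

    `conn` is assumed to be an open psycopg2 connection; one COPY
    replaces thousands of individual INSERT round trips.
    """
    buf = rows_to_csv_buffer(rows)
    copy_sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql, buf)
    conn.commit()
```

Usage would be something like `copy_rows(conn, "report_sales", ["id", "amount"], fetched_rows)` — but as I said, this only reduces the per-row overhead on the primary, it doesn't move the work elsewhere.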

I was looking at Citus, but I am not sure it is exactly what I need, or whether it makes sense on just 2 machines.

So basically my question is: is there any way to use the computing power of my PC when inserting data into a remote DB?

The more native to Postgres the solution is, the better — ideally without the need for any 3rd-party software.

rRr
  • Sharding with foreign data wrappers comes to mind –  Dec 05 '17 at 12:09
  • I'd suggest to identify first what kind of resource is a bottleneck for your use case. Usually storage write speed limits insert/update performance (at least for a single process), storage read speed can limit report generation, but none of these resources affects responsiveness of web-server. If your performance is really disk-bound, then sharding seems to me the only option to utilize other machines. Other options could be: faster storage, organize integration to utilize batch loading and avoid lock waits for concurrent writes, optimize DB structure to require less space (mind indexes too). – Andrei Luksha Dec 06 '17 at 10:46
  • Forgot to mention that disk write/read speed may affect performance of order by/group by/distinct operations, if the queried dataset does not fit into work_mem. It also may be an important factor while optimizing performance. In any case, it would be good to know the actual bottleneck for most critical use-cases. – Andrei Luksha Dec 06 '17 at 10:54
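To make the first comment's sharding suggestion concrete, here is a rough sketch of the `postgres_fdw` setup it implies, generated as SQL statements from Python so the same template can be applied per shard (the server, host, and credential names are illustrative, not taken from the question):

```python
def fdw_setup_sql(server, host, dbname, user, password):
    """Build the postgres_fdw statements that expose a remote
    PostgreSQL database as local foreign tables.

    All identifiers passed in are example values; in real use they
    should come from trusted config, since this simple string
    formatting does no quoting/escaping.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        "CREATE SERVER {s} FOREIGN DATA WRAPPER postgres_fdw "
        "OPTIONS (host '{h}', dbname '{d}');".format(s=server, h=host, d=dbname),
        "CREATE USER MAPPING FOR CURRENT_USER SERVER {s} "
        "OPTIONS (user '{u}', password '{p}');".format(s=server, u=user, p=password),
        "IMPORT FOREIGN SCHEMA public FROM SERVER {s} INTO public;".format(s=server),
    ]
```

Running these on the primary would let it query tables that physically live (and are written) on the secondary machine, which is one way to push the insert workload off the web server.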

0 Answers