
I'm facing the challenge of generating 'balance' values from thousands of entries in a PG table.

The rows in the table have many different columns, each useful in calculating that row's contribution to the balance. Each row/entry belongs to some profile. I need to calculate the balance value for a profile from all entries belonging to that profile, according to a set of rules. Complexity should be O(N), N being the number of entries that belong to the profile.

The different approaches I took:

  1. Fetching the rows, calculating balances on the backend. This doesn't scale well and degrades quickly as the number of entries per profile grows. While fetching the entries is initially fast, once a profile has over 10,000 entries it becomes prohibitively slow. I figured that a lot of time is being spent on transport; additionally, we don't really need the rows, only the balances. Since we already do the work of finding the entries, we might as well calculate the balance in the database, saving both the backend calculations and the transport of thousands of rows. This led to the second approach:
  2. The second approach was creating a PG query that iterates over the rows and calculates the balances (see the schematic query just after this list). This has proven more scalable when there are many entries per profile. However, probably due to the complexity of the query, this approach puts a lot of load on the database: it's enough to run 3-4 of these queries concurrently to max out the database CPU.
  3. The third approach is to create a PL/pgSQL function that loops over the relevant entries and returns only the balances, hoping to reduce the impact on the database. This is the next thing I want to try.
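
To make the gain of approach 2 over approach 1 concrete: only a summary row crosses the wire instead of thousands of entry rows. Schematically (using the column names from the table definition in the EDIT below; $1 stands for the profile id parameter), the in-database aggregation looks like this:

-- approach 2, schematically: aggregate in the database and
-- return one summary row instead of thousands of entry rows
SELECT SUM(amount) AS balance
FROM entries
WHERE profileid = $1;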

The main question is: what would be the most efficient way to achieve this while staying 'database friendly'?

Additionally:

  • Do you think these approaches are sane?
  • Am I missing another obvious solution?
  • Is it unlikely that a function looping over the same rows as the query would improve on the query's performance, or is it worth trying?

I realize I haven't provided a lot of concrete data, but I figured that since this is probably a common problem, maybe the issue can be understood from a general description.

EDIT:

To be a bit more specific, I'm dealing with the following data:

CREATE TABLE entries (
  profileid bigint NOT NULL,
  programid bigint NOT NULL,
  ledgerid text NOT NULL, -- this provides further granularity, on top of 'programid'
  startdate timestamptz,
  enddate timestamptz,
  amount numeric NOT NULL
);
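
All of these approaches filter on profileid first, so I'd expect an index along these lines to carry the lookup (a sketch; the index name is mine, and whether to append the grouping columns is an open question on my side):

CREATE INDEX entries_profileid_idx
    ON entries (profileid, programid, ledgerid);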

What I want to get is the balances for a certain profileid, grouped by (programid, ledgerid).

The desired form is:

RETURNS TABLE (
    programid bigint,
    ledgerid text,
    currentbalance numeric,
    pendingbalance numeric,
    expiredbalance numeric,
    spentbalance numeric
)

The four balance values are produced by applying arithmetic to certain entries. For example, a negative amount only adds to spentbalance, expiredbalance is generated from entries that have a positive amount and an enddate before now(), etc.
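
Expressed as conditional aggregation, the rules take roughly this shape ($1 is the profile id parameter, and the CASE conditions are simplified stand-ins for the real rules):

-- one row per (programid, ledgerid) for the given profile
SELECT
  programid,
  ledgerid,
  COALESCE(SUM(CASE WHEN amount > 0 AND startdate <= now()
                     AND (enddate IS NULL OR enddate > now())
                    THEN amount END), 0) AS currentbalance,
  COALESCE(SUM(CASE WHEN amount > 0 AND startdate > now()
                    THEN amount END), 0) AS pendingbalance,
  COALESCE(SUM(CASE WHEN amount > 0 AND enddate <= now()
                    THEN amount END), 0) AS expiredbalance,
  COALESCE(SUM(CASE WHEN amount < 0
                    THEN amount END), 0) AS spentbalance
FROM entries
WHERE profileid = $1
GROUP BY programid, ledgerid;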

While I did manage to create a very large aggregate query along those lines, with many calls to COALESCE(SUM(CASE WHEN ... amount), 0), I was wondering whether I have anything to gain from porting that logic into a PL/pgSQL function. However, when trying to implement the function, I realized I don't know the idiomatic way to iterate over one result set and return a different one, with different columns and rows; the best I could come up with is sketched below. Should I use a temp table for this? That seems like overkill, as this query is expected to execute tens of times every second...
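
For reference, this is the shape of the looping function of approach 3 that I have in mind (an untested sketch: the function name is mine, and the accumulation rules are the same stand-ins as in the query above):

CREATE OR REPLACE FUNCTION profile_balances(p_profileid bigint)
RETURNS TABLE (
    programid bigint,
    ledgerid text,
    currentbalance numeric,
    pendingbalance numeric,
    expiredbalance numeric,
    spentbalance numeric
)
LANGUAGE plpgsql STABLE AS $func$
DECLARE
  r entries%ROWTYPE;
BEGIN
  -- control-break loop: rows arrive sorted by the group key,
  -- and a finished group is emitted whenever the key changes
  FOR r IN
    SELECT * FROM entries e
    WHERE e.profileid = p_profileid
    ORDER BY e.programid, e.ledgerid
  LOOP
    IF r.programid IS DISTINCT FROM programid
       OR r.ledgerid IS DISTINCT FROM ledgerid THEN
      IF programid IS NOT NULL THEN
        RETURN NEXT;                  -- emit the previous group
      END IF;
      programid      := r.programid;  -- start a new group
      ledgerid       := r.ledgerid;
      currentbalance := 0;
      pendingbalance := 0;
      expiredbalance := 0;
      spentbalance   := 0;
    END IF;
    -- stand-in rules, mirroring the aggregate above
    IF r.amount < 0 THEN
      spentbalance := spentbalance + r.amount;
    ELSIF r.enddate IS NOT NULL AND r.enddate <= now() THEN
      expiredbalance := expiredbalance + r.amount;
    ELSIF r.startdate IS NOT NULL AND r.startdate > now() THEN
      pendingbalance := pendingbalance + r.amount;
    ELSE
      currentbalance := currentbalance + r.amount;
    END IF;
  END LOOP;
  IF programid IS NOT NULL THEN
    RETURN NEXT;                      -- emit the last group
  END IF;
END;
$func$;

Per profile, the call site would then be something like SELECT * FROM profile_balances(12345);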

  • I always try to use the set based approach and use just one (big) query to do all the work just once, for all records involved. This works for me, in my area, even for millions of records. – Frank Heikens Oct 05 '22 at 11:31
  • Option 2 whenever possible. When you use SQL queries containing `SUM(something) ... GROUP BY person` aggregations, you can create indexes that will make those queries reasonably fast. You can also consider creating a [materialized view](https://www.postgresql.org/docs/current/sql-creatematerializedview.html) if you can tolerate a little staleness. And, SQL is a declarative language (you say what you want), but you're thinking about your requirement procedurally (you say how to get it). More specific advice must await more details from you. Please edit your question. – O. Jones Oct 05 '22 at 12:55
  • I've added some information. Unfortunately, a materialized view would not work here due to data integrity and latency reasons. – Alechko Oct 05 '22 at 13:11
  • What is the difference between 1) and 3)? You are still returning all the rows to the client, just wrapping the call in pl/pgsql? – jjanes Oct 05 '22 at 14:58
  • Is the problem with 2) that it takes much more CPU on the server than the same thing takes on the client, or is the problem that database CPUs are more precious than the client CPUs? – jjanes Oct 05 '22 at 15:07
  • @jjanes in scenarios 2 and 3 the rows will not be returned to the client. The calculation will be done in the database and only the balances will be returned. The issue is that I want to prevent or reduce degradation over time. As the number of rows per profileid grows, let's say in linear fashion, I want to minimize the number of tasks that depend on the number of entries. The main idea was: we go over them anyway when fetching from the database, so let's calculate the balance in-flight while fetching them, and transfer only the balances over the wire, not the entries. – Alechko Oct 05 '22 at 15:12
  • In terms of resources - database resources are more precious because the client (backend in this scenario) is easily scalable. – Alechko Oct 05 '22 at 15:13
  • pl/pgsql is generally pretty inefficient. It is very unlikely it would be faster than an equivalent query in plain SQL. There are exceptions, but nothing in your description makes me think they would apply to you. Between 1) and 2), it is just a question of nitty-gritty details. How many rows are being returned, how fast is your network, and what is the EXPLAIN (ANALYZE, BUFFERS) for the in-SQL aggregation? In both cases, are they faster after repeat back-to-back execution? – jjanes Oct 05 '22 at 15:42

0 Answers