I manage a health-care database hosted on AWS RDS. The system info is below:

  • PostgreSQL 9.6
  • 8 vCPUs and 16GB RAM
  • DB size is currently 35GB

The problem: I want to join a few thousand users in the accounts table with other health-metric tables (up to 10 tables, with a few million records each) to build a custom data report (using Google Data Studio). Here is what I did:

  • Join all the needed tables into one materialized view.
  • Feed Google Data Studio from this materialized view.

But I have waited 10 hours and it still runs without end. I thought it would never finish. Does anyone have experience with huge data reports? Just give me the keywords.

Here is my materialized view definition:

CREATE MATERIALIZED VIEW report_20210122 AS
SELECT /* long, but simple list */
FROM accounts
INNER JOIN user_weartime ON accounts.id = user_weartime.user_id
INNER JOIN admin_exchanges ON accounts.id = admin_exchanges.user_id
INNER JOIN user_health_source_stress_history ON accounts.id = user_health_source_stress_history.user_id
INNER JOIN user_health_source_step_history ON accounts.id = user_health_source_step_history.user_id
INNER JOIN user_health_source_nutri_history ON accounts.id = user_health_source_nutri_history.user_id
INNER JOIN user_health_source_heart_history ON accounts.id = user_health_source_heart_history.user_id
INNER JOIN user_health_source_energy_history ON accounts.id = user_health_source_energy_history.user_id
INNER JOIN user_health_source_bmi_history ON accounts.id = user_health_source_bmi_history.user_id
WHERE accounts.id IN (/* 438 numbers */);

Tai Tran

3 Answers

Creating a materialized view for a huge join is probably not going to help you.

You didn't show us the query for the report, but I expect that it contains some aggregate functions, since you don't want to report millions of rows of raw data.

First, make sure that you have all the appropriate indexes in place. Which indexes you need depends on the query. For the one you are showing, you would want an index on accounts(id), and (if you want a nested loop join) on admin_exchanges(user_id), and similarly for the other tables.
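
For example, assuming the join columns from the view definition shown in the question (accounts.id is probably already covered by the table's primary key):

```sql
CREATE INDEX ON admin_exchanges (user_id);
CREATE INDEX ON user_weartime (user_id);
CREATE INDEX ON user_health_source_step_history (user_id);
-- ...and the same for the remaining user_health_source_*_history tables
```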

But to find out the correct indexes for your eventual query, you'd have to look at its execution plan.
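
For example, to inspect the plan of the view's defining query (a sketch; substitute your actual column list and joins):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT accounts.id  /* ...rest of the column list... */
FROM accounts
INNER JOIN user_weartime ON accounts.id = user_weartime.user_id
/* ...the other joins from the question... */
WHERE accounts.id IN (/* 438 numbers */);
```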

Sometimes a materialized view can help considerably, but typically by pre-aggregating some data.
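
A sketch of such a pre-aggregating materialized view, assuming hypothetical columns value and recorded_at on the step-history table:

```sql
CREATE MATERIALIZED VIEW step_daily AS
SELECT user_id,
       date_trunc('day', recorded_at) AS day,
       sum(value) AS steps
FROM user_health_source_step_history
GROUP BY user_id, date_trunc('day', recorded_at);

-- refresh periodically, e.g. from a cron job:
REFRESH MATERIALIZED VIEW step_daily;
```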

If you join more than 8 tables, increasing join_collapse_limit can give you a better plan.
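
For example, set it in the same session before creating the materialized view (both settings default to 8, and the query above joins 9 tables):

```sql
SET join_collapse_limit = 10;
SET from_collapse_limit = 10;
CREATE MATERIALIZED VIEW report_20210122 AS /* ...query as above... */;
```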

Laurenz Albe

I would ask why you are using a direct connection to PostgreSQL to display data in Data Studio. Although this is supported, it only makes sense if you don't want to invest time in developing a good data flow (that is, your data is small) or if you want to display real-time data.

But since your data is huge and you're using a materialized view, I guess neither of these is the case.

I suggest you move to BigQuery. Data Studio and BigQuery play really nicely together, and BigQuery is made to process huge amounts of data very fast. I bet your query would run in seconds in BigQuery and cost only cents.

Sadly, BigQuery only supports Cloud SQL external connectors and can't connect directly to your AWS RDS service. You'll need to write an ETL job somewhere, or move your database to Cloud SQL for PostgreSQL (which I recommend, if possible).
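
A minimal ETL sketch, assuming the psql and bq command-line tools; the connection string, dataset, and file names are placeholders:

```shell
# export one table from RDS as CSV
psql "$RDS_CONNECTION_STRING" \
  -c "\copy user_weartime TO 'user_weartime.csv' WITH (FORMAT csv, HEADER)"

# load it into BigQuery, letting bq detect the schema
bq load --source_format=CSV --skip_leading_rows=1 --autodetect \
  my_dataset.user_weartime user_weartime.csv
```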

Check out these answers if you're interested in transferring data from AWS RDS to BigQuery:

Diego Queiroz

I changed my approach and now know how to do it: use FULL JOIN ON start_date AND user_id, so each health metric becomes a column in one huge view. My report now has more than 500k rows and 40 columns, but the view creation is still very fast, and so are queries on the view.
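
The approach above can be sketched with two of the metric tables (steps and heart_rate are hypothetical column names; start_date is assumed to be the common date column):

```sql
CREATE MATERIALIZED VIEW report_wide AS
SELECT COALESCE(s.user_id, h.user_id)       AS user_id,
       COALESCE(s.start_date, h.start_date) AS start_date,
       s.steps,
       h.heart_rate
FROM user_health_source_step_history s
FULL JOIN user_health_source_heart_history h
       ON s.user_id = h.user_id
      AND s.start_date = h.start_date;
-- chain further FULL JOINs for the remaining metric tables
```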

Tai Tran