
We're using Postgres as a data warehouse, which we manage with dbt.

In it, there is a very large and thin table like this (~10 million rows):

col1 col2
'value1' 123
... ...
'valueN' 123

Also, there is a data enrichment API that accepts the existing columns as parameters and returns the relevant enriched data:

https://api.myapi.com/&q=value1+123

Each API call returns in ~1 second.

What are performant ways to enrich the data without leaving the data warehouse, such that I can add a third column like so:

col1 col2 enrichedCol
'value1' 123 enriched1
... ... ...
'valueN' 123 enrichedN

Ideally, this is done with SQL or a stored procedure.

Snowflake has External Functions that achieve something similar.

We've considered an asyncio-based approach in a plpython3u stored procedure, but I feel that Postgres should have a better solution.
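Roughly, the plpython3u idea would look something like the following (a simplified synchronous sketch with no asyncio, using only the Python standard library; the endpoint shape and table/column names are just the illustrative ones from above, and at ~1 second per call this naive version is obviously far too slow for 10 million rows):

    -- simplified synchronous sketch; endpoint and names are illustrative
    create extension if not exists plpython3u;

    create or replace function enrich_one(col1 text, col2 int)
    returns text
    language plpython3u
    as $$
    from urllib.parse import quote_plus
    from urllib.request import urlopen

    # hypothetical endpoint shape based on the example URL above
    url = "https://api.myapi.com/?q={}+{}".format(quote_plus(col1), col2)
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
    $$;

    -- e.g.: select col1, col2, enrich_one(col1, col2) as enrichedCol from my_table;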

An explanation of why doing this is an anti-pattern is also acceptable.

Joey Baruch
    For context, what's wrong with grabbing the data from the API at the application level when it's needed? – esqew Dec 15 '21 at 20:51
  • Can't you just add a new column to your existing table instead of duplicating the whole table content into a new one? – Edouard Dec 15 '21 at 20:51
  • There is no application level, this is a dbt project and an OLAP use case – Joey Baruch Dec 15 '21 at 21:22
  • @Edouard - yeah, that works too – Joey Baruch Dec 15 '21 at 21:23
  • dbt is just a code-generating tool: it compiles the models/macros/custom tests/etc. to SQL and executes them against the targeted database(s). If you have to perform API calls within dbt, you have to perform them using SQL, and so use `plpython3u` - that looks like the only solution. But you should consider not following this approach because of the performance issues with a huge dataset. – Dat Nguyen Dec 21 '21 at 08:56

2 Answers


Postgres and dbt guy here. It looks like this would have been better received on the dbt Discourse than on Stack Overflow so far, but IMO it's a great question.

Basically, the way I'm interpreting this, you are asking about patterns, and to be fair, the vast majority of dbt users I've seen have only used one, maybe two, of these architectural "patterns". That will change over time, as I believe Spark, Firebolt, and the "next-gen" datastores will likely support features like "API-to-DB pipelines as SQL" very soon.

I can think of at least 3 options here:

  1. Dump to "data lake"
  2. ETL / ELT Engine
  3. Polling Service (as you suggested)

This would be how I roughly illustrate these patterns for documentation or a project proposal:

[diagram: rough illustration of the three patterns, as used in documentation or a project proposal]

To describe those a little further:

  1. Dump everything from the API in bulk to a file (or, even better, a snapshotted file, in case some data is later redacted) and then make that data available via dbt-external-tables as a view, so that you are always merging the latest data from the archive at query time.

An example of this is using a directory structure on a cloud storage service, as in the following write-up: Loading and transforming data into BigQuery using dbt
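To make that concrete, here is a minimal sketch of the merge view as a dbt model, assuming a hypothetical base_table model and a hypothetical api_dump source registered via dbt-external-tables (all names are illustrative):

    -- models/enriched.sql (hypothetical model; ref/source names are illustrative)
    select
        b.col1,
        b.col2,
        e.enriched_col
    from {{ ref('base_table') }} as b
    left join {{ source('api_dump', 'latest_snapshot') }} as e
        using (col1, col2)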

  2. Use an external service to poll the API with data from the database. (Probably the slowest overall option, given the triple network I/O: read from the DB, read from the API, then write back to the DB.)

** edit ** (Additional point for clarity: this is NOT the same as a read-only source ELT service like Fivetran or Airbyte. Here I'm referring to the pattern of using a service that is capable of a read/update or read/upsert operation, for circumstances where that is required, like watching an Address table in a DB and then enriching that information with a sanitized version of the address from the Google Places API or something equivalent.)

  3. Create a procedure (à la this SO Postgres question) which, combined with a trigger, could return a value from an external service via a standard cURL call. (My concern with this will be maintainability: how will you be alerted if the API changes and returns unexpected values, how do you log the timing and success of the procedure calls, does your DE have control over the DB as well as the dbt code base and can they perfectly anticipate how dbt changes will affect this proc, etc.?)
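For illustration only, a sketch of option 3, assuming PostgreSQL 11+, the pgsql-http extension, and an API that returns the enriched value as plain text (table/column names are made up):

    -- sketch only: assumes the pgsql-http extension and a plain-text API response
    create extension if not exists http;

    create or replace function enrich_row() returns trigger
    language plpgsql
    as $$
    begin
        select content
        into new.enriched_col
        from http_get('https://api.myapi.com/?q=' || new.col1 || '+' || new.col2);
        return new;
    end;
    $$;

    create trigger trg_enrich
    before insert on my_table
    for each row
    execute function enrich_row();

Note that with a ~1 second API response, a per-row trigger like this makes every insert roughly a second slower, on top of the maintainability concerns above.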

Happy to expand further but would generally recommend re-opening this question over in the [postgres][rest][webservices] tags since dbt is mostly irrelevant to the question so far.

sgdata
  • Thanks, @sgoley! Do you have any literature recommendations for option 2 (dbt + airflow)? – Joey Baruch Dec 21 '21 at 16:05
  • @JoeyBaruch So personally, I would prefer to discuss this more directly with you (see profile to msg on LinkedIn, etc.) since that is very close to exceeding the "tool-recommendation" guideline of Stack Overflow. However, I will try adding some generic examples to my answer with an edit. It may also be better to ask on the /r/dataengineering reddit as well! – sgdata Dec 22 '21 at 17:53

PG should deal with 10 million rows just fine! :) I would add the new column to the table, allowing nulls, and then update it in small chunks.

The key factor here is not to lock your table for too long. Smaller transactions are also better in terms of reusing space.

The first step keeps the exclusive lock really short, because adding a nullable column only changes the table's metadata: since nulls are allowed, the new column is null by default, and if a row's tuple physically ends before that column, everything after it is null by design.

The updates themselves should be done in chunks. That way the free space map (FSM) can absorb each chunk, and the next chunk can reuse the space left by the previous one. Read about MVCC if you need more information. Updates also touch indexes (MVCC explains that as well), which adds to the FSM churn, so if you rely on a particular physical data order or index structure, be aware.

As for the anti-pattern question: if you have all the data on the server and your PG is reasonably recent, you can do it with a stored procedure or an anonymous block - just commit every 1000-2000 updates. The other option is an external script; I would recommend running it on the server under screen, just as good practice. Either way it should simply pick the next 1000-2000 rows that are still null, so a failure is not a big deal. If that is too much for the server, due to high load from other sources, just add some sleep between chunks.
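A rough sketch of that chunked approach, assuming PostgreSQL 11+ (procedures with transaction control) and a hypothetical enrich(text, int) function that wraps the API call (for example via plpython3u or the http extension); table and column names are illustrative:

    -- sketch only: enrich() is a hypothetical wrapper around the API call
    create or replace procedure enrich_in_chunks(chunk_size int default 1000)
    language plpgsql
    as $$
    declare
        rows_done int;
    begin
        loop
            -- take the next batch of not-yet-enriched rows
            with batch as (
                select ctid
                from my_table
                where enriched_col is null
                limit chunk_size
                for update skip locked
            )
            update my_table t
            set enriched_col = enrich(t.col1, t.col2)
            from batch
            where t.ctid = batch.ctid;

            get diagnostics rows_done = row_count;
            exit when rows_done = 0;

            -- small transactions: locks stay short and the FSM can reuse space
            commit;
        end loop;
    end;
    $$;

    call enrich_in_chunks(2000);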

Michał Zaborowski