
Table products:

id int primary_key

Table transactions:

product_id int references products

The below SQL query is very slow:

SELECT products.* 
FROM   products 
       LEFT JOIN transactions 
              ON ( products.id = transactions.product_id ) 
WHERE  transactions.product_id IS NULL; 

Out of 100 million product records, there might be only 100 records where a product has no corresponding transactions.

This query is very slow; I suspect it is doing a full table scan to find those product records with no matching foreign key.
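That suspicion can be checked against the actual plan; a minimal sketch (note that `ANALYZE` executes the query):

```sql
-- Show what the planner actually does, with buffer statistics:
EXPLAIN (ANALYZE, BUFFERS)
SELECT products.*
FROM   products
       LEFT JOIN transactions ON products.id = transactions.product_id
WHERE  transactions.product_id IS NULL;
```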

I want to create a partial index like (pseudo-code):

CREATE INDEX products_with_no_transactions_index 
ON (Left JOIN TABLE 
    BETWEEN products AND transactions) 
WHERE transactions.product_id IS NULL;

Is the above possible and how would I go about it?

Some characteristics of this data set:

  1. Transactions are never deleted and only added.

  2. Products are never deleted but added at a rate of 100s per minute (obviously this is a made up example behind a much more complex actual use case). A small percentage of those are temporarily orphaned.

  3. I need to frequently query (up to once per minute) and need to always know what the current set of orphaned products are.

Erwin Brandstetter
samol
  • You could add a column 'last_transaction_id' in the products table, set up a trigger on insert, then search on the products table for last_transaction_id is null. – Keith John Hutchison Jan 02 '14 at 00:05
  • I am trying not to touch the existing schema. Is creating an index not possible? Alternatively, I can create any index as long as it doesn't involve changing the schema or writing an insert trigger – samol Jan 02 '14 at 00:07
  • Actually, I am ok with anything that doesn't involve changing the schema as long as I can get a fast look up – samol Jan 02 '14 at 00:08
  • The FK constraint automagically constructs an index for the product_id column. But: you are looking for orphaned rows, and that takes time. They can sit anywhere, and at least the whole index plus the products table (or its index) have to be consulted to find them. You are looking for a hundred needles in a haystack of 100M. Why do you want to find them? If this is part of your core logic, something in your data model is seriously wrong, IMHO. If it is only maintenance: deal with it. – wildplasser Jan 02 '14 at 00:09
  • By the way, I am on PostgreSQL 9.3, so would it be possible to create a materialized view and index that? I haven't personally used materialized views and I am unfamiliar with them – samol Jan 02 '14 at 00:12
  • BTW : `The below SQL Query is very slow:` is not a fact. Not without a query plan + the tuning + the popcount of the tables. Throwing materialised views at it will only get you more confused, IMHO. – wildplasser Jan 02 '14 at 00:23
  • ok. Fair enough. This is for legacy reasons and I really do need to be able to query them often – samol Jan 02 '14 at 00:25
  • I took the liberty to fix the query to match your table definition according to your question. – Erwin Brandstetter Jan 02 '14 at 00:39
  • Looks better with your fix. Thanks – samol Jan 02 '14 at 00:42
  • @wildplasser I wanted to get your opinion on this answer. It looks good to me but what do you think as you didn't seem to like the materialized view approach. http://stackoverflow.com/questions/20874569/index-to-find-records-where-the-foreign-key-does-not-exist/20875013#20875013 – samol Jan 02 '14 at 01:49
  • Erwin's answer looks fine to me. Personally, I think that you should rethink your data model: create an orphanage (like in Erwin's final remarks), or add a few triggers and make it more state-driven. – wildplasser Jan 02 '14 at 09:01

2 Answers


The best I can think of is your last idea in the comments: a MATERIALIZED VIEW:

CREATE MATERIALIZED VIEW orphaned_products AS
SELECT *  -- or just the columns you need
FROM   products p
WHERE  NOT EXISTS (SELECT FROM transactions t WHERE t.product_id = p.id);

Then you can use this table (a materialized view is just a special table) as a drop-in replacement for the big table products in queries working with orphaned products - with obviously great impact on performance (a few hundred rows instead of 100 million). Materialized views require Postgres 9.3, but that's what you are using according to the comments. You can implement one by hand easily in older versions.

However, a materialized view is a snapshot and not updated dynamically. This might void any performance benefit. To update, you run the (expensive) operation:

REFRESH MATERIALIZED VIEW orphaned_products;

You could do that at strategically opportune points in time and have multiple subsequent queries benefit from it, depending on your requirements.

Of course, you would have an index on orphaned_products.id, but that would not be important for a small table of a few hundred rows.
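Such an index could look like this (the index name is made up):

```sql
-- UNIQUE is safe here: id copies products.id, the primary key.
CREATE UNIQUE INDEX orphaned_products_id_idx ON orphaned_products (id);
```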

If your model is such that transactions are never deleted, you could exploit that to great effect. Create a similar table by hand:

CREATE TABLE orphaned_products2 AS
SELECT *  -- or just the columns you need
FROM   products p
WHERE  NOT EXISTS (SELECT FROM transactions t WHERE t.product_id = p.id);

You can refresh that "materialized view" just like the first one by truncating and refilling it. But the point is to avoid the expensive operation. All you actually need is:

  • Add new products to orphaned_products2.
    Implement with a trigger AFTER INSERT ON products.

  • Remove products from orphaned_products2 as soon as a referencing row appears in table transactions.
    Implement with a trigger AFTER UPDATE OF product_id ON transactions. Only if your model allows transactions.product_id to be updated - which seems odd for transactions.
    And another one AFTER INSERT ON transactions.

All comparatively cheap operations.
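A minimal, untested sketch of the two insert triggers, assuming the table definitions above (function and trigger names are made up):

```sql
-- A brand-new product has no transactions yet, so it starts out orphaned:
-- copy it into the side table.
CREATE FUNCTION orphan_add() RETURNS trigger AS
$$
BEGIN
   INSERT INTO orphaned_products2 SELECT (NEW).*;
   RETURN NULL;  -- return value is ignored for AFTER triggers
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_after_insert
AFTER INSERT ON products
FOR EACH ROW EXECUTE PROCEDURE orphan_add();

-- First transaction referencing a product: it is no longer orphaned.
CREATE FUNCTION orphan_remove() RETURNS trigger AS
$$
BEGIN
   DELETE FROM orphaned_products2 WHERE id = NEW.product_id;
   RETURN NULL;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_after_insert
AFTER INSERT ON transactions
FOR EACH ROW EXECUTE PROCEDURE orphan_remove();
```

An additional AFTER UPDATE OF product_id trigger would follow the same pattern, if such updates can happen at all.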

If transactions can be deleted, you'd need another trigger to add orphaned products AFTER DELETE ON transactions - which would be a bit more expensive. For every deleted transaction you need to check whether it was the last one referencing the related product, and add an orphan in that case. That may still be a lot cheaper than refreshing the whole materialized view.

VACUUM

After your additional information I would also suggest custom settings for aggressive vacuuming of orphaned_products2, since it is going to produce a lot of dead rows.
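For instance, with per-table storage parameters (the thresholds are illustrative and need tuning):

```sql
-- Vacuum as soon as ~100 dead rows accumulate, instead of the
-- default scale-factor-based threshold sized for big tables.
ALTER TABLE orphaned_products2 SET (
   autovacuum_vacuum_scale_factor = 0.0,
   autovacuum_vacuum_threshold    = 100
);
```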

Erwin Brandstetter
  • Hey Erwin, This is excellent. There are a couple of things that I can confirm. Transactions are never deleted and only added. Products are also never deleted but added at a rate of 100s per minute (obviously this is a made-up example behind a much more complex actual use case). So I need it refreshed and to always know what the current orphaned products are. – samol Jan 02 '14 at 01:31
  • Your answer looks excellent btw, I am not familiar with materialized views so I am still digesting through your answer – samol Jan 02 '14 at 01:32
  • @wildplasser seemed not to be a fan of the materialized view idea. I wonder what he thinks of this approach. It looks good to me as the materialized view table is small – samol Jan 02 '14 at 01:48
  • @alumns: The triggers make write operations more expensive. You should take care to make them as effective as possible, and you need to evaluate whether the resulting cost is acceptable. And the additional table introduces additional complexity making rooms for bugs and possibly weakening the strictness of referential integrity. It all depends on the details .. – Erwin Brandstetter Jan 02 '14 at 01:54
  • Thanks for the detailed answer. I am doing about 100 writes per minute. So slightly slower writes are actually fine! Your solution is really good. I am going to try it and see whether it works out well – samol Jan 02 '14 at 07:49
  • Hey Erwin, would you be able to expand your answer a little bit more on why aggressive vacuuming is necessary to set up in this case? – samol Jan 02 '14 at 08:26
  • @alumns: Since you are going to enter around 100 rows per minute and delete most of them soon thereafter, you will have a lot of dead row versions bloating the table. To remove bloat, you need to run VACUUM, probably more often than the standard autovacuum daemon would schedule it. [More in the manual here](http://www.postgresql.org/docs/current/interactive/routine-vacuuming.html). – Erwin Brandstetter Jan 02 '14 at 15:33

I have given it a go with some test data and found a way which I think is faster, using the EXCEPT operator.

Following are my findings.

Test Data

CREATE TABLE TestTable_1 (
    ID INT IDENTITY PRIMARY KEY,
    Column1 INT,
    Column2 INT
);
GO

CREATE TABLE TestTable_2 (
    ID INT IDENTITY PRIMARY KEY,
    Column1 INT,
    Column2 INT,
    FK_ID INT REFERENCES TestTable_1(ID)
);
GO

DECLARE @i INT = 1

WHILE (@i <= 10000)
 BEGIN
   INSERT INTO TestTable_1 (Column1, Column2)
   VALUES (@i , @i + 100)
   SET @i = @i + 1;
 END

 DECLARE @i2 INT = 1

WHILE (@i2 <= 10000)
 BEGIN
   INSERT INTO TestTable_2 (Column1, Column2, FK_ID)
   VALUES (@i2 , @i2 + 100, 1 + CONVERT(INT, (10000-1+1)*RAND()))
   SET @i2 = @i2 + 1;
 END

 UPDATE  TestTable_2
 SET FK_ID = NULL
 WHERE ID IN (SELECT TOP 10 ID FROM TestTable_2 ORDER BY NEWID())

Filtered Index on Table Two

CREATE NONCLUSTERED INDEX FIX_FK_ID
ON TestTable_2(ID, FK_ID)
WHERE FK_ID IS NULL ;
GO

Query 1

SET STATISTICS IO ON;
PRINT 'TEST 1'
SELECT T1.*
FROM   TestTable_1 T1
       LEFT JOIN TestTable_2 T2 ON T1.ID = T2.FK_ID
WHERE  T2.FK_ID IS NOT NULL

Query 2

PRINT 'TEST 2'
SELECT ID,  Column1,    Column2 FROM TestTable_1
EXCEPT 
SELECT ID,  Column1,    Column2 FROM TestTable_2
WHERE FK_ID IS NULL

TEST 1

(9990 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestTable_1'. Scan count 1, logical reads 28, physical reads 0, read-ahead reads 19, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestTable_2'. Scan count 1, logical reads 33, physical reads 3, read-ahead reads 29, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


TEST 2

(9990 row(s) affected)
Table 'TestTable_1'. Scan count 1, logical reads 28, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'TestTable_2'. Scan count 1, logical reads 22, physical reads 1, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The Execution Plans for Both Queries

(execution plan screenshots not reproduced)

M.Ali
  • Looks like SQL Server code, which is not fully applicable to Postgres. Did you see the `[postgresql]` tag? – Erwin Brandstetter Jan 02 '14 at 01:40
  • @ErwinBrandstetter [facepalm] missed that :( – M.Ali Jan 02 '14 at 01:42
  • @ErwinBrandstetter but do you reckon my findings are acceptable or even worthy of looking at it ?? I am not really a DBA guy but recently have been trying to learn more about Performance optimization. Please I would appreciate some feed back thank you. – M.Ali Jan 02 '14 at 01:45
  • I am not as familiar with SQL Server code, so I can hardly comment on that. `EXCEPT` wouldn't buy you much in Postgres, where `NOT EXISTS` typically performs faster. Out of time now, too. – Erwin Brandstetter Jan 02 '14 at 02:01
  • @ErwinBrandstetter My guess is that `EXCEPT ...` and `NOT EXISTS (...)` would yield the same or similar query plans. – wildplasser Jan 02 '14 at 15:16
  • If you are going the Materialized View route, my question and answer on this may be relevant - http://postgresql.1045698.n5.nabble.com/Refresh-Materialized-View-td5784982.html – Jayadevan Jan 03 '14 at 04:20