This post has been totally rephrased in order to make the question more understandable.
Settings
PostgreSQL 9.5 running on Ubuntu Server 14.04 LTS.
Data model
I have dataset tables in which I store time series separately; all of these tables share the same structure:
CREATE TABLE IF NOT EXISTS %s(
Id SERIAL NOT NULL,
ChannelId INTEGER NOT NULL,
GranulityIdIn INTEGER,
GranulityId INTEGER NOT NULL,
TimeValue TIMESTAMP NOT NULL,
FloatValue FLOAT DEFAULT(NULL),
Status BIGINT DEFAULT(NULL),
QualityCodeId INTEGER NOT NULL,
DataArray FLOAT[] DEFAULT(NULL),
DataCount BIGINT DEFAULT(NULL),
Performance FLOAT DEFAULT(NULL),
StepCount INTEGER NOT NULL DEFAULT(0),
TableRegClass regclass NOT NULL,
Updated TIMESTAMP NOT NULL,
Tags TEXT[] DEFAULT(NULL),
--
CONSTRAINT PK_%s PRIMARY KEY(Id),
CONSTRAINT FK_%s_Channel FOREIGN KEY(ChannelId) REFERENCES scientific.Channel(Id),
CONSTRAINT FK_%s_GranulityIn FOREIGN KEY(GranulityIdIn) REFERENCES quality.Granulity(Id),
CONSTRAINT FK_%s_Granulity FOREIGN KEY(GranulityId) REFERENCES quality.Granulity(Id),
CONSTRAINT FK_%s_QualityCode FOREIGN KEY(QualityCodeId) REFERENCES quality.QualityCode(Id),
CONSTRAINT UQ_%s UNIQUE(QualityCodeId, ChannelId, GranulityId, TimeValue)
);
CREATE INDEX IDX_%s_Channel ON %s USING btree(ChannelId);
CREATE INDEX IDX_%s_Quality ON %s USING btree(QualityCodeId);
CREATE INDEX IDX_%s_Granulity ON %s USING btree(GranulityId) WHERE GranulityId > 2;
CREATE INDEX IDX_%s_TimeValue ON %s USING btree(TimeValue);
This definition comes from a FUNCTION, thus %s stands for the dataset name.
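For reference, here is a minimal sketch of what such a function can look like (the name datastore.create_dataset is hypothetical and the column list is abridged; the real function emits the full definition above):
CREATE OR REPLACE FUNCTION datastore.create_dataset(dsname TEXT)
RETURNS VOID AS $$
BEGIN
    -- %s in the template above is substituted with the dataset name:
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %s(
            Id SERIAL NOT NULL,
            ChannelId INTEGER NOT NULL,
            TimeValue TIMESTAMP NOT NULL,
            -- ... remaining columns and constraints as in the definition above ...
            CONSTRAINT PK_%s PRIMARY KEY(Id)
        )', dsname, dsname);
END;
$$ LANGUAGE plpgsql;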
The UNIQUE constraint ensures that there cannot be duplicate records within a given dataset. A record in a dataset is a value (floatvalue) for a given channel (channelid), sampled at a given time (timevalue) over a given interval (granulityid), with a given quality (qualitycodeid). Whatever the value is, there cannot be a duplicate of (channelid, timevalue, granulityid, qualitycodeid).
Records in a dataset look like:
1;25;;1;"2015-01-01 00:00:00";0.54;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
2;25;;1;"2015-01-01 00:30:00";0.49;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
3;25;;1;"2015-01-01 01:00:00";0.47;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
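For instance, trying to insert a second row with the same key as the first sample row is rejected by the UNIQUE constraint, even if the value differs (the non-key values below are made up):
-- Same (ChannelId, GranulityId, TimeValue, QualityCodeId) as sample row 1,
-- different FloatValue: rejected by UQ_rtu.
INSERT INTO datastore.rtu(ChannelId, GranulityId, TimeValue, QualityCodeId,
                          FloatValue, TableRegClass, Updated)
VALUES (25, 1, '2015-01-01 00:00:00', 6, 0.99, 'datastore.rtu', now());
-- expected to fail with: duplicate key value violates unique constraint "uq_rtu"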
I also have a satellite table where I store the significant digit for each channel; this parameter can change over time. I store it in the following way:
CREATE TABLE SVPOLFactor (
Id SERIAL NOT NULL,
ChannelId INTEGER NOT NULL,
StartTimestamp TIMESTAMP NOT NULL,
Factor FLOAT NOT NULL,
UnitsId VARCHAR(8) NOT NULL,
--
CONSTRAINT PK_SVPOLFactor PRIMARY KEY(Id),
CONSTRAINT FK_SVPOLFactor_Units FOREIGN KEY(UnitsId) REFERENCES Units(Id),
CONSTRAINT UQ_SVPOLFactor UNIQUE(ChannelId, StartTimestamp)
);
When a significant digit is defined for a channel, a row is added to this table, and the factor applies from that date onward. The first record for a channel always has the sentinel value '-infinity'::TIMESTAMP, which means the factor applies from the beginning. Subsequent rows must have a real timestamp. If there is no row for a given channel, the significant digit is unitary.
Records in this table look like:
123;277;"-infinity";0.1;"_C"
124;1001;"-infinity";0.01;"-"
125;1001;"2014-03-01 00:00:00";0.1;"-"
126;1001;"2014-06-01 00:00:00";1;"-"
127;1001;"2014-09-01 00:00:00";10;"-"
5001;5181;"-infinity";0.1;"ug/m3"
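To make the lookup rule concrete, here is a small sketch (separate from the audit query) that fetches the factor in effect for one channel at one timestamp; with the sample rows above it returns 1 for channel 1001 on 2014-07-15:
-- The most recent StartTimestamp not later than the given time wins;
-- if no row exists for the channel, the factor is taken as unitary.
SELECT COALESCE(
    (SELECT SF.Factor
     FROM settings.SVPOLFactor AS SF
     WHERE SF.ChannelId = 1001
       AND SF.StartTimestamp <= TIMESTAMP '2014-07-15 00:00:00'
     ORDER BY SF.StartTimestamp DESC
     LIMIT 1),
    1) AS factor;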
Goal
My goal is to perform a comparison audit of two datasets that have been populated by distinct processes. To achieve this, I must:
- Compare records between the two datasets and assess their differences;
- Check whether the difference between similar records stays within the significant digit.
For this purpose, I have written the following query, which behaves in a way I don't understand:
WITH
-- Join matching records (with regard to the uniqueness constraint) from the templated datastore tables in order to make the audit comparison:
S0 AS (
SELECT
A.ChannelId
,A.GranulityIdIn AS gidInRef
,B.GranulityIdIn AS gidInAudit
,A.GranulityId AS GranulityId
,A.QualityCodeId
,A.TimeValue
,A.FloatValue AS xRef
,B.FloatValue AS xAudit
,A.StepCount AS scRef
,B.StepCount AS scAudit
,A.DataCount AS dcRef
,B.DataCount AS dcAudit
,round(A.Performance::NUMERIC, 4) AS pRef
,round(B.Performance::NUMERIC, 4) AS pAudit
FROM
datastore.rtu AS A JOIN datastore.audit0 AS B USING(ChannelId, GranulityId, QualityCodeId, TimeValue)
),
-- Join the SVPOL factors to the previous result in order to determine the decimal factor applied to each record:
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
),
-- Audit computation:
S2 AS (
SELECT
S1.*
,xaudit - xref AS dx
,(xaudit - xref)/NULLIF(xref, 0) AS rdx
,round(xaudit*pow(10, k))*pow(10, -k) AS xroundfloat
,round(xaudit::NUMERIC, k) AS xroundnum
,0.5*pow(10, -k) AS epsilon
FROM S1
)
SELECT
*
,ABS(dx) AS absdx
,ABS(rdx) AS absrdx
,(xroundfloat - xref) AS dxroundfloat
,(xroundnum - xref) AS dxroundnum
,(ABS(dx) - epsilon) AS dxeps
,(ABS(dx) - epsilon)/epsilon AS rdxeps
,(xroundfloat - xroundnum) AS dfround
FROM
S2
ORDER BY
k DESC
,ABS(rdx) DESC
,ChannelId;
The query may be somewhat unreadable; roughly, I expect it to:
- Join data from the two datasets using the uniqueness constraint, so that similar records can be compared and their differences computed (S0);
- For each difference, find the significant digit (LEFT JOIN) that applies at the record's timestamp (S1);
- Compute some other useful statistics (S2 and the final SELECT).
Problem
When I run the query above, rows are missing. For example: channelid=123 with granulityid=4 has 12 records in common in both tables (datastore.rtu and datastore.audit0). When I run the whole query and store the result in a MATERIALIZED VIEW, there are fewer than 12 such rows. So I started investigating why records were missing, and I ran into strange behaviour with the WHERE clause. If I run EXPLAIN ANALYZE on this query, I get:
"Sort (cost=332212.76..332212.77 rows=1 width=232) (actual time=6042.736..6157.235 rows=61692 loops=1)"
" Sort Key: s2.k DESC, (abs(s2.rdx)) DESC, s2.channelid"
" Sort Method: external merge Disk: 10688kB"
" CTE s0"
" -> Merge Join (cost=0.85..332208.25 rows=1 width=84) (actual time=20.408..3894.071 rows=63635 loops=1)"
" Merge Cond: ((a.qualitycodeid = b.qualitycodeid) AND (a.channelid = b.channelid) AND (a.granulityid = b.granulityid) AND (a.timevalue = b.timevalue))"
" -> Index Scan using uq_rtu on rtu a (cost=0.43..289906.29 rows=3101628 width=52) (actual time=0.059..2467.145 rows=3102319 loops=1)"
" -> Index Scan using uq_audit0 on audit0 b (cost=0.42..10305.46 rows=98020 width=52) (actual time=0.049..108.138 rows=98020 loops=1)"
" CTE s1"
" -> Unique (cost=4.37..4.38 rows=1 width=148) (actual time=4445.865..4509.839 rows=61692 loops=1)"
" -> Sort (cost=4.37..4.38 rows=1 width=148) (actual time=4445.863..4471.002 rows=63635 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: external merge Disk: 5624kB"
" -> Hash Right Join (cost=0.03..4.36 rows=1 width=148) (actual time=4102.842..4277.641 rows=63635 loops=1)"
" Hash Cond: (sf.channelid = s0.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.013..0.083 rows=168 loops=1)"
" -> Hash (cost=0.02..0.02 rows=1 width=132) (actual time=4102.002..4102.002 rows=63635 loops=1)"
" Buckets: 65536 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3841kB"
" -> CTE Scan on s0 (cost=0.00..0.02 rows=1 width=132) (actual time=20.413..4038.078 rows=63635 loops=1)"
" CTE s2"
" -> CTE Scan on s1 (cost=0.00..0.07 rows=1 width=168) (actual time=4445.910..4972.832 rows=61692 loops=1)"
" -> CTE Scan on s2 (cost=0.00..0.05 rows=1 width=232) (actual time=4445.934..5312.884 rows=61692 loops=1)"
"Planning time: 1.782 ms"
"Execution time: 6201.148 ms"
And I know that I should get 67106 rows instead.
At the time of writing, I know that S0 returns the correct number of rows. Therefore the problem must lie in a later CTE.
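For reference, this is roughly how I compare the counts for the channelid = 123 / granulityid = 4 example above (audit_result is a hypothetical name for the materialized view built from the query):
-- Matches straight from the base tables: 12 rows.
SELECT count(*)
FROM datastore.rtu AS A
JOIN datastore.audit0 AS B
  USING (ChannelId, GranulityId, QualityCodeId, TimeValue)
WHERE ChannelId = 123 AND GranulityId = 4;

-- Rows that end up in the materialized view: fewer than 12.
SELECT count(*)
FROM audit_result
WHERE ChannelId = 123 AND GranulityId = 4;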
What I find really strange is that the following query:
EXPLAIN ANALYZE
WITH
S0 AS (
SELECT * FROM datastore.audit0
),
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
)
SELECT * FROM S1 WHERE Channelid=123 AND GranulityId=4 -- POST-FILTERING
returns 10 rows:
"CTE Scan on s1 (cost=24554.34..24799.39 rows=1 width=196) (actual time=686.211..822.803 rows=10 loops=1)"
" Filter: ((channelid = 123) AND (granulityid = 4))"
" Rows Removed by Filter: 94890"
" CTE s0"
" -> Seq Scan on audit0 (cost=0.00..2603.20 rows=98020 width=160) (actual time=0.009..26.092 rows=98020 loops=1)"
" CTE s1"
" -> Unique (cost=21215.99..21951.14 rows=9802 width=176) (actual time=590.337..705.070 rows=94900 loops=1)"
" -> Sort (cost=21215.99..21461.04 rows=98020 width=176) (actual time=590.335..665.152 rows=99151 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: external merge Disk: 12376kB"
" -> Hash Left Join (cost=5.78..4710.74 rows=98020 width=176) (actual time=0.143..346.949 rows=99151 loops=1)"
" Hash Cond: (s0.channelid = sf.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> CTE Scan on s0 (cost=0.00..1960.40 rows=98020 width=160) (actual time=0.012..116.543 rows=98020 loops=1)"
" -> Hash (cost=3.68..3.68 rows=168 width=20) (actual time=0.096..0.096 rows=168 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 12kB"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.006..0.045 rows=168 loops=1)"
"Planning time: 0.385 ms"
"Execution time: 846.179 ms"
And the next one returns the correct number of rows:
EXPLAIN ANALYZE
WITH
S0 AS (
SELECT * FROM datastore.audit0
WHERE Channelid=123 AND GranulityId=4 -- PRE FILTERING
),
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
)
SELECT * FROM S1
which yields the following plan:
"CTE Scan on s1 (cost=133.62..133.86 rows=12 width=196) (actual time=0.580..0.598 rows=12 loops=1)"
" CTE s0"
" -> Bitmap Heap Scan on audit0 (cost=83.26..128.35 rows=12 width=160) (actual time=0.401..0.423 rows=12 loops=1)"
" Recheck Cond: ((channelid = 123) AND (granulityid = 4))"
" Heap Blocks: exact=12"
" -> BitmapAnd (cost=83.26..83.26 rows=12 width=0) (actual time=0.394..0.394 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_audit0_channel (cost=0.00..11.12 rows=377 width=0) (actual time=0.055..0.055 rows=377 loops=1)"
" Index Cond: (channelid = 123)"
" -> Bitmap Index Scan on idx_audit0_granulity (cost=0.00..71.89 rows=3146 width=0) (actual time=0.331..0.331 rows=3120 loops=1)"
" Index Cond: (granulityid = 4)"
" CTE s1"
" -> Unique (cost=5.19..5.28 rows=12 width=176) (actual time=0.576..0.581 rows=12 loops=1)"
" -> Sort (cost=5.19..5.22 rows=12 width=176) (actual time=0.576..0.576 rows=12 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: quicksort Memory: 20kB"
" -> Hash Right Join (cost=0.39..4.97 rows=12 width=176) (actual time=0.522..0.552 rows=12 loops=1)"
" Hash Cond: (sf.channelid = s0.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.006..0.022 rows=168 loops=1)"
" -> Hash (cost=0.24..0.24 rows=12 width=160) (actual time=0.446..0.446 rows=12 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 6kB"
" -> CTE Scan on s0 (cost=0.00..0.24 rows=12 width=160) (actual time=0.403..0.432 rows=12 loops=1)"
"Planning time: 0.448 ms"
"Execution time: 4.510 ms"
Thus the problem seems to lie in S1. There is no significant digit defined for channelid = 123, so those records would not appear at all without the LEFT JOIN. But this does not explain why some rows are missing.
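A sanity check I can think of (just a sketch, I have not drawn conclusions from it yet) is to compare, within S0, the total row count with the number of distinct (ChannelId, TimeValue) pairs, since that is the key the DISTINCT ON in S1 operates on:
-- If several joined rows share (ChannelId, TimeValue) but differ in
-- GranulityId or QualityCodeId, DISTINCT ON(ChannelId, TimeValue) in S1
-- keeps only one row per pair.
WITH S0 AS (
    SELECT A.ChannelId, A.GranulityId, A.QualityCodeId, A.TimeValue
    FROM datastore.rtu AS A
    JOIN datastore.audit0 AS B
      USING (ChannelId, GranulityId, QualityCodeId, TimeValue)
)
SELECT count(*)                               AS joined_rows,
       count(DISTINCT (ChannelId, TimeValue)) AS distinct_pairs
FROM S0;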
Questions
- What am I doing wrong in this query? I use a LEFT JOIN in order to keep the correct cardinality when I fetch the significant digits, so it should not remove any records; after that it is just arithmetic.
- How can pre-filtering return more rows than post-filtering? This sounds a little buggy to me. If I do not use WHERE clauses, all records (or combinations) are generated (I know that a JOIN is effectively a WHERE clause) and then the computation occurs. When I do not add an extra WHERE clause (the original query), I am missing rows (as shown in the examples). When I add a WHERE clause to filter, the results are different (which might be acceptable if post-filtering had returned more records than pre-filtering, not fewer).
Any constructive answer that points out my mistakes and my misunderstanding of the query is welcome. Thank you.