I'm experimenting with some MD5 hashing and select queries and came across some strange behavior I can't explain.
I have foreign tables on an MySQL server with the following structures (using mysql_fdw extension):
CREATE TABLE `ANT_Tags` (
`TmStamp` datetime(6) NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
`RecNum` bigint(20) NOT NULL DEFAULT '0',
`TagLoc` int(11) DEFAULT NULL,
`InStringTag` text COLLATE utf8_unicode_ci,
PRIMARY KEY (`TmStamp`,`RecNum`)
)
CREATE TABLE `CRU_Tags` (
`TmStamp` datetime(6) NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
`RecNum` bigint(20) NOT NULL DEFAULT '0',
`Tag` text COLLATE utf8_unicode_ci,
PRIMARY KEY (`TmStamp`,`RecNum`)
)
The only difference between the tables is one table has tagloc (int) and instringtag (text) and the second just has tag(text). On the Postgres server, I have the table:
CREATE TABLE public.processed_tags
(
id integer NOT NULL DEFAULT nextval('processed_tags_id_seq'::regclass),
insert_dt timestamp without time zone NOT NULL DEFAULT now(),
lndb_table character varying(50) COLLATE pg_catalog."default",
md5_hash character varying(32) COLLATE pg_catalog."default",
CONSTRAINT processed_tags_pkey PRIMARY KEY (id)
)
I am using a CTE statement to query and process records from these tables (grab information, write it to another table and log what has been processed via the hash) when I was having some performance issues. Doing some investigating, I ran into a strange issue with using not in versus not exists. I've narrowed it down to the select statement.
A little background: I was referring to this stackoverflow for performance tuning queries: PostgreSQL query is slow when using NOT IN
Here is the strange behavior I am hoping someone can explain. This query ran on the first table above, runs forever (well not sure how long, I canceled it after 10-15 minutes):
select
now(),
'ant_tags' as lndb_table,
MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags')) as md5_hash,
tmstamp,
recnum,
tagloc,
instringtag
from ant_tags
where tmstamp > '2009-01-01' and tmstamp <= now()
and MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags'))
not in (select md5_hash from processed_tags where lndb_table = 'ant_tags')
Using the hints from the stackoverflow article above, I changed it to:
select
now(),
'ant_tags' as lndb_table,
MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags')) as md5_hash,
tmstamp,
recnum,
instringtag
from ant_tags
and not exists (
select 1
from processed_tags
where lndb_table = 'ant_tags' and md5_hash = MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags'))
)
This query runs fast and returns within seconds.
The strangeness is if I run the same queries on the other table, replacing tag for tagloc and instringtag in the select columns, I have the exact opposite issue. The not exists query takes forever and the not in query returns in seconds.
select
...
tag
versus
select
...
tagloc,
instringtag
I have other tables with the same structures where I am seeing the same results. The tables with tagloc and instringtag work well using not exists but not with not in and the second table structure works well with not in and is slow with not exists.
Notes: There are approximately 100K row being selected/processed.
example instringtag: 01 03 01/02/2018 23:11:51.200 3E7.0000001D03
example tag: TAG: 01 03 01/02/2018 23:11:51.200 3E7.0000001D03
(the only real difference being the "TAG:")