Strange behavior using not exists and not in queries with postgres and mysql foreign data wrapper

Question

I'm experimenting with some MD5 hashing and select queries and came across some strange behavior I can't explain.

I have foreign tables on an MySQL server with the following structures (using mysql_fdw extension):

CREATE TABLE `ANT_Tags` (
  `TmStamp` datetime(6) NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
  `RecNum` bigint(20) NOT NULL DEFAULT '0',
  `TagLoc` int(11) DEFAULT NULL,
  `InStringTag` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`TmStamp`,`RecNum`)
)

CREATE TABLE `CRU_Tags` (
  `TmStamp` datetime(6) NOT NULL DEFAULT '0000-00-00 00:00:00.000000',
  `RecNum` bigint(20) NOT NULL DEFAULT '0',
  `Tag` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`TmStamp`,`RecNum`)
)

The only difference between the tables is one table has tagloc (int) and instringtag (text) and the second just has tag(text). On the Postgres server, I have the table:

CREATE TABLE public.processed_tags
(
    id integer NOT NULL DEFAULT nextval('processed_tags_id_seq'::regclass),
    insert_dt timestamp without time zone NOT NULL DEFAULT now(),
    lndb_table character varying(50) COLLATE pg_catalog."default",
    md5_hash character varying(32) COLLATE pg_catalog."default",
    CONSTRAINT processed_tags_pkey PRIMARY KEY (id)
)

I am using a CTE statement to query and process records from these tables (grab information, write it to another table and log what has been processed via the hash) when I was having some performance issues. Doing some investigating, I ran into a strange issue with using not in versus not exists. I've narrowed it down to the select statement.

A little background: I was referring to this stackoverflow for performance tuning queries: PostgreSQL query is slow when using NOT IN

Here is the strange behavior I am hoping someone can explain. This query ran on the first table above, runs forever (well not sure how long, I canceled it after 10-15 minutes):

        select 
            now(),
            'ant_tags' as lndb_table, 
            MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags')) as md5_hash, 
            tmstamp, 
            recnum, 
            tagloc,
            instringtag
        from ant_tags 
        where tmstamp > '2009-01-01' and tmstamp <= now()
        and MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags'))
        not in (select md5_hash from processed_tags where lndb_table = 'ant_tags')

Using the hints from the stackoverflow article above, I changed it to:

            select 
                now(),
                'ant_tags' as lndb_table, 
                MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags')) as md5_hash, 
                tmstamp, 
                recnum, 
                instringtag
            from ant_tags 
                and not exists (
                select 1
                from processed_tags
                where lndb_table = 'ant_tags' and md5_hash = MD5(MD5(to_char(tmstamp, 'MM-DD-YYYY HH24:MI:SS')) || MD5(recnum::text) || MD5('ant_tags'))
)

This query runs fast and returns within seconds.

The strangeness is if I run the same queries on the other table, replacing tag for tagloc and instringtag in the select columns, I have the exact opposite issue. The not exists query takes forever and the not in query returns in seconds.

select 
    ...
    tag

versus

select 
    ...
    tagloc,
    instringtag

I have other tables with the same structures where I am seeing the same results. The tables with tagloc and instringtag work well using not exists but not with not in and the second table structure works well with not in and is slow with not exists.

Notes: There are approximately 100K row being selected/processed.

example instringtag: 01 03 01/02/2018 23:11:51.200 3E7.0000001D03

example tag: TAG: 01 03 01/02/2018 23:11:51.200 3E7.0000001D03

(the only real difference being the "TAG:")

How many of each type of thing are there? A huge list takes a huge amount of time, short lists are faster. I would advise against using `in` and `not in` around query results - use joins instead. — Bohemian, Jan 25 '18 at 22:11
@Bohemian: about 100K rows per each. Joins may work, but my real question is why is there a difference between not in and not exists for two almost identical tables. — bah, Jan 25 '18 at 23:11

Strange behavior using not exists and not in queries with postgres and mysql foreign data wrapper

0 Answers0