0

I have a table with many rows and I'd like to speed up my query. This is the original query:

cursor.execute("""select id, data
                        from events
                        where processed IS NULL and instr(data, 'Captain') = 0;""")

I thought I could speed things up by first selecting only the unprocessed rows and then searching the text of the data variable. But this is also slow.

    cursor.execute("""select x.id, x.data
                    from
                    (select a.id, a.data from events as a where a.processed IS NULL) as x
                    where instr(x.data, 'Captain') = 0;""")

Any thoughts on how I could filter the processed column for NULL before searching through data?

myname
  • 1,337
  • 2
  • 11
  • 17
  • Another route you could use is to add an index to the data column, and switch `instr` to `like`. That would make filtering on the data column faster. See also: https://stackoverflow.com/questions/2451608/which-is-faster-instr-or-like/22171511 – Nick ODell Dec 02 '21 at 18:42
  • *But this is also slow* Moreover, is is slower than initial variant - simply you cannot detect this difference. – Akina Dec 02 '21 at 18:46
  • @juergend OP needs in rows where `'Captain'` substring is NOT present. – Akina Dec 02 '21 at 18:47
  • If the table contains many columns then the index by `(processed, data, id)` will help. If the percent of the rows with `processed IS NULL` is low (5% or less) then the index by `(processed)` will help (but it is possible that you'd force its usage). In another cases I do not see the way to improvement. – Akina Dec 02 '21 at 18:50

1 Answers1

0

Perhaps instr(data, 'Captain') could be changed to data NOT LIKE '%Captain%'? Either formulation has to read every row, so I doubt if there would be a speed difference.

Also consider FULLTEXT(data) and NOT MATCH(data) AGAINST('+Captain' IN BOOLEAN MODE). FULLTEXT may not help because of the NOT.

As for your thought on a subquery, I think it won't help. After all that is essentially what is going on.

What might matter is the indexes you have and the cardinality of them. If processed is rarely NULL (say, less than 10% of the rows), then INDEX(processed) will speed things up.

Rick James
  • 135,179
  • 13
  • 127
  • 222