0

I have a table baseTable with columns id and dateCol. I want to create several tables where I group by a function of dateCol.

My first attempt at this was

CREATE derivedTable ...

INSERT INTO derivedTable
SELECT id, myFunc(dateCol) AS datePrime, ...
FROM baseTable
GROUP BY id, datePrime

This worked! But it's pretty slow. My guess is that it's slow because there's no index on the datePrime column. I thought I'd just add a quick index on datePrime to validate this theory, but this turned out to be much harder than I was expecting.

My first instinct was to add a datePrime column to baseTable with

ALTER TABLE baseTable ADD COLUMN datePrime INT GENERATED ALWAYS AS (myFunc(dateCol));

Then I could add an index on that new column:

CREATE INDEX idx on baseTable(datePrime);

and finally, (hopefully) make derivedTable quicker:

CREATE derivedTable ...

INSERT INTO derivedTable
SELECT id, datePrime, ...
FROM baseTable
GROUP BY id, datePrime

However, MySQL doesn't allow you to use stored functions to derive a column (link):

Generated column expressions must adhere to the following rules... Stored functions and loadable functions are not permitted.

The only thing I can think of right now to create a new table baseTablePrime which contains all the columns from baseTable plus a new datePrime column created as

CREATE TABLE baseTablePrime AS SELECT id, dateCol, myFunc(dateCol), ... FROM baseTable

I think this would work, but then I have baseTable and a slightly augmented baseTablePrime, which seems wasteful and inelegant.

Before charging ahead with what seems like at best a hack, I'm asking for help on how I should be approaching this problem. I feel like I may be attempting an SQL anti-pattern somehow. Let's take as a given that I would like to create an index on a stored column which is derived from another column using a user-defined function. What's the cleanest way to do something like this?

crockeea
  • 21,651
  • 10
  • 48
  • 101
  • The slowness is probably because it's scanning all your table's rows. Even if you could create an index on the datePrime column, I don't think it would improve the performance of a GROUP BY by much. It would only turn the full table-scan into full index-scan (i.e. examining 100% of the entries in an index). – Bill Karwin Sep 09 '22 at 04:06
  • It's not clear why you are using GROUP BY at all, since you include the `id` column as a grouping column, and I assume `id` is the primary key and therefore is unique. Unless your example is simplified from your real query. – Bill Karwin Sep 09 '22 at 04:07
  • It's a simplified example; id is actually not unique/a primary key. – crockeea Sep 09 '22 at 04:13
  • 1
    So you're asking for help optimizing a query, but you haven't shared the _actual_ query, and you also haven't share the table definition? – Bill Karwin Sep 09 '22 at 04:20
  • What does your myfunc look like? If it is a simple function, you can just use the term itself as the generator (e.g. if myfunc returns 2*a , you can use `as (2*col)` instead of `as (myfunc(col)`). Also: do you need this often or is it e.g. a way to make a one-time report/analysis/... a bit faster? In this case, using the baseTablePrime-copy (or just accepting that it takes some time) is probably better than modifying the original table. And if it's a regular thing, you should probably still not do it that way, as you are denormalizing your data (but for alternatives, we'd need more details). – Solarflare Sep 09 '22 at 11:51
  • @Solarflare It's long, but it is composed of deterministic bits: `((MONTH(`OBSERVATION DATE`)-1)*4 + CEILING(4 * CAST(DAY(`OBSERVATION DATE`) as FLOAT) / CAST(DAY(LAST_DAY(`OBSERVATION DATE`)) as FLOAT)))` I hadn't considered inlining it into an `ALTER TABLE` statement as a result. But that _does_ work! The question, however, is _is it idiomatic_? If this were a standard programming language, I would always factor something this large and self-contained out into its own function, which was my instinct here. – crockeea Sep 09 '22 at 14:19
  • I get the impression you are overthinking this. While I'd say, yes, describing this generated function this way is idiomatic in this specific version of MySQL, this seems to be the only part of your whole task that's idiomatic: Inserting derived data into another table is basically the opposite of idiomatic (it contradicts "normalization"). Changing the data structure to speed up a query (if it's not a permanently useful value)? Not sure if it's more idiomatic than waiting longer - if it's even faster, since we do not know what "id" stands for or if maybe "group by id" could be optimized... – Solarflare Sep 09 '22 at 15:22

0 Answers0