
We found a performance issue while using a SQLCLR user-defined aggregate function (UDA).

Here is our scenario: we have two columns to aggregate, key and value, with data like this:

       key        value
Row_1  a/b/c/d/e  1/2/3/2/1
Row_2  a/b/c/d/e  2/0/1/2/3
Row_3  a/b/c/d/e  2/3/4/1/2

We need an aggregate function that computes the sum of each metric. For the example above, we want this result:

        key        value
Result  a/b/c/d/e  5/5/8/5/6

There is no native aggregate function that produces this kind of result, so we use a SQLCLR UDA. We found that the performance of this UDA is poor when it is used with a GROUP BY clause.
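For context, calling such a UDA looks like calling any built-in aggregate. This is a hypothetical sketch: the function name dbo.SumSlashList and the table name T_FATHER are assumptions for illustration, not names from the original post.

```sql
-- Hypothetical call site: dbo.SumSlashList is an assumed SQLCLR UDA that sums
-- '/'-delimited numeric lists element by element; T_FATHER is an assumed table
-- with [key] and [value] columns holding the delimited lists shown above.
SELECT [key], dbo.SumSlashList([value]) AS [value]
FROM   T_FATHER
GROUP BY [key];
```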

After some investigation, we found the following reasons:

  1. When we use a SQLCLR UDA, a Stream Aggregate operator must be used, which introduces an expensive Sort operator and degrades the performance of the UDA.
  2. When we use a SQLCLR UDA, only row mode can be used to calculate the results in the Sort and Aggregate operators.

So, my questions:

  1. Is there any way to force SQL Server to use a Hash Aggregate operator instead of a Stream Aggregate when using a user-defined aggregate function?
  2. Is there any way to get rid of the Sort operator when using a user-defined aggregate function?
  3. Is there any way to use batch mode when using a user-defined aggregate function?
othree

1 Answer


You will never get good performance when the database violates first normal form, because that means you no longer have a relational database, and a relational engine is designed to manipulate relational data quickly, not non-relational data.

This is not a problem of UDA performance; it is your design that leads to severely degraded performance.

First normal form says that a column in a table must ALWAYS hold a single (scalar) value. You store a list of values, which violates first normal form.

Just redesign the database: add a child table, put the keys and values in that child table, and you will get performance!

You can try this way:

CREATE TABLE T_CHILD
(ID_CHILD       INT IDENTITY PRIMARY KEY,
 ID_ROW         VARCHAR(32), --REFERENCES T_FATHER (ID_ROW),
 KEY_CHILD      VARCHAR(16),
 VALUE_CHILD    INT);
GO

-- The column list is needed because ID_CHILD is an IDENTITY column.
-- STRING_SPLIT's third argument (enable_ordinal) requires SQL Server 2022 or
-- Azure SQL; the join on ordinal pairs each key with the value at the same
-- position, instead of producing a cross product of keys and values.
INSERT INTO T_CHILD (ID_ROW, KEY_CHILD, VALUE_CHILD)
SELECT f.ID_ROW, k.value, v.value
FROM   T_FATHER AS f
       CROSS APPLY STRING_SPLIT(f.[key], '/', 1) AS k
       CROSS APPLY STRING_SPLIT(f.[value], '/', 1) AS v
WHERE  k.ordinal = v.ordinal;
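Once the values are normalized into rows of the T_CHILD table, the original requirement (a sum per key) becomes a plain native aggregate, which the optimizer is free to execute as a Hash Aggregate, and in batch mode on a columnstore, with no UDA, Stream Aggregate, or Sort required. A sketch:

```sql
-- Native aggregation over the normalized child table: no UDA needed.
SELECT KEY_CHILD, SUM(VALUE_CHILD) AS TOTAL_VALUE
FROM   T_CHILD
GROUP BY KEY_CHILD;
-- For the sample data in the question this yields a=5, b=5, c=8, d=5, e=6.
```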
SQLpro
  • Thanks for your reply and suggestion. We are using a columnstore index, so maybe I don't need a child table for this. But the problem is that if we want to add more keys and values to this table, it will require much more management effort than just storing the keys and values in a list (that way, we just append the keys and values to the list). – othree May 11 '21 at 10:23
  • Columnstore indexes have nothing to do with a list of values and will never turn a non-relational approach into a relational database. You must redesign your table structure. This is the only way to do it! – SQLpro May 11 '21 at 10:26
  • Your suggestion makes sense, but scalability is also an indicator we are very concerned about, which is the reason we chose to use a list. – othree May 11 '21 at 10:50
  • Scalability will be limited because there is no SQL data type without limits. CHAR/VARCHAR are limited to 8000 bytes and NCHAR/NVARCHAR to 4000 characters... while the number of rows in a table has no limit! And think about scaling by partitioning, which will not be easy with your current approach... – SQLpro May 11 '21 at 11:42
  • @othree Regarding your concern about scalability: please keep in mind the environment in which you are working. RDBMSs are highly tuned to efficiently process — read, write, sort, compare, join, etc. — discrete values, in rows, millions, even billions of rows. String parsing, while it can be done when necessary, is not the primary focus of an RDBMS. If your goal is scalability then you need to play to the strengths of the system you are working in. Hence, SQLpro's suggestion. Modern RDBMSs can scale beyond where many devs would expect, but only when used as an RDBMS, not app code. – Solomon Rutzky May 14 '21 at 23:30
  • @othree and SQLpro: regarding limits of delimited lists, `NVARCHAR(MAX)` would be "effectively" unlimited, but attempting to go that route would be to your detriment. Not only would it mean doing even more string parsing, which is not ideal, but there is a penalty for using a `MAX` type as opposed to `NVARCHAR(4000)`, even if the data is only 20 bytes. And `VARCHAR(MAX)` is not an option with SQLCLR. – Solomon Rutzky May 14 '21 at 23:33
  • Thank you @SQLpro and Solomon, I will take your advice and redesign our DB by splitting the column into multiple rows. I think a better way is to store these two split columns in a different table, and link the original table and the new table with an ID column. Thanks again. – othree May 25 '21 at 23:10