
I have a table with a few relational columns and one XML column which sometimes holds a fairly large chunk of data. I also have a simple web service that uses the database. I need to be able to report on things like all the instances of a certain element within the XML column, or a list of all the distinct values for a certain element - that sort of thing.

I was able to get a list of all the distinct values for an element, but didn't get much further than that. I ended up writing incredibly complex T-SQL code to do something that seems pretty simple in C#: go through all the rows in this table, and apply this (XPath | XQuery | XSLT) to the XML column. I can filter on the relational columns to reduce the amount of data, but this is still a lot of data for some of the queries.
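For illustration, here's roughly the shape of the distinct-values query I ended up with (the table, column, and element names here are simplified stand-ins for the real ones):

    -- Hypothetical schema: dbo.Results(Id INT PRIMARY KEY, JobId INT, Payload XML)
    -- Distinct values of every <Status> element anywhere in the XML column
    SELECT DISTINCT s.n.value('.', 'varchar(100)') AS StatusValue
    FROM dbo.Results AS r
    CROSS APPLY r.Payload.nodes('//Status') AS s(n)
    WHERE r.JobId = 42;   -- narrow with relational predicates first

Anything much beyond that (correlating several elements per row, aggregating across files) is where the T-SQL started to balloon.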

My plan was to embed an assembly in SQL Server (I'm using SQL Server 2008 SP2) and have it create an indexed view on the fly for a given query (I'd have other logic to clean this view up). This would allow me to keep the network traffic down, and possibly also allow me to use tools like Excel and MSRS reports as a cheap user interface, but I'm seeing a lot of people saying "just use application logic rather than SQL assemblies". (I could be barking up entirely the wrong tree here, I guess.)

Grabbing the big chunk of data to the web service and doing the processing there would have benefits as well - I'm less constrained by the SQL Server environment (since I don't live inside it) and my setup process is easier. But it does mean I'm bringing a lot of data over the network, storing it in memory while I process it, then throwing some of it away.

Any advice here would be appreciated.

Thanks

Edit:

Thanks guys, you've all been a big help. The issue was that we were generating one row in the table per file, each file could have multiple results, and we were doing this each time we ran a particular build job. I wanted to flatten this out into a table view.

Each execution of this build job checked thousands of files for several attributes, and in some cases each of these tests was generating thousands of results (MSIVAL tests were the worst culprit).

The answer (duh!) is to flatten it out before it goes into the database! Based on your feedback, I decided to try creating a row for each result for each test on each file, with the XML holding just the details of that one result - this made the query much simpler. Of course, we now have hundreds of thousands of rows each time we run this tool, but the performance is much better. I now have a view which creates a flattened version of one of the classes of results emitted by the build job. It returns >200,000 rows and takes <5 seconds, compared to around 3 minutes for the equivalent (complicated) query before I went the flatter route, and between 10 and 30 minutes for the XML file processing of the old (non-database) version.
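For the record, the flattened view looks something like this (names are placeholders; the real one has more columns):

    -- One row per result per test per file; the XML now holds only that one result
    CREATE VIEW dbo.FlatMsivalResults
    AS
    SELECT r.FileId,
           r.TestName,
           r.ResultXml.value('(/Result/Code)[1]',    'int')            AS Code,
           r.ResultXml.value('(/Result/Message)[1]', 'nvarchar(400)')  AS Message
    FROM dbo.BuildResults AS r
    WHERE r.TestName = 'MSIVAL';

Because each XML blob is now tiny, the couple of value() calls per row are cheap, which I assume is where most of the speedup came from.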

I now have some issues with the number of times I connect, but I have an idea of how to fix that.

Thanks again! +1's all round

JohnL
    Aren't the standard xml tools in TSQL sufficient? http://msdn.microsoft.com/en-us/library/ms189075.aspx – Filip De Vos May 26 '11 at 19:52
  • In theory, but I was having real trouble, since most of those examples either deal with turning relational data into XML (the opposite of what I want to do!) or with only one record – JohnL May 27 '11 at 08:31
  • I think you found a good solution, as it sounds more "relational" now. – Filip De Vos May 27 '11 at 09:37

2 Answers


I suggest using the standard xml tools in TSQL (http://msdn.microsoft.com/en-us/library/ms189075.aspx). If you don't wish to use this, I would recommend processing the xml on another machine. SQLCLR is perfect for smaller functions, but with the restrictions on the usable methods it tends to become an exercise in frustration once you try to do more advanced things.
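The usual pattern (table and element names here are made up) is nodes() with CROSS APPLY to shred, value() to extract, and exist() to filter:

    -- Flatten one row per <Result> element
    SELECT r.Id,
           x.n.value('(Code)[1]',    'int')            AS Code,
           x.n.value('(Message)[1]', 'nvarchar(400)')  AS Message
    FROM dbo.Results AS r
    CROSS APPLY r.Payload.nodes('//Result') AS x(n);

    -- Or filter whole rows on an XML predicate without shredding
    SELECT r.Id
    FROM dbo.Results AS r
    WHERE r.Payload.exist('//Result[Code = 0]') = 1;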

Filip De Vos
  • Hmmm, "more advanced things" was what I need to use it for, so this a definite downpoint for SQLCLR (at least in this use case) – JohnL May 27 '11 at 08:34

What you're asking about is really a huge balancing act, and it totally depends on several factors. First, what's the current load on your database? If you're running this on a database that is already under heavy load, you're probably going to want to do this parsing on the web service. XML shredding and querying is an incredibly expensive procedure in SQL Server, especially if you're doing it on un-indexed columns that don't have a schema defined for them. Schemas and indexes help with this processing overhead, but they can't eliminate the fact that XML parsing isn't cheap.

Second, consider the amount of data you're working with. It's entirely possible that you just have too much data to push over the network. Depending on the location of your servers and the amount of data, you could face insurmountable problems here.
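If the data does stay in SQL Server, schemas and indexes are the first lever to pull. As a rough sketch (object names are illustrative, and the base table needs a clustered primary key):

    -- The primary XML index persists the shredded node table so queries
    -- don't have to re-parse the blob on every access
    CREATE PRIMARY XML INDEX PXML_Results_Payload
        ON dbo.Results (Payload);

    -- A secondary PATH index speeds up path-based predicates like exist()
    CREATE XML INDEX IXML_Results_Payload_Path
        ON dbo.Results (Payload)
        USING XML INDEX PXML_Results_Payload
        FOR PATH;

The trade-off is that XML indexes can be several times the size of the XML itself and slow down inserts, so they only pay off when you query far more often than you load.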

Finally, what are the relative specs of your machines? If your web service machine has low memory, it's going to be thrashing data in and out of virtual memory trying to parse the XML, which will destroy your performance. Maybe you're not running the most powerful database hardware, and shredding XML is going to be performance-prohibitive for the CPU you've got on your database machine.

At the end of the day, the only way to really know is to try both ways and figure out what makes sense for you. Doing the development on your web services machine will almost undoubtedly be easier, as LINQ to XML is a more elegant way of parsing through XML than XQuery shoehorned into T-SQL. My inclination, given the information you provided in your question, is that T-SQL is going to perform better for you in the long run, because you're doing XML parsing on every row, or at least most rows, in the database for reporting purposes. Pushing that kind of information over the network is just ugly. That said, if performance isn't that important, there's something to be said for taking the easier and more maintainable route of doing all the parsing on the application server.

Jared Harding
  • This is a brand new database (some of the XML files produced by our build system are quite large, so I decided to migrate this to a database for performance reasons), and it's not decided yet where it will live. – JohnL May 27 '11 at 08:36