0

I'm using SQL Server 2008 R2. I have a problem returning data to the user because of massive joins: for example, one query needs 5 inner plus 6 left joins (usually against TVFs, sometimes tables), and it takes far too long.

What are the workarounds for this problem? Should I denormalize my database? What are the best practices for avoiding a huge number of joins?

user194076
    You've not given us enough information to help you here. Truth be told, you may not be *able* to. But you could at least start with your schema and the query in question. – Andrew Barber Feb 26 '12 at 03:40
  • A word of caution when using table-valued functions in joins: if you are using multi-statement functions you can run into major performance issues because the optimizer is not able to use statistics for the tables queried in the function: http://stackoverflow.com/questions/2554333/multi-statement-table-valued-function-vs-inline-table-valued-function (a minimal illustration follows these comments) – J Cooper Feb 26 '12 at 04:18
  • Have you indexed the columns which are being joined to? That's important for performance. – StilesCrisis Feb 26 '12 at 03:34
  • I do have indexes both clustered and non-clustered. – user194076 Feb 26 '12 at 03:35
  • I just read about "columnstore indexes" and I do not have any of those. not sure if they would help – user194076 Feb 26 '12 at 03:36
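
To illustrate the TVF caution in the comment above, here is a minimal, hypothetical sketch (the table and function names are made up, not from the question): the multi-statement form materializes its rows into a table variable the optimizer has no statistics for, while the inline form is expanded into the calling query like a view.

    -- Multi-statement TVF: the optimizer cannot see statistics for the rows
    -- it returns, so joins against it are often estimated at a single row.
    CREATE FUNCTION dbo.GetOrders_MultiStatement (@CustomerID int)
    RETURNS @Result TABLE (OrderID int, OrderDate datetime)
    AS
    BEGIN
        INSERT INTO @Result (OrderID, OrderDate)
        SELECT OrderID, OrderDate
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID;
        RETURN;
    END
    GO

    -- Inline TVF: effectively a parameterized view; the base table's
    -- statistics remain visible to the optimizer.
    CREATE FUNCTION dbo.GetOrders_Inline (@CustomerID int)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT OrderID, OrderDate
        FROM dbo.Orders
        WHERE CustomerID = @CustomerID
    );
    GO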

2 Answers

1

I'd have to see the SQL to troubleshoot specifics, but here are a few things I do when pulling results that have extremely high demand:

  1. Use your tools. Display Estimated Execution Plan can expose some obvious vagaries in your logic.

  2. Learn to love 'where exists' and 'having'. You can sometimes minimize the focus and scope by qualifying in creative ways that don't require hard IO. This is more true for sub-queries than joins, but I add a clause for every outer join I need (see the sketch after this list).

  3. Most importantly IMO, don't be afraid of staging your results. You sometimes need to process billions/trillions of transactions against millions of records, and what takes hours with joins can be accomplished in minutes or seconds by staging. If you only need x% of your top 2 or 3 tables, why join every record top to bottom? Sometimes it's just too much overhead. Pull your simplest result-set down to a stage table (or temp table, whatever you need), index it, and then go after the next chunk (sketched after this list). That usually saves me a fortune in memory.

  4. Use CTEs when you can. However, my experience has been they degrade beyond a certain point. Nice for ancillary tables but not for serious volume.
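
A rough sketch of point 2 (the table names here are hypothetical): an EXISTS test only has to find one matching row, so you avoid producing a wide joined result just to check for the presence of related data.

    -- Qualify rows with EXISTS instead of joining just to test for a match.
    SELECT o.OrderID, o.OrderDate
    FROM dbo.Orders AS o
    WHERE EXISTS (SELECT 1
                  FROM dbo.OrderAudit AS a
                  WHERE a.OrderID = o.OrderID
                    AND a.Status  = 'FLAGGED');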
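
A rough sketch of the staging idea in point 3, with made-up names: land the narrow, well-qualified slice in a temp table, index it, and only then go after the heavier tables with that reduced key set.

    -- Step 1: stage only the rows you know you need.
    SELECT c.CustomerID, c.Region
    INTO #Stage
    FROM dbo.Customers AS c
    WHERE c.Region = 'WEST'
      AND c.IsActive = 1;

    CREATE CLUSTERED INDEX IX_Stage ON #Stage (CustomerID);

    -- Step 2: bring in the next chunk against the small staged set.
    SELECT s.CustomerID, o.OrderID, o.Total
    FROM #Stage AS s
    JOIN dbo.Orders AS o
        ON o.CustomerID = s.CustomerID;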

Be creative in your combinations. I'll use those exists clauses in Stage 1 (reading Tables a, b and c) to only bring back the records that also exist in tables d, e and f.

A lot of the expert SQL advice is not based on VLDBs - it's based on Customer, Orders, Demographic type schemas.

Are these stored procs run natively?

user1222760
  • Regarding #3 - a join doesn't mean that the database will actually match all records from both tables. It depends on the execution plan. A query joining 2 tables with each having millions of records could be as cheap as 10 or so logical reads. – J Cooper Feb 26 '12 at 05:01
  • Using a CTE will not affect performance. SQL server expands the CTE when it processes the query. – J Cooper Feb 26 '12 at 05:10
  • This is true if your indexes are precise across all joins, and then it'll depend on how your primary table is qualified, whether or not you have non-qualified criteria or fields, outer joins, etc. All bets are off so to speak and you're at the mercy of your plan... and it's not dynamic. With VLDBs, it's a good rule to never do more than one 'inquiry' at a time - meaning if you have to read a record to see if it qualifies, a subsequent 'inquiry' compounds your overhead. – user1222760 Feb 26 '12 at 05:33
  • Can you please elaborate a little bit more on #3? How to achieve this functionality? Just create some staging table that will already have joined results, so when the user asks for a result, it's already going to be there? – user194076 Feb 26 '12 at 14:09
  • It's really tough not knowing the specifics of your needs or the situation... but you have options. I'm not a fan of storing redundant data but some organizations do it. IMO it's generally bad practice unless you have serious hardware limitations. I'm more referring to splitting up the workload so you're not indexing / reading rows you're not sure you need. Get the data you KNOW you need in Step 1. That means you have well-defined criteria and indexes. From that result-set, go after your more loosely defined data: outer joins, non-indexed criteria that has to be qualified, etc. – user1222760 Feb 26 '12 at 18:41
  • Volume is a factor as well. For example: if I needed all the green houses facing north with Spanish-speaking residents in the universe, I wouldn't join galaxy, solar_system, planet, country, territory, region, city, etc. (think Gene Roddenberry's universe). That could be trillions of connections. I might get planets where exists language = 'SPANISH' first and then bring in correlated data. That's a stretch, but you get the idea. If you could post your logic, you'll probably get many good ideas on how to make it run in a flash. – user1222760 Feb 26 '12 at 18:55
0

Here's a good (over-simplified) example of staging:

Let's say you wanted to find all of the high-risk individuals in your city (might as well be interesting about it). You have a phone company DB (national) indexed by state, city, last name, first name, and address, and an FBI DB (global) indexed by last name, first name, country, region, and address. Let's say the FBI DB has multiple records for each individual due to multiple past addresses.

You could join the two DBs on the common elements and then qualify your criteria. Or:

    SELECT RecordID
    FROM Phone AS P1
    WHERE State = 'MyState'
      AND City  = 'MyCity'
      AND EXISTS (SELECT 1
                  FROM TheMan AS M1
                  WHERE M1.Last  = P1.Last
                    AND M1.First = P1.First
                    AND M1.Risk  > 80);

Now I have a small record-set to qualify and a small result-set to work from. From there I can go get details. That's a good candidate for a CTE, and I could shoot a dozen holes in the logic, but it illustrates the concept. If you bring M1.Risk (a non-indexed field) into the equation with a full join, you're forcing SQL Server to plan against it in certain situations - not necessarily here, but it matters as your logic gets more complex and subsequent non-indexed criteria come into play.
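
To make the "go get details" step concrete, here is a rough continuation of the same made-up schema: land the qualified keys in a stage table first, then pull the wide detail columns only for that small set.

    -- Stage the small, qualified key set first.
    SELECT P1.RecordID, P1.Last, P1.First
    INTO #HighRisk
    FROM Phone AS P1
    WHERE P1.State = 'MyState'
      AND P1.City  = 'MyCity'
      AND EXISTS (SELECT 1
                  FROM TheMan AS M1
                  WHERE M1.Last  = P1.Last
                    AND M1.First = P1.First
                    AND M1.Risk  > 80);

    CREATE CLUSTERED INDEX IX_HighRisk ON #HighRisk (Last, First);

    -- Now pull the details only for the keys that qualified.
    SELECT h.RecordID, m.Risk, m.Address
    FROM #HighRisk AS h
    JOIN TheMan AS m
        ON m.Last = h.Last AND m.First = h.First;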

user1222760