31

I have two massive tables with about 100 million records each, and I'm afraid I need to perform an inner join between the two. Now, both tables are very simple; here's the description:

BioEntity table:

  • BioEntityId (int)
  • Name (nvarchar 4000, although this is an overkill)
  • TypeId (int)

EGM table (an auxiliary table, in fact, resulting from bulk import operations):

  • EMGId (int)
  • PId (int)
  • Name (nvarchar 4000, although this is an overkill)
  • TypeId (int)
  • LastModified (date)

I need to match on Name in order to associate the BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query appeared to be taking way too long and the log file of the database (in simple recovery mode) managed to chew up all the available disk space (just over 200 GB, while the database occupies 18 GB), and the query failed after waiting for two days, if I'm not mistaken. I managed to keep the log from growing (only 33 MB now), but the query has been running non-stop for 6 days now and it doesn't look like it's gonna stop anytime soon.

I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.

Now, here's the query:

 SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
 FROM EGM INNER JOIN BioEntity 
 ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway).

It has been running for 6 days now and I'm still not sure what to do... Any ideas guys? How can I make this faster (or, at least, finite)?

Update: OK, I've canceled the query and rebooted the server to get the OS up and running again. I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is gonna take at least two hours, so I'll be posting further updates later on.

Update 2 (1 PM GMT, 18/11/09): The estimated execution plan reveals a 67% cost for table scans, followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan, but this particular fact just lifted my eyebrow), 0% hash match, another 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be making manual indexes and discarding the suggested ones.

ROMANIA_engineer
João Pereira
  • Just curious... why do you need the 100+ million rows back and what are you going to do with all this data?? – Otávio Décio Nov 17 '09 at 16:22
  • What is the largest value stored in your 4k name field? If it is substantially less than 4k then reduce the size in each table. – Alistair Knock Nov 17 '09 at 16:23
  • It should be really much smaller than 4k, just thought that since it's an Nvarchar field it wouldn't really matter... I think I can probably cut it down to less than 200 chars, maybe. – João Pereira Nov 17 '09 at 16:24
  • @Hal - yes, I can see that, but there is no way in my mind to have any program being able to swallow that amount of data and survive. There must be a missing "where" clause. – Otávio Décio Nov 17 '09 at 16:24
  • It's really supposed to be a huge amount of data; the "where" clause is applied by enforcing restriction regarding the nature of the data itself (defined in the TypeId field). – João Pereira Nov 17 '09 at 16:26
  • I would respectfully disagree about 3G as "decent" for a server tasked as this. Re: disk; consider that every join starts as cross-join and is "filtered" on ON, later on WHERE and then on SELECT columns. – Damir Sudarevic Nov 17 '09 at 17:04
  • I'm afraid I'm stuck to this hardware configuration for now :/ – João Pereira Nov 18 '09 at 15:06

16 Answers

19

I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.

You could try adding an integer column to each table and computing a hash of the NAME field; that should reduce the possible matches to a reasonable number before the engine has to look at the actual VARCHAR data.
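A minimal sketch of that idea in T-SQL, assuming SQL Server's built-in CHECKSUM is an acceptable hash (the NameHash column name is made up for illustration):

-- Persisted computed columns holding an integer hash of Name
ALTER TABLE BioEntity ADD NameHash AS CHECKSUM(Name) PERSISTED;
ALTER TABLE EGM       ADD NameHash AS CHECKSUM(Name) PERSISTED;

Since CHECKSUM can collide, the join would still need a final comparison on Name itself; the hash only narrows the candidate rows before that comparison.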

Larry Lustig
  • Checksum would work but, depending on the nature of the data in NAME, you might be able to use a faster hashing algorithm (perhaps the NAME tends to be unique in the first ten chars, or something like that). – Larry Lustig Nov 17 '09 at 22:32
  • If Name gets set and never changed, the hash would only ever have to be calculated once, so hash calculation speed might not be so relevant. – Philip Kelley Nov 19 '09 at 14:50
  • Maybe so, and maybe there is no reasonable optimization to be made to the hash algorithm. Starting with CHECKSUM is a good, easy solution. Still, over a hundred million records... – Larry Lustig Nov 19 '09 at 15:50
12

For huge joins, sometimes explicitly choosing a loop join speeds things up:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM 
INNER LOOP JOIN BioEntity 
    ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId

As always, posting your estimated execution plan could help us provide better answers.

EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM 
INNER JOIN BioEntity 
    ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)
Andomar
8

First, 100M-row joins are not at all unreasonable or uncommon.

However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.

One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).
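For instance, a rough way to time the join without materializing the result, assuming the LIKE has already been swapped for plain equality as discussed in the comments (counting is only an approximation, since the optimizer may read fewer columns than the real query would):

SELECT COUNT(*)
FROM EGM
INNER JOIN BioEntity
    ON EGM.Name = BioEntity.Name
   AND EGM.TypeId = BioEntity.TypeId;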

RickNZ
6

Maybe a bit off-topic, but: "I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."

This behavior is characteristic of a cheap RAID 5 array (or maybe a single disk) copying gigabytes of information (and your query mostly copies data).

More about the problem: can't you partition your query into smaller blocks, like names starting with A, B, etc., or IDs in specific ranges? This could substantially decrease transaction/locking overhead.
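A sketch of one such slice, assuming the LIKE join has been replaced with equality and using a hypothetical per-slice target table (AUX_A):

-- One slice: only names starting with 'A'; repeat for 'B%', 'C%', ... or for ID ranges
SELECT EGM.Name, BioEntity.BioEntityId
INTO AUX_A
FROM EGM
INNER JOIN BioEntity
    ON  EGM.TypeId = BioEntity.TypeId
    AND EGM.Name   = BioEntity.Name
WHERE EGM.Name LIKE 'A%';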

Arvo
  • so this got upvoted for posting the same thing i already asked? – DForck42 Nov 18 '09 at 14:42
  • Thirster42, I didn't see your answer when I posted (otherwise I would have referred to your posting). And what's the problem with upvoting? Did you lose something? – Arvo Nov 18 '09 at 22:27
4

I'd try removing the LIKE operator, since you don't seem to be doing any wildcard matching.
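That would just be the question's query with the LIKE swapped for a plain equality test (the same change the asker later made in the update):

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId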

Jim B
3

As recommended, I would hash the name to make the join more reasonable. I would also strongly consider assigning the ID during the import of batches via a lookup, if possible, since this would eliminate the need to do the join later (and potentially having to repeat such an inefficient join).

I see you have an index on TypeId - this would help immensely if it is at all selective. In addition, add a column with the hash of the name to the same index:

SELECT EGM.Name
       ,BioEntity.BioEntityId
INTO AUX 
FROM EGM 
INNER JOIN BioEntity  
    ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
    AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
    AND EGM.name LIKE BioEntity.Name
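A sketch of such an index on each table, assuming the NameHash persisted computed column already exists (index names are illustrative); Name is carried as an included column so the final comparison doesn't need a key lookup:

CREATE NONCLUSTERED INDEX IX_EGM_TypeId_NameHash
    ON EGM (TypeId, NameHash) INCLUDE (Name);
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId_NameHash
    ON BioEntity (TypeId, NameHash) INCLUDE (Name, BioEntityId);  -- BioEntityId covers the query's output even on a heap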
Cade Roux
2

Another suggestion I might offer: try working with a subset of the data instead of processing all 100M rows at once to tune your query. That way you don't have to spend so much time waiting to see whether your query is going to finish. Then you could inspect the query execution plan, which may also provide some insight into the problem at hand.

Wil P
  • This, and having exactly the right, smallest indexes possible (possibly another pre-processing step), are the keys to tractability. – Don Nov 17 '09 at 17:05
1

100 million records is HUGE. I'd say to work with a database that large you'd require a dedicated test server. Using the same machine to do other work while performing queries like that is not practical.

Your hardware is fairly capable, but for joins that big to perform decently you'd need even more power. A quad-core system with 8 GB would be a good start. Beyond that, you have to make sure your indexes are set up just right.

Dave Swersky
1

Do you have any primary keys or indexes? Can you do the select in stages, i.e. WHERE name LIKE 'A%', then WHERE name LIKE 'B%', etc.?

DForck42
1

I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not end either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway).

You said you made a clustered index on TypeId in both tables, although it appears you already have a primary key on each table (BioEntityId and EGMId, respectively). You do not want TypeId to be the clustered index on those tables. You want BioEntityId and EGMId to be clustered (that will physically sort your data on disk in the order of the clustered index). You want non-clustered indexes on the foreign keys you will be using for lookups, i.e. TypeId. Try making the primary keys clustered, and adding a non-clustered index on both tables that ONLY CONTAINS TypeId.
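A minimal sketch of that layout, assuming neither table currently has a primary key constraint or clustered index defined (constraint and index names here are made up, and the EGM key column follows the EMGId name given in the question's table description):

ALTER TABLE BioEntity ADD CONSTRAINT PK_BioEntity PRIMARY KEY CLUSTERED (BioEntityId);
ALTER TABLE EGM       ADD CONSTRAINT PK_EGM       PRIMARY KEY CLUSTERED (EMGId);

-- Non-clustered lookup indexes containing only TypeId
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId ON BioEntity (TypeId);
CREATE NONCLUSTERED INDEX IX_EGM_TypeId       ON EGM (TypeId);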

In our environment we have tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, where we combine two datasets on one or two columns. Adding an index on each foreign key should help out a lot with your performance.

Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.

K. Scott has a pretty good article here which explains some issues more in depth.

karlgrz
  • I know. I did that, but the results were not really what I expected. I gave it a go because SQL Server Database Tuning Advisor suggested it; still think it's stupid – João Pereira Nov 18 '09 at 12:43
1

Reiterating a few prior posts here (which I'll vote up)...

How selective is TypeId? If you only have 5, 10, or even 100 distinct values across your 100M+ rows, the index does nothing for you -- particularly since you're selecting all the rows anyway.
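A quick way to check the distribution, on either table:

SELECT TypeId, COUNT(*) AS RowsPerType
FROM BioEntity
GROUP BY TypeId
ORDER BY COUNT(*) DESC;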

Creating a column based on CHECKSUM(Name) in both tables seems good. Perhaps make it a persisted computed column:

CREATE TABLE BioEntity
 (
   BioEntityId  int
  ,Name         nvarchar(4000)
  ,TypeId       int
  ,NameLookup  AS checksum(Name) persisted
 )

and then create an index like so (I'd use clustered, but even nonclustered would help):

CREATE clustered INDEX IX_BioEntity__Lookup on BioEntity (NameLookup, TypeId)

(Check BOL, there are rules and limitations on building indexes on computed columns that may apply to your environment.)

Done on both tables, this should provide a very selective index to support your query if it's revised like this:

SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
 FROM EGM INNER JOIN BioEntity 
 ON EGM.NameLookup = BioEntity.NameLookup
  and EGM.name = BioEntity.Name
  and EGM.TypeId = BioEntity.TypeId

Depending on many factors it will still run long (not least because you're copying how much data into a new table?) but this should take less than days.

Philip Kelley
  • Yup, only a hundred or so items in TypeId. Thanks for the input, i'll give a try soon... my 2 year old MBP just died on me, i'm off to the store :/ – João Pereira Nov 18 '09 at 15:02
1

Why nvarchar? Best practice is: if you don't NEED (or expect to need) Unicode support, just use varchar. If you think the longest name is under 200 characters, I'd make that column a varchar(255). I can see scenarios where the hashing that has been recommended to you would be costly (this database seems insert-intensive). With data this large, however, and given the frequency and random nature of the names, your indexes will fragment quickly in most scenarios where you index on a hash (depending on the hash) or on the name.

I would alter the name column as described above and make the clustered index TypeId, EGMId/BioEntityId (the surrogate key for either table). Then you can join nicely on TypeId, and the "rough" join on Name will have less to loop through. To see how long this query might run, try it on a very small subset of your TypeIds; that should give you an estimate of the run time (although it might ignore factors like cache size, memory size, and hard disk transfer rates).
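A rough sketch of those two changes, assuming the data really fits in 255 characters and needs no Unicode, that any existing indexes on Name have been dropped first, and that no clustered index exists yet on either table (adjust NULL/NOT NULL to match the current definitions; all names are illustrative):

ALTER TABLE BioEntity ALTER COLUMN Name varchar(255);
ALTER TABLE EGM       ALTER COLUMN Name varchar(255);

CREATE CLUSTERED INDEX IX_BioEntity_TypeId_Key ON BioEntity (TypeId, BioEntityId);
CREATE CLUSTERED INDEX IX_EGM_TypeId_Key       ON EGM (TypeId, EMGId);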

Edit: If this is an ongoing process, you should enforce the foreign key constraint between your two tables for future imports/dumps. If it's not ongoing, the hashing is probably your best bet.

marr75
1

I would try to solve the issue outside the box; maybe there is some other algorithm that could do the job much better and faster than the database. Of course it all depends on the nature of the data, but there are some string-search algorithms that are pretty fast (Boyer-Moore, ZBox, etc.), as well as other data-mining approaches (MapReduce?). By carefully crafting the data export, it could be possible to bend the problem to fit a more elegant and faster solution. It could also be possible to parallelize the problem better and, with a simple client, make use of the idle cycles of the systems around you; there are frameworks that can help with this.

The output of this could be a list of refid tuples that you could use to fetch the complete data from the database much faster.

This does not prevent you from experimenting with indexes, but if you have to wait 6 days for the results, I think that justifies spending resources exploring other possible options.

My 2 cents.

Newtopian
0

Since you're not asking the DB to do any fancy relational operations, you could easily script this. Instead of killing the DB with a massive yet simple query, try exporting the two tables (can you get offline copies from the backups?).

Once you have the tables exported, write a script to perform this simple join for you. It'll take about the same amount of time to execute, but won't kill the DB.

Due to the size of the data and length of time the query takes to run, you won't be doing this very often, so an offline batch process makes sense.

For the script, you'll want to index (or hash) the larger dataset, then iterate through the smaller dataset and do lookups against that index. That keeps the run roughly linear in the combined size of the two tables, rather than the O(n*m) of a naive nested loop.

Jason Peacock
0

I wonder whether the execution time is taken by the join or by the data transfer.

Assuming the average data size in your Name column is 150 characters, you will actually have 300 bytes (nvarchar stores two bytes per character) plus the other columns per record. Multiply this by 100 million records and you get about 30 GB of data to transfer to your client. Do you run the client remotely or on the server itself? Maybe you're waiting for 30 GB of data to be transferred to your client...

EDIT: OK, I see you are inserting into the AUX table. What is the recovery model setting of the database?

To investigate the bottleneck on the hardware side, it might be interesting to find out whether the limiting resource is reading data or writing data. You can start a run of Windows Performance Monitor and capture the read and write queue lengths of your disks, for example.
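From inside SQL Server, the I/O statistics DMV gives a rough per-file read/write breakdown (a sketch; the numbers are cumulative since the last restart):

SELECT DB_NAME(vfs.database_id)             AS database_name,
       mf.name                              AS file_name,
       vfs.num_of_bytes_read / 1048576      AS mb_read,
       vfs.num_of_bytes_written / 1048576   AS mb_written,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id;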

Ideally, you should place the DB log file, the input tables and the output table on separate physical volumes to increase speed.

Tom H
Jan
  • Recovery model is set to simple; learned that the hard way :) Simple yet logical suggestion regarding the separate physical volumes, I'm only using a single HDD. Thanks! I'm posting the execution plan estimate now by the way – João Pereira Nov 18 '09 at 12:48
0

If the hash match consumes too many resources, then do your query in batches of, say, 10000 rows at a time, "walking" the TypeID column. You didn't say the selectivity of TypeID, but presumably it is selective enough to be able to do batches this small and completely cover one or more TypeIDs at a time. You're also looking for loop joins in your batches, so if you still get hash joins then either force loop joins or reduce the batch size.

Using batches will also, in simple recovery mode, keep your tran log from growing very large. Even in simple recovery mode, a huge join like you are doing will consume loads of space because it has to keep the entire transaction open, whereas when doing batches it can reuse the log file for each batch, limiting its size to the largest needed for one batch operation.
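A sketch of what that batching could look like, walking TypeId one value per batch; this assumes the AUX table has been created up front and that the LIKE has been replaced with equality (widen the WHERE to cover several TypeIds per batch if individual ones are small):

DECLARE @TypeId int;
SELECT @TypeId = MIN(TypeId) FROM EGM;

WHILE @TypeId IS NOT NULL
BEGIN
    INSERT INTO AUX (Name, BioEntityId)
    SELECT EGM.Name, BioEntity.BioEntityId
    FROM EGM
    INNER LOOP JOIN BioEntity   -- force a loop join per batch
        ON  EGM.TypeId = BioEntity.TypeId
        AND EGM.Name   = BioEntity.Name
    WHERE EGM.TypeId = @TypeId;

    SELECT @TypeId = MIN(TypeId) FROM EGM WHERE TypeId > @TypeId;
END

Each batch commits separately, so the log space can be reused between batches in simple recovery mode.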

If you truly need to join on Name, then you might consider some helper tables that convert names into IDs, basically repairing the denormalized design temporarily (if you can't repair it permanently).

The idea about checksum can be good, too, but I haven't played with that very much, myself.

In any case, such a huge hash match is not going to perform as well as batched loop joins. If you could get a merge join it would be awesome...

ErikE