Index on a table you must fully scan anyway? (MySQL)

Question

I'm pretty stumped here.

I have 2 tables and I'm left joining the first (around 500k records) with the second (around 2.2 million records) in order to find out which records are in the first and not the second. (typical "b.attribute is null" nonsense)

Why (how) is it that an index is utilized on the first table? It's going to have to go through EVERY record in the first table anyway, and yet when I try to do this join without any index (or primary key.. none needed because this is all just ETL) on the first table, it crawls.

using innodb, by the way.

Help?

EDIT : the 2nd table is indexed. The first wasn't.

score 2 · Answer 1 · answered Jan 07 '11 at 22:56

2

This should shed some light on it: http://dev.mysql.com/doc/refman/5.5/en/innodb-index-types.html

In short: All InnoDB tables have so called 'clustered index' (even if no explicit index is defined on the table, InnoDB creates it automatically), in which actual rows are stored.

answered Jan 07 '11 at 22:56

Mchl

61,444
9
118
120

I'm just not seeing where a speed improvement by adding regular indices would come from, (though that article was very informative). My only thought now is that the first table was always the table that was searched, as opposed to fully scanned, and it's now doing it in log time for each entry now. That's all I can come up with. Surely the new index I created wouldn't be used if it were superfluous? And it's easy enough to disable while doing ETL. But I still don't understand... – Brian Rosamilia Jan 07 '11 at 23:33
TBH neither do I. InnoDB is not as intuitive as it might seem at first glance. Also, optimizer might actually choose to use an index even if it's not needed, or even worse slows things down. It just can occasionaly not choose the best execution plan for your query. That's why there are means of forcing index usage. – Mchl Jan 07 '11 at 23:51

score 1 · Accepted Answer · answered Jan 07 '11 at 22:38

1

I have no idea if this is what is happening, but it would, in theory, be possible (depending on the actual query) for the database engine to be scanning the index for the left table rather than the table itself. It could construct the necessary key data for that. If scanning the index were faster than scanning the table, that could account for the speed difference.

answered Jan 07 '11 at 22:38

Mark Wilkins

40,729
5
57
110

Describe does tell me it uses both table's indexes. Perhaps it brings the index into memory rather than much of the table? I'm just not sure as to how this situation is handled; hence the question. – Brian Rosamilia Jan 07 '11 at 22:43
@brian, that is certainly possible. A full sequential read of a file can be extremely fast. So if the index were in contiguous space on the disk and it were not overly large, it would make sense for it to be read into memory. – Mark Wilkins Jan 07 '11 at 22:54
Indeed. I should note there was NOTHING indexed (no pk either) on the first table prior to me realizing an index made it much faster. IE : no possible way to have anything about that table in contiguous memory that would easily fit into ram. Still I'd prefer a more concrete answer if anyone wants to chime in – Brian Rosamilia Jan 07 '11 at 22:57
I have a feeling this is the case. My query spent most of its time writing to a tmp table prior to these indices. The table is unlikely to fit into ram but the index certainly could. Really, it's probably just poor server configuration/hardware but it's just a local DB as apart of an ETL process so I'll just continue solving the problem this way. Thanks. – Brian Rosamilia Jan 07 '11 at 23:56

Schultz9999 · Answer 3 · 2011-01-07T23:36:08.637

The purpose of the primary index is put things in order by sorting and creating a big tree (at least in SQL Server). B-tree if to be more specific. This means each record's key belongs to some place (or bucket) in the tree.

alt text

So why adding a key to the FIRST table helps to speed up the query? The reason is that when the query is executed, the FIRST table is being sorted since the SECOND table is already sorted due to presence of a primary key. This is due to the simple fact that comparing two sorted lists is much faster than doing binary search for each element. In this case, since there is no index, sorting takes time.

By the way, don't be confused by what I say. It's not really comparing lists, but more pruning the index tree on the above picture, e.g. if the T1 has K1, K2, K3 and K1 can be found in second bucket on the picture then there is no need to check first bucket for the rest of the keys.

AH crap I thank you for your effort but my question is regarding the lack of indices on the FIRST TABLE ONLY. In your example you're still bound by the full scan (500k) of the first table which clearly doesn't answer my question as to how an index provides a speed improvement for a table whose records must all be evaluated anyway. — Brian Rosamilia, Jan 07 '11 at 22:48

score 0 · Answer 4 · answered Jan 07 '11 at 22:33

0

MySQL doesn't have hash joins.

answered Jan 07 '11 at 22:33

Ronnis

12,593
2
32
52

MySQL has hash joins as of 8.0.19 – Dave Stokes Feb 06 '20 at 12:16

Index on a table you must fully scan anyway? (MySQL)

4 Answers4