12

If I have a table column with data and create an index on this column, will the index take same amount of disc space as the column itself?

I'm interested because I'm trying to understand if b-trees actually keep copies of column data in leaf nodes or they somehow point to it?

Sorry if this a "Will Java replace XML?" kind question.

UPDATE:

created a table without index with a single GUID column, added 1M rows - 26MB

same table with a primary key (clustered index) - 25MB (even less!), index size - 176KB

same table with a unique key (nonclustered index) - 26MB, index size - 27MB

So only nonclustered indexes take as much space as the data itself.

All measurements were done in SQL Server 2005

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
Valentin V
  • 24,971
  • 33
  • 103
  • 152

3 Answers3

4

The B-Tree points to the row in the table, but the B-Tree itself still takes some space on disk.

Some database, have special table which embed the main index and the data. In Oracle, it's called IOT -- index-organized table.

Each row in a regular table can be identified by an internal ID (but it's database specific) which is used by the B-Tree to identify the row. In Oracle, it's called rowid and looks like AAAAECAABAAAAgiAAA :)

If I have a table column with data and create an index on this column, will the index take same amount of disc space as the column itself?

In a basic B-Tree, you have the same number of node as the number of item in the column.

Consider 1,2,3,4:

    1 
  / 
2
   \ 3 
      \ 4

The exact space can still be a bit different (the index is probably a bit bigger as it need to store links between nodes, it may not be balanced perfectly, etc.), and I guess database can use optimization to compress part of the index. But the order of magnitude between the index and the column data should be the same.

ewernli
  • 38,045
  • 5
  • 92
  • 123
3

I'm almost sure it's quite a DB dependent, but generally – yeah, they take additional space. This happens because of two reasons:

  1. This way you can utilize the fact the data in BTREE leafs is sorted;

  2. You gain lookup speed advantage as you don't have to seek back and forth to fetch neccessary stuff.

PS just checked our mysql server: for a 20GB table indexes take 10GB of space :)

Anton
  • 2,483
  • 2
  • 23
  • 35
0

Judging by this article, it will, in fact, take at least the same amount of space as the data in the column (in PostgreSQL, anyway). The article also goes to suggest a strategy to reduce disk and memory usage.

A way to check for yourself would be to use e.g. the derby DB, create a table with a million rows and a single column, check it's size, create an index on the column and check it's size again. If you take the 10-15 minutes to do so, let us know the results. :)

Tomislav Nakic-Alfirevic
  • 10,017
  • 5
  • 38
  • 51
  • I thought it depends on how many of the rows are indexed. For instance, if a table contains 30 rows but only 2 are searchable then you'd only need 2 b-trees for those specific rows. That's why it's important to **not** index rows that won't be used for lookups. – Evan Plaice Jul 05 '12 at 22:48
  • @Evan: I'm not sure I follow...did you mean "depends on how many of the columns are indexed"? – Tomislav Nakic-Alfirevic Aug 17 '12 at 12:57
  • Yes... I said columns didn't I ;) – Evan Plaice Aug 17 '12 at 20:21
  • In that case, sure, indexes aren't "good" by default: they come at a price, both in terms of RAM and disk (index storage) and in terms of CPU cycles (C-UD operations require indexes to be updated). – Tomislav Nakic-Alfirevic Aug 18 '12 at 23:17
  • Exactly. So if you have a high throughput table like a FIFO task queue, indexing any column can have a negative performance impact. Even on tables with a lower write rate (ex persistent data store) it'll save some time/resources to exclude columns that you only expect to be visible in the results. Generally, foreign keys are almost always a good idea to index because every additional foreign key lookup on a join adds an additional table search to that query. Joins can get expensive very fast, that's why many of the new no-SQL data architectures don't allow them. – Evan Plaice Aug 20 '12 at 17:41