The Problem

We have a rapidly growing database with several large join tables (currently in the billions of rows), and query times have suffered as these tables have grown. The concern is that as more data is added to the tables linked by these join tables, the join tables will grow even faster and further degrade query speed.

The Background

I am dealing with a database that stores genomic information. A number of markers (~3 million), corresponding to loci where DNA variation occurs, are linked to individuals that have had their genotype determined at these loci. Every marker has a number of possible genotypes, of which every individual must have exactly one.

The Current Implementation

When the database (PostgreSQL) was still small, there were no problems in linking the genotypes to the markers using foreign keys and then linking the individuals to their genotypes through a join table. That way, it was easy to look up all of an individual's genotypes or look up all the individuals having a specific genotype.

Slimmed-down versions of these tables are listed below:

                                        Table "public.genotypes"
      Column      |            Type             |                       Modifiers                        
------------------+-----------------------------+--------------------------------------------------------
 id               | integer                     | not null default nextval('genotypes_id_seq'::regclass)
 ref_variation_id | integer                     | 
 value            | character varying(255)      |  
Indexes:
    "genotypes_pkey" PRIMARY KEY, btree (id)
    "index_genotypes_on_ref_variation_id" btree (ref_variation_id)


Table "public.genotypes_individuals"
    Column     |  Type   | Modifiers 
---------------+---------+-----------
 genotype_id   | integer | 
 individual_id | integer | 
Indexes:
    "index_genotypes_individuals_on_genotype_id_and_individual_id" UNIQUE, btree (genotype_id, individual_id)
    "index_genotypes_individuals_on_genotype_id" btree (genotype_id)

                                       Table "public.individuals"
    Column     |            Type             |                        Modifiers                         
---------------+-----------------------------+----------------------------------------------------------
 id            | integer                     | not null default nextval('individuals_id_seq'::regclass)
 hap_id        | character varying(255)      | 
 population_id | integer                     | 
 sex           | character varying(255)      | 
Indexes:
    "individuals_pkey" PRIMARY KEY, btree (id)
    "index_individuals_on_hap_id" UNIQUE, btree (hap_id)

The bottleneck right now is looking up all of the genotypes for an individual and having them sorted by their positions. This is used frequently and is much more important than looking up individuals from a genotype. Examples of some of these queries are:

  • A simple lookup of all of an individual's genotypes

    SELECT * FROM "genotypes" INNER JOIN "genotypes_individuals" ON "genotypes".id = "genotypes_individuals".genotype_id WHERE ("genotypes_individuals".individual_id = 2946 )

  • Normally, though, this gets limited because there are a lot of genotypes. We're often only interested in those on a specific chromosome.

    SELECT * FROM "genotypes" INNER JOIN "genotypes_individuals" ON "genotypes".id = "genotypes_individuals".genotype_id WHERE ("genotypes_individuals".individual_id = 2946 ) AND ("genotypes".ref_variation_id IN (37142, 37143, ...))

  • We also still need to occasionally go the other way.

    SELECT * FROM "individuals" INNER JOIN "genotypes_individuals" ON "individuals".id = "genotypes_individuals".individual_id WHERE ("genotypes_individuals".genotype_id = 53430)

Every time a new individual is added to the db, the join table grows by about 3 million rows. Intuitively, from a design perspective, this seems bad: adding new individuals will slow down any process that uses the existing data.

I understand that databases are designed to handle large tables efficiently, but we are already hitting bottlenecks due to drive IO. An individual query is still inconsequential, but thousands of them add up quickly. We can alleviate this problem somewhat by spreading the db across multiple drives, but I wanted to see if there are any other alternatives out there. I have been wondering if it is somehow possible to segregate the join table entries by individual_id, which might leave lookups from individuals to genotypes unaffected by additional individual-genotype rows being added to the join table. Or do indexes already do that?
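
As a sketch of what "segregating by individual_id" could look like at the index level (this index is illustrative and not part of the schema above): a btree that leads with individual_id keeps all of one individual's join rows in a contiguous range of the index, so a lookup by individual_id reads only that slice no matter how many other individuals are added.

    -- Illustrative only: a composite index leading with individual_id.
    -- In a btree, rows for a given individual_id occupy one contiguous
    -- range of the index, so per-individual lookups stay cheap as the
    -- table grows.
    CREATE INDEX index_genotypes_individuals_on_individual_id_and_genotype_id
        ON genotypes_individuals (individual_id, genotype_id);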

Connor
  • Could you describe the tables included in this JOIN and also what types of INDEX you're using? Wouldn't hurt if you posted the actual query as well :) – Stefan Aug 25 '11 at 08:04
  • I edited the original post to include some more concrete information. Hope that helps. Thanks. – Connor Aug 25 '11 at 14:07
  • You said the query time suffered as the tables grew. As the tables grew into billions of rows, did the hardware grow too? – Mike Sherrill 'Cat Recall' Aug 26 '11 at 01:33
  • Yes, the hardware has been upgraded such as moving the database to an ssd drive. We're currently in the process of upgrading again. However, we should also be able to tackle this problem from a database design perspective as well. – Connor Aug 28 '11 at 15:28

2 Answers

Have you looked at table partitioning?
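
In PostgreSQL this is typically done with table inheritance: child tables carve up the parent via CHECK constraints, and a trigger routes inserts to the right child. A minimal sketch for partitioning the join table by individual_id might look like the following (the range size of 1000 individuals per child and all names here are placeholder assumptions):

    -- Sketch: range-partition the join table by individual_id using
    -- inheritance. Partition bounds and names are placeholder assumptions.
    CREATE TABLE genotypes_individuals_p0 (
        CHECK (individual_id >= 0 AND individual_id < 1000)
    ) INHERITS (genotypes_individuals);

    CREATE TABLE genotypes_individuals_p1 (
        CHECK (individual_id >= 1000 AND individual_id < 2000)
    ) INHERITS (genotypes_individuals);

    -- Indexes are not inherited; each child needs its own.
    CREATE UNIQUE INDEX gi_p0_on_genotype_and_individual
        ON genotypes_individuals_p0 (genotype_id, individual_id);
    CREATE UNIQUE INDEX gi_p1_on_genotype_and_individual
        ON genotypes_individuals_p1 (genotype_id, individual_id);

    -- Route inserts on the parent into the matching child.
    CREATE OR REPLACE FUNCTION genotypes_individuals_insert()
    RETURNS trigger AS $$
    BEGIN
        IF NEW.individual_id < 1000 THEN
            INSERT INTO genotypes_individuals_p0 VALUES (NEW.*);
        ELSE
            INSERT INTO genotypes_individuals_p1 VALUES (NEW.*);
        END IF;
        RETURN NULL;  -- the row has already been stored in a child table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER genotypes_individuals_before_insert
        BEFORE INSERT ON genotypes_individuals
        FOR EACH ROW EXECUTE PROCEDURE genotypes_individuals_insert();

With constraint_exclusion enabled, a query that filters on individual_id scans only the children whose CHECK constraints can match, so lookups for one individual are largely unaffected by rows added for other individuals.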

Tony Andrews
  • No, I have not. Thanks. This is the sort of thing I had in mind, but I didn't know what it was called. I'm going to start exploring this and see how it goes. My only concern at the moment is that since I'm not writing SQL statements directly (I'm using Rails' ActiveRecord, as we're doing web-based data visualization), the two might not play together so nicely, i.e. you have to have triggers manage inserts and updates so the data makes it to the right table. – Connor Aug 25 '11 at 14:13

I would consider testing a schema that used natural keys instead of id numbers.

Your lookup of all of an individual's genotypes

SELECT * 
FROM "genotypes" 
INNER JOIN "genotypes_individuals" 
        ON "genotypes".id = "genotypes_individuals".genotype_id 
WHERE ("genotypes_individuals".individual_id = 2946 )

becomes

SELECT * 
FROM genotypes_individuals
WHERE (individual_id = 2946)

Sometimes that's faster. Sometimes it's not.

Switching to natural keys on our production system increased median performance by a factor of 10. Some queries ran 100 times faster with natural keys, because natural keys eliminated a lot of joins. Some queries ran slower, too. But the median speed-up was impressive anyway.
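
As a concrete (and purely illustrative) sketch of what that might look like for the tables in the question, the genotype's identifying attributes could be folded into the linking table so the common lookup needs no join at all. The column choices below are assumptions, not the poster's schema:

    -- Hypothetical natural-key layout: marker and genotype value live
    -- directly on the linking table, so looking up an individual's
    -- genotypes is a single index-range scan with no join.
    CREATE TABLE genotypes_individuals (
        individual_id    integer                NOT NULL REFERENCES individuals (id),
        ref_variation_id integer                NOT NULL,
        value            character varying(255) NOT NULL,
        PRIMARY KEY (individual_id, ref_variation_id)
    );

    -- All of an individual's genotypes at selected markers:
    SELECT ref_variation_id, value
    FROM genotypes_individuals
    WHERE individual_id = 2946
      AND ref_variation_id IN (37142, 37143);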

Mike Sherrill 'Cat Recall'