How to select efficiently in a very large table

Question

The table contains about 300 million rows. I need to select those rows based on two columns.

SELECT * 
FROM table_1 
WHERE column_1 
IN (SELECT column FROM table_2) 
AND column_2 
IN (SELECT column FROM table_2)

table_1 has 300 million rows. table_2 has 1 million distinct rows.

I also used the exists method:

SELECT * 
FROM table_1 
WHERE EXISTS (
SELECT 1 
FROM table_2 
WHERE column=table_1.column_1) 
AND EXISTS (
SELECT 1 
FROM table_2 
WHERE column=table_1.column_2)

But it is too slow. I created index on both columns in table_1 and column in table_2. It would take more than two hours on a 12G RAM Dell server.

Is there any better way to deal with such big table? Can Hadoop solve this problem?

Post the `explain` of the query. http://www.postgresql.org/docs/current/static/sql-explain.html — Clodoaldo Neto, Aug 07 '14 at 09:01

score 0 · Answer 1 · answered Aug 07 '14 at 08:36

0

USE THIS :

SELECT * 
FROM table_1 
INNER JOIN
    (SELECT DISTINCT column FROM table_2) tab2_1
        ON colum_1 = tab2_1.column 
INNER JOIN
    (SELECT DISTINCT column FROM table_2) tab2_2
        ON colum_2 = tab2_2.column

Hope this will helps you

answered Aug 07 '14 at 08:36

Youssef DAOUI

329
1
6

I know this method. And I searched the difference between in, exists, and join. I do not think Join would be faster. But I will try.Thanks. – Ben Aug 07 '14 at 09:02

score 0 · Answer 2 · answered Aug 07 '14 at 08:44

With a such big database, I would create a materialized view on this query, and then can do a simple SELECT * FROM table_view :

CREATE MATERIALIZED VIEW table_view AS
     SELECT * FROM table_1 
     WHERE column_1 IN (SELECT column FROM table_2) 
     AND column_2 IN (SELECT column FROM table_2);

And you just have to create a TRIGGER to update this view whenever you add or remove a row from table1 or table2.

How to select efficiently in a very large table

2 Answers2

Linked