0

The table contains about 300 million rows. I need to select those rows based on two columns.

SELECT * 
FROM table_1 
WHERE column_1 
IN (SELECT column FROM table_2) 
AND column_2 
IN (SELECT column FROM table_2)

table_1 has 300 million rows. table_2 has 1 million distinct rows.

I also used the exists method:

SELECT * 
FROM table_1 
WHERE EXISTS (
SELECT 1 
FROM table_2 
WHERE column=table_1.column_1) 
AND EXISTS (
SELECT 1 
FROM table_2 
WHERE column=table_1.column_2)

But it is too slow. I created index on both columns in table_1 and column in table_2. It would take more than two hours on a 12G RAM Dell server.

Is there any better way to deal with such big table? Can Hadoop solve this problem?

Ullas
  • 11,450
  • 4
  • 33
  • 50
Ben
  • 665
  • 1
  • 10
  • 27

2 Answers2

0

USE THIS :

SELECT * 
FROM table_1 
INNER JOIN
    (SELECT DISTINCT column FROM table_2) tab2_1
        ON colum_1 = tab2_1.column 
INNER JOIN
    (SELECT DISTINCT column FROM table_2) tab2_2
        ON colum_2 = tab2_2.column

Hope this will helps you

Youssef DAOUI
  • 329
  • 1
  • 6
  • I know this method. And I searched the difference between in, exists, and join. I do not think Join would be faster. But I will try.Thanks. – Ben Aug 07 '14 at 09:02
0

With a such big database, I would create a materialized view on this query, and then can do a simple SELECT * FROM table_view :

CREATE MATERIALIZED VIEW table_view AS
     SELECT * FROM table_1 
     WHERE column_1 IN (SELECT column FROM table_2) 
     AND column_2 IN (SELECT column FROM table_2);

And you just have to create a TRIGGER to update this view whenever you add or remove a row from table1 or table2.

Kabulan0lak
  • 2,116
  • 1
  • 19
  • 34