The table contains about 300 million rows. I need to select those rows based on two columns.
SELECT *
FROM table_1
WHERE column_1
IN (SELECT column FROM table_2)
AND column_2
IN (SELECT column FROM table_2)
table_1 has 300 million rows. table_2 has 1 million distinct rows.
I also used the exists method:
SELECT *
FROM table_1
WHERE EXISTS (
SELECT 1
FROM table_2
WHERE column=table_1.column_1)
AND EXISTS (
SELECT 1
FROM table_2
WHERE column=table_1.column_2)
But it is too slow. I created index on both columns in table_1 and column in table_2. It would take more than two hours on a 12G RAM Dell server.
Is there any better way to deal with such big table? Can Hadoop solve this problem?