Hash tables of course!
Hash tables can be used as a very fast method of joining two tables. The smaller table is read into memory. The location in RAM for any given row from the small table can be identified by running the key (in this case id
) through a very fast hashing algorithm. This circumvents the need for an index provided that you have enough memory to hold that keys from the smaller table in memory. With only 12m rows you should be fine.
Once the keys from the smaller table are read into memory, it simply iterates over the larger table, runs the hashing algorithm against each id in the large table to see if it finds a hit against the values from the small table registered in RAM, and if it does, outputs the row.
The only overhead is loading the small table into memory (very fast), and the hashing of each key in the big table (very fast). The memory lookup time may as well be considered instant.
It's incredibly efficient because it's only reading each table once from disk. Using an index effectively results in reading the smaller table many times (or at least the index of the smaller table).
data filtered_bigtable;
set bigtable;
if _n_ eq 1 then do;
declare hash ht(dataset:'idlist') ;
ht.definekey('id');
ht.definedone();
end;
if ht.find() eq 0 then do;
output;
end;
run;
Hashtables can be used for all kinds of programmatic goodness in SAS so be sure to read up on them. They offer a lot more than just joins.
Also - be sure to keep just the columns you require from the larger table as this can reduce the amount of time spent reading in the rows from the larger table.