7

I have a non-indexed table with 2 billion rows (bigtable) on a read-only SAS SPD Server. I have another table with 12 million rows in my workspace (SAS_GRID), consisting of a single column of unique ids (idlist). Both tables are updated constantly. I want to filter bigtable based on idlist, something like:

create table filtered_bigtable as
select t1.* from bigtable t1 inner join idlist t2
on t1.id = t2.id;

What's the best strategy and code (sql or sas) for doing it quickly?

Edit: tried Robert's suggestion using a hash table and the query ran in only 40 minutes (10 times faster). Thanks for all the answers!

Will Razen
  • 307
  • 1
  • 11
  • Any interest in Hadoop or NoSQL? – Twelfth Nov 07 '17 at 22:27
  • SAS can connect to a Hadoop server... but if you don't have one already, moving the data over will take too long anyway. And if you're on a Grid environment, adding a new server type usually isn't an overnight solution. – Reeza Nov 07 '17 at 22:28
  • 1
    For a one-off query a hash lookup will only require one pass through bigtable. Creating the index would require reading the entire dataset and then writing the index back to disk, so it will always be slower for a single query. However, if you need to re-run the query (or a similar query), I'd expect you'd get better performance from an index on the ID variable, since you're only selecting a small proportion of rows. It does depend somewhat on how the IDs you want are distributed across the disk(s). – david25272 Nov 08 '17 at 02:06
  • @david25272 I don't have write access to bigtable though, and it is constantly updated. – Will Razen Nov 09 '17 at 20:14

4 Answers

6

Hash tables of course!

Hash tables can be used as a very fast method of joining two tables. The smaller table is read into memory. The location in RAM of any given row from the small table can be identified by running the key (in this case id) through a very fast hashing algorithm. This circumvents the need for an index, provided that you have enough memory to hold the keys from the smaller table. With only 12m rows you should be fine.

Once the keys from the smaller table are in memory, the DATA step simply iterates over the larger table, runs the hashing algorithm against each id in the large table to see if it gets a hit against the values from the small table registered in RAM, and if it does, outputs the row.

The only overhead is loading the small table into memory (very fast), and the hashing of each key in the big table (very fast). The memory lookup time may as well be considered instant.

It's incredibly efficient because it's only reading each table once from disk. Using an index effectively results in reading the smaller table many times (or at least the index of the smaller table).

data filtered_bigtable;
  set bigtable;

  /* On the first iteration, load the ids from idlist into an in-memory hash table */
  if _n_ eq 1 then do;
    declare hash ht(dataset:'idlist');
    ht.definekey('id');
    ht.definedone();
  end;

  /* Output the row only if its id is found in the hash table */
  if ht.find() eq 0 then do;
    output;
  end;
run;

Hash tables can be used for all kinds of programmatic goodness in SAS, so be sure to read up on them. They offer a lot more than just joins.

Also - be sure to keep just the columns you require from the larger table as this can reduce the amount of time spent reading in the rows from the larger table.
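For example, a KEEP= data set option on the SET statement reads only the columns you need. A minimal sketch, where var1 and var2 are placeholder column names and not columns from the original question:

data filtered_bigtable;
  /* KEEP= limits the read to the join key plus the columns you actually need;
     var1 and var2 are placeholders for illustration only */
  set bigtable(keep=id var1 var2);

  if _n_ eq 1 then do;
    declare hash ht(dataset:'idlist');
    ht.definekey('id');
    ht.definedone();
  end;

  if ht.find() eq 0 then output;
run;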

Robert Penridge
  • 8,424
  • 2
  • 34
  • 55
  • 1
    Great answer! For an additional performance boost you could also change the ht.find() to ht.check(). Instead of finding and returning the value it only checks to see if it exists. – Seba Nov 08 '17 at 09:17
  • For even more performance: if you have the ability to fit multiple copies of the hash into memory, submit a bunch of asynchronous SAS/CONNECT jobs, each reading a separate range of rows from the big table. – user667489 Nov 08 '17 at 14:14
  • @Seba isn't check() equivalent to find() if I haven't defined a data variable? – Will Razen Nov 08 '17 at 16:34
  • Could you recommend a decent source for further reading about hash tables? (Preferably in a SAS environment.) – pinegulf Nov 09 '17 at 13:55
  • 2
    @pinegulf The two papers I refer to most often are "Data Step Hash Objects as Programming Tools" (https://pdfs.semanticscholar.org/9156/3cbb2a9e12a537ac2ee7a23f9ffd87bd0f2e.pdf#) and the "Sas Hash Object Tip Sheet" (https://support.sas.com/rnd/base/datastep/dot/hash-tip-sheet.pdf#) – Robert Penridge Nov 10 '17 at 15:40
  • Is that hash table code equivalent to create table filtered_bigtable as select t1.* from bigtable t1 inner join idlist t2 on t1.id = t2.id? I am new to hash table code. –  Nov 15 '18 at 15:58
0

You'll want to make sure the columns you are filtering on are indexed. I'm not too great with PROC SQL (I use SQL Server a lot more), but in your case you are filtering on t1.id, so that's the column to index.

PROC SQL;
CREATE INDEX ID
ON BIGTABLE(ID);
QUIT;

Similarly, you might want to go ahead and index the second table.
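A sketch of the same statement for the lookup table, assuming you have write access to it:

PROC SQL;
CREATE INDEX ID
ON IDLIST(ID);
QUIT;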

Andrew
  • 373
  • 2
  • 8
  • I can't index the bigtable though, as my access is read-only. Even if I had write permission, wouldn't it be too resource-intensive to maintain an index? (bigtable is updated every day) – Will Razen Nov 08 '17 at 16:38
  • 1
    In short, no. You'd want to look into index maintenance plans, but that's the whole point of indexing: rapid retrieval of data from large tables. Though I can't really speak for incremental loads in SAS... – Andrew Nov 08 '17 at 19:43
  • If you can schedule adding the index after bigtable is created, it probably won't add that much time to the creation process, and it will certainly speed up ad-hoc queries. Time spent sitting on your hands waiting for an ad-hoc query costs a business much more than server time spent indexing a table in a batch job at midnight. – david25272 Nov 09 '17 at 22:12
0

SPDS has always been quite inefficient at joining tables, as far as I remember. If you can load those 12M IDs into memory, creating a format and using it in a WHERE condition would be my first choice.

I think SAS claims that a few years ago they enabled SPDS to use user-defined formats in pass-through, but no SAS consultant was ever able to show me how to make those formats work in the put or input functions. Still, even if you read bigtable and do the filtering outside of SPDS, it will be faster than a merge join (imagine sorting bigtable) or a hash join (SPDS can read records in parallel, but a hash join runs in a single thread as far as I remember).

Good luck!

Edit: a rough sketch of creating the format and using it in the where clause is below; let me know if you need more detail.
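A minimal sketch of that approach, assuming id is numeric and that a 12M-entry format fits in memory (the format name idflag is just a placeholder):

/* Build a CNTLIN dataset: each id in idlist maps to 'Y',
   everything else ('other') maps to 'N' */
data idfmt;
  set idlist end=last;
  retain fmtname 'idflag' type 'N';   /* numeric format named IDFLAG */
  start = id;
  label = 'Y';
  output;
  if last then do;
    hlo   = 'O';                      /* catch-all for ids not in idlist */
    label = 'N';
    output;
  end;
run;

proc format cntlin=idfmt;
run;

/* Filter bigtable using the format in a WHERE clause */
data filtered_bigtable;
  set bigtable;
  where put(id, idflag.) = 'Y';
run;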

Petr
  • 376
  • 1
  • 6
0

I have the feeling this method should be quite fast, but that's pure intuition, so let us know which method is fastest.

PROC SQL;
create table filtered_bigtable as
select t1.*
from bigtable as t1
where t1.id in (select distinct id from idlist)
;
QUIT;
M4hd1
  • 133
  • 8