
Good day,

I wish to merge each date to the next closest one.

Datasets are huge, 500 MB to 1 GB, so PROC SQL is out of the question.

I have two data sets. The first (Fleet) has the observations; the second has a date and the generation number to use for further processing. Like this:

data Fleet
    CreatedPortalDate 
    2013/2/19 
    2013/8/22 
    2013/8/25 
    2013/10/01 
    2013/10/07 

data gennum_list
    date 
    01/12/2014 
    08/12/2014 
    15/12/2014 
    22/12/2014 
    29/12/2014 
    ...

What I'd like to have is a link-table like this:

data link_table
    CreatedPortalDate date 
    14-12-03  01/12/2014 
    14-12-06  01/12/2014 
    14-12-09  08/12/2014 
    14-12-11  08/12/2014 
    14-12-14  08/12/2014 

With logic that

Date < CreatedPortalDate and (CreatedPortalDate - date) = min(CreatedPortalDate - date)
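Restated, the rule picks, for each CreatedPortalDate, the latest gennum date strictly before it. As a sanity check of that rule (in Python rather than SAS; the function name is mine), a binary search over the sorted gennum dates gives the same answer:

```python
from bisect import bisect_left
from datetime import date

# Sorted gennum dates (weekly here, but the gap may vary).
gennum_dates = [date(2014, 12, 1), date(2014, 12, 8), date(2014, 12, 15)]

def closest_gennum(created):
    """Latest gennum date strictly before `created`, or None if there is none."""
    i = bisect_left(gennum_dates, created)
    return gennum_dates[i - 1] if i > 0 else None

print(closest_gennum(date(2014, 12, 9)))   # 2014-12-08
print(closest_gennum(date(2014, 12, 6)))   # 2014-12-01
```

Because `bisect_left` returns the insertion point, an exact match on a gennum date still maps to the previous one, keeping the inequality strict.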

What I came up with is a bit clunky and I'm looking for an efficient/better way to accomplish this.

data all_comb;
  set Fleet(keep=createdportaldate);
  do i = 1 to n;
    set gennum_list(keep=date) point=i nobs=n;
    if createdportaldate > date
       and createdportaldate - 15 < date then do; /* Assumption: the generations are created weekly. */
      distance = createdportaldate - date;
      output;
    end;
  end;
run;

proc sort data=all_comb; by createdportaldate distance; run;

data link_table;
    set all_comb(drop=distance);
    by createdportaldate;
    if first.createdportaldate;
run;
  • Any ideas how to improve or approach this issue?
  • Ignorant idea: could I create hash tables where the distance would be stored?
  • Arrays, maybe? Somehow.

EDIT:

  • common format
    • Done
  • Where do the billion rows come from?
    • Yes, there are other data involved but the date is the only linking variable.
  • Sorted?
    • Yes, the data is sorted and can be sorted again.
  • Are the gennum dates always seven days apart?
    • No. That's the tricky part. Otherwise I could use week and year (or other binning) as a unique identifier.
pinegulf
  • Please edit so dates shown are all in a common format such as yyyy-mm-dd. You only list unique date values in the sample data. 100 years has only ~36,500 days. Where do the billion rows come from? Are there other columns such as account id involved? The sample data shown is sorted. Is the real data sorted? Are gennum dates *always* seven days apart? – Richard Feb 05 '18 at 19:31
  • @Richard updated the question. – pinegulf Feb 06 '18 at 07:19

2 Answers


Huge is a relative term, today's huge is tomorrow's speck.

Key data features indicate a direct-addressing lookup scheme is possible:

  • Date values are integers.
  • Date value ranges are limited.
  • A date value, or any of the next 14 days, will be used as a lookup verifier.
  • The key is a date value, which can be used as an array index.

Load the gennum lookup once as follows:

array gennum_of ( %sysfunc(today()) ) _temporary_;
if last_date then
  do index = last_date to date-1;
    gennum_of(index) = last_date;
  end;

last_date = date;

And fetch a gennum as:

if portaldate > last_date
  then portal_gennum = last_date;
  else portal_gennum = gennum_of ( portaldate );

If you have many rows due to grouping by account ids, you will have to clear and reload the gennum array per group.
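The direct-addressing idea can be sketched in Python for illustration (the names and the toy gennum dates are mine, not from the answer): day numbers serve as indices into a flat array, built once, so each fetch is a single O(1) subscript.

```python
from datetime import date

gennum_dates = [date(2014, 12, 1), date(2014, 12, 8), date(2014, 12, 15)]

# Build the lookup once: every day in [g[k], g[k+1]) maps to g[k].
origin = gennum_dates[0].toordinal()
size = gennum_dates[-1].toordinal() - origin
gennum_of = [None] * size
last = None
for g in gennum_dates:
    if last is not None:
        for idx in range(last.toordinal(), g.toordinal()):
            gennum_of[idx - origin] = last
    last = g

def lookup(portal):
    # Past the final gennum date: fall back to the last gennum, as in the answer.
    if portal > last:
        return last
    i = portal.toordinal() - origin
    return gennum_of[i] if 0 <= i < size else None

print(lookup(date(2014, 12, 9)))  # 2014-12-08
```

The array trades memory for speed exactly as the SAS `_temporary_` array does: one slot per calendar day in range, filled once, then indexed directly per fleet row.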

Richard

This is a typical application of the SAS by statement.

The by statement in a data step is meant to read two or more data sets at once, sorted by a common variable.

The common variable is the date, but it is named differently in the two datasets. In SQL, you solve that by requiring equality of the one variable to the other, Fleet.CreatedPortalDate = gennum_list.date, but the by statement does not allow such a construction, so we have to rename (at least) one of them while reading the datasets. That is what we do in the rename clause within the options of gennum_list.

data all_comb;
    merge gennum_list (in = in_gennum rename = (date = CreatedPortalDate))
          Fleet (in = in_fleet);
    by CreatedPortalDate;

I chose to combine the by statement with a merge statement. A set statement would have done the job too, but then the order of the two input datasets would make a difference.

Also note that I asked SAS to create the indicator variables in_gennum and in_fleet, which show which input dataset each value came from. It is handy to know that these variables are not written to the result data set.

However, we have to recover the date from the CreatedPortalDate, of course:

    if in_gennum then date = CreatedPortalDate;

If you are new to SAS, you will be surprised that the above statement does not work unless you explicitly instruct SAS to retain the value of date from one observation to the next. (Observation is SAS jargon for row.)

    retain date;

And here we write out one observation for each observation read from the Fleet dataset:

    if in_fleet then output;
run;
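The merge-and-retain logic can be sketched in Python for illustration (variable names and sample dates are mine): a single pass over the two sorted inputs, interleaved by date with gennum rows first on ties, carrying the last seen gennum date forward the way `retain` does.

```python
import heapq
from datetime import date

fleet = [date(2014, 12, 3), date(2014, 12, 6), date(2014, 12, 9)]
gennum_list = [date(2014, 12, 1), date(2014, 12, 8)]

link_table = []
retained = None  # plays the role of the retained `date` variable
# merge() interleaves both sorted inputs by date; gennum rows sort first on ties.
for d, src in heapq.merge(
        ((g, 'gennum') for g in gennum_list),
        ((f, 'fleet') for f in fleet),
        key=lambda t: (t[0], t[1] != 'gennum')):
    if src == 'gennum':
        retained = d                      # "if in_gennum then date = CreatedPortalDate"
    else:
        link_table.append((d, retained))  # "if in_fleet then output"

print(link_table)  # pairs of (CreatedPortalDate, gennum date)
```

Like the data step, this never materialises the cross product: each input row is touched once, so the work is linear in the combined input size.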

The advantages of this approach are:

  • you need much less logic to correctly combine the observations from both input datasets (and that is what the data step was invented for)
  • you never have to retain an array of values in memory, so you cannot have overflow problems
  • this solution is linear, O(n), in the size of the datasets (apart from the sorting), so we know up front that doubling the amount of data will only double the time.

    Disclaimer: this answer is under construction.

It will be tested later this week

Dirk Horsten
  • Thanks for the input so far. Looking forward to more. However, I'm not sure if this works when the two dates (`created` and `date`) are not matching. Or am I missing something? – pinegulf Feb 06 '18 at 08:57
  • You are probably missing the function of the `retain` statement. – Dirk Horsten Feb 06 '18 at 14:18