
Here's a very similar question.

My question is a bit different from the one in the link above.

Background

I have a data set that contains hourly data, so each object has 24 records per day. I want to create K new columns that hold the next 1, 2, ..., K hourly values for each object. Where a next value does not exist, it should be set to missing.

K is dynamic and defined by the user.

The original order must be preserved, whether that's guaranteed within the data steps or restored by sorting at the end.

I'm looking for an efficient way to achieve this.

Example

Original data:

Object Hour Value 
A      1    2.3
A      2    2.3
A      3    4.0
A      4    1.3

Given K = 2, desired output is

Object Hour Value Value1 Value2
A      1    2.3   2.3    4.0
A      2    2.3   4.0    1.3
A      3    4.0   1.3    .
A      4    1.3   .      .

Possible solutions

  1. Sort in reverse order -> obtain the previous K records -> sort back. When the number of observations is large, this isn't an ideal approach.

  2. proc expand. I'm not familiar with it because it has never been licensed on my machine.

  3. Using point in a data step.

  4. A retain statement inside a data step. I'm not sure how this would work.
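For reference, option 1 could be sketched roughly like this. This assumes the input data set is called HAVE with variables Object, Hour, Value (names chosen for illustration), and relies on the LAGn calls executing unconditionally so their queues stay aligned:

```sas
/* Sketch of option 1: reverse sort -> LAGn -> sort back. */
%macro revlag(k=);
  proc sort data=have out=rev;
    by object descending hour;
  run;

  data fwd;
    set rev;
    %do _i = 1 %to &k;
      /* in reversed order, LAGn looks n hours ahead */
      value&_i = lag&_i(value);
      /* blank out values pulled across an object boundary */
      if lag&_i(object) ne object then value&_i = .;
    %end;
  run;

  proc sort data=fwd out=want;
    by object hour;
  run;
%mend revlag;

%revlag(k=2);
```

As the question notes, the two extra sorts are the cost here; on a large data set the sort passes dominate.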

Lovnlust
  • Probably an example would help more in understanding the problem. – in_user May 19 '15 at 09:33
  • Why do you need multiple set statements? POINT doesn't work by reading the whole dataset. It randomly accesses just the one record. It could easily be in a loop. It's not as efficient as normal access, but it's not reading all records every access. – Joe May 19 '15 at 17:09
  • In any event: how is K specified by users? In data, in macro variables, in prompt, etc.? – Joe May 19 '15 at 17:09
  • K is defined as a macro variable. – Lovnlust May 20 '15 at 01:35
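As Joe's comment describes, POINT= reads single records by direct access rather than scanning the whole data set. A rough sketch of that approach (hypothetical; assumes the data set HAVE is sorted by Object and Hour, and K is a macro variable):

```sas
%let K=2;

data want;
  set have nobs=n;                      /* main sequential read */
  array nxt(&K) value1-value&K;
  do i = 1 to &K;
    nxt(i) = .;
    p = _n_ + i;                        /* row i hours ahead */
    if p <= n then do;
      /* direct access to one record; rename avoids clobbering
         the current row's variables */
      set have (keep=object value
                rename=(object=_obj value=_val)) point=p;
      if _obj = object then nxt(i) = _val;  /* stay within the object */
    end;
  end;
  drop i p _obj _val;
run;
```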

2 Answers


You could transpose the hours and then freely access the hours ahead within each object. Just to set the value of K and generate some dummy data:

* Assign K ;
%let K=3 ;
%let Kn=value&k;

* Generate test objects each containing 24 hourly records ;
data time ;
  do object=1 to 10 ;
      do hour=1 to 24 ;
        value=round(ranuni(1)*10,0.1) ;
        output ;
      end ;
  end ;
run ;

EDIT: I updated the step below after realising the transpose isn't needed. Doing it all in one step gives a ~20% improvement in CPU time.

Use an array of the 24 hour values and loop through do i=1 to &k for each hour:

* Populate K variables ;
data output(keep=object hour value value1-&kn ) ;
  set time ;
  by object ;
  retain k1-k24 . ;
  array k(2,24) k1-k24 value1-value24 ;

  k(1,hour)=value ;

  if last.object then do hour=1 to 24 ;
    value=k(1,hour) ;
    do i=1 to &k ;
      if hour+i <=24 then k(2,i)=k(1,hour+i) ;
      else k(2,i)=.;
    end ;
    output ;
  end ;
run ;
Bendy

Assuming K is provided as a macro variable, this is pretty easily done with a side-by-side merge-ahead. It's certainly faster than a transpose when K is much smaller than the total record count, and probably faster than looping with POINT.

Basically you merge the original dataset to itself, and use FIRSTOBS to push the starting point down one for each successive merge iteration. This needs a bit of extra work if you have BY groups that need protecting, but that's usually not too hard to manage.

Here's an example using SASHELP.CLASS:

%let K=5;

%macro makemergesets(k=, datain=, varin=, keepin=);
  %do _i = 2 %to &k;
    &datain (firstobs=&_i rename=&varin.=&varin._&_i. keep=&keepin. &varin.)
  %end;
%mend makemergesets;

data class_all;
  merge sashelp.class
    %makemergesets(k=&k,datain=sashelp.class, varin=age,keepin=)
  ;
run;
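To see what the macro is doing: with K=3, for example, the MERGE statement above would expand to roughly the following. Each successive copy of the data set starts one observation later, so AGE_2 and AGE_3 pick up the values from the next one and two rows:

```sas
/* Sketch of the expanded code for K=3 */
data class_all;
  merge sashelp.class
    sashelp.class (firstobs=2 rename=age=age_2 keep=age)
    sashelp.class (firstobs=3 rename=age=age_3 keep=age)
  ;
run;
```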
Joe
  • Even with the `firstobs` option, for a very large data set of size `n`, won't `makemergesets` still read the data set k times? – Lovnlust May 20 '15 at 01:38
  • It only reads from disk once. The rest of the 'reads' are buffered and take nearly no time. I tested this against the other solution posted and it was a few seconds faster on a large dataset, for example (though this as presented doesn't handle the by groups, that's not hard to add if needed). – Joe May 20 '15 at 01:44
  • If `firstobs` is not specified, does it still read from disk only once? – Lovnlust May 20 '15 at 02:11
  • Firstobs has nothing to do with it, as far as I know. SAS buffers its reads when reading in the same dataset multiple times in one datastep. – Joe May 20 '15 at 02:12
  • I like it as it saves a lot of typing...! Although if speed is important I ran both answers on the test dataset expanding to 1M objects and this gave `cpu time 24.01 seconds` compared to the transpose which gave `cpu time 15.94 seconds` – Bendy May 20 '15 at 08:46
  • Mine on a larger dataset (2.4e8 rows) had nearly the same cpu time; slightly lower CPU time for the non-transpose solution, slightly lower disk time for this solution, but both within margin of error. They should be nearly identical as both read basically the same thing and write basically the same thing (yours writes a few more variables, I think, probably explaining the difference in disk time). – Joe May 20 '15 at 14:00