2

What would be the data step equivalent of this proc sql?

proc sql;
create table issues2 as(
select request,
       area,
       sum(issue_count) as issue_count,
       sum(resolved_count) as resolved_count
    from 
        issues1
    group by request, area 
                         );
user2941280
  • 285
  • 3
  • 9
  • 16
  • You'll need proc means to do that, data steps do not do aggregations – scott Apr 16 '14 at 18:00
  • 1
    Is there any specific reason why you want to use DATA-step for that? Because actually PROC MEANS or PROC SUMMARY are designed for this. – Dmitry Shopin Apr 16 '14 at 18:00
  • Not really, just curious. I thought I had used a data step for this once before and was trying to remember how. Looks like I was mistaken and it was probably proc means. – user2941280 Apr 16 '14 at 18:02
  • You certainly could do it in a datastep, just not as simply as PROC MEANS. – Joe Apr 16 '14 at 18:03

4 Answers4

3

PROC MEANS/SUMMARY is better, but if it's relevant, the actual data step solution is as follows. Basically you just reset the counter to 0 on first.<var> and output on last.<var>, where <var> is the last variable in your by group.

Note: This assumes the data is sorted by request area. Sort it if it is not.

data issues2(rename=(issue_count_sum=issue_count resolved_count_sum=resolved_count) drop=issue_count resolved_count);
 set issues1;
 by request area;
 if first.area then do;
   issue_count_sum=0;
   resolved_count_sum=0;
 end;
 issue_count_sum+issue_count;
 resolved_count_sum+resolved_count;
 if last.area then output;
run;
Joe
  • 62,789
  • 6
  • 49
  • 67
  • Joe, You will need to do a proc sort before using the first. and last. variable operations no? – Keneni Apr 16 '14 at 18:25
  • 2
    If the data is not already sorted, it would need to be, yes. I'll add that as a note. (Although, you could do this when not sorted, if you intentionally did not want to group things together across gaps, if you add `notsorted` to the by statement. That would give a different result though unless the data was actually sorted.) – Joe Apr 16 '14 at 18:36
  • Joe, would you not need to retain issue_count_sum and resolved_count_sum? Can't test for myself atm I'm afraid. – Nate Apr 16 '14 at 22:24
  • @Nate They're retained automatically due to the use of the `+` operator. They also don't have to worry about adding something to missing as a result, so if you preferred to skip one initialization step, you could. – Joe Apr 16 '14 at 22:27
2

The functional equivalent of what you're trying to do is the following:

data _null_;
  set issues1(rename=(issue_count=_issue_count
                      resolved_count=_resolved_count)) end=done;

  if _n_=1 then do;
    declare hash total_issues();
    total_issues.defineKey("request", "area");
    total_issues.defineData("request", "area", "issue_count", "resolved_count");
    total_issues.defineDone();
  end;

  if total_issues.find() ne 0 then do;
    issue_count = _issue_count;
    resolved_count = _resolved_count;
  end;
  else do;
    issue_count + _issue_count;
    resolved_count + _resolved_count;
  end;

  total_issues.replace();

  if done then total_issues.output(dataset: "issues2");
run;

This method does not require you to to pre-sort the dataset. I wanted to see what kind of performance you'd get with using different methods so I did a few tests on a 74M row dataset. I got the following run-times (your results may vary):

Unsorted Dataset:

  • Proc SQL - 12.18 Seconds
  • Data Step With Hash Object Method (above) - 26.68 Seconds
  • Proc Means using a class statement (nway) - 5.13 Seconds

Sorted Dataset (36.94 Seconds to do a proc sort):

  • Proc SQL - 10.82 Seconds
  • Proc Means using a by statement - 9.31 Seconds
  • Proc Means using a class statement (nway) - 6.07 Seconds
  • Data Step using by statement (I used the code from Joe's answer) - 8.97 Seconds

As you can see, I wouldn't recommend using the data step with the hash object method shown above since it took twice as long as the proc sql.

I'm not sure why proc means with a bystatement took longer then proc means with a class statement, but I tried this on a bunch of different datasets and saw similar differences in runtimes (I'm using SAS 9.3 on Linux 64).

Something to keep in mind is that these runtimes might be completely different for your situation but I would recommend using the the following code to do the summation:

proc means data=issues1 noprint nway;
  class request area;
  var issue_count resolved_count;
  output out=issues2(drop=_:) sum=;
run;
Leo
  • 2,088
  • 1
  • 15
  • 14
  • PROC MEANS with CLASS is basically making hash tables in memory, while with BY is doing it a bit more manually (read in the rows for one BY group, accumulate a row at a time, output a row when finished). CLASS is always a bit faster than BY as a result. I imagine my solution and PROC MEANS with BY are roughly equivalent in how they work and thus have basically the same runtime. – Joe Apr 16 '14 at 21:19
  • Your HASH solution is likely slower due to size-of-memory issues. I don't know what you ran it on, but unless you had enough room to store it all in physical memory, you may have had some problems with memory swapping to physical disk. Temporary arrays might actually be equivalent or faster, if you have enough memory to store them. – Joe Apr 16 '14 at 21:21
  • Yea, that's what it seems like. The BY and CLASS thing really surprised me because I've always used BY when I had a sorted dataset thinking that it would run faster. – Leo Apr 16 '14 at 21:22
  • I don't think I ran into memory size issues. I ran it on a server with 48GB RAM and no other jobs running at the time. I'm not sure how you'd use a temporary array in this situation (especially if the dataset isn't sorted). – Leo Apr 16 '14 at 21:30
  • Temporary arrays are the old hash tables. They can do almost exactly the same things hash tables can, sometimes faster sometimes slower. I'll post an example. – Joe Apr 16 '14 at 21:31
  • One other note: you need to include request and area on your `DefineData` or they won't be output, which is sort of important. You also might want to add `ordered:'a'` to the `declare hash` statement in order to get ordered output (it doesn't seem to cost anything to do so). – Joe Apr 16 '14 at 21:51
  • Good catch. I found the error when I was doing the runtimes but I had already pasted the code in the answer box and forgot to change it there. – Leo Apr 16 '14 at 21:57
1

Awkward, I think, to do it in a data step at all - summing and resetting variables at each level of the by variables would work. A hash object might also do the trick.

Perhaps the simplest non-Proc SQL method would be to use Proc Summary:-

proc summary data = issues1 nway missing;
  class request area;
  var issue_count resolved_count;
  output out = issues2 sum(issue_count) = issue_count sum(resolved_count) = resolved_count ;
run;
rambles
  • 706
  • 5
  • 9
1

Here's the temporary array method. This is the "simplest" of them, making some assumptions about the request and area values; if those assumptions are faulty, as they often are in real data, it may not be quite as easy as this. Note that while in the below the data does happen to be sorted, I don't rely on it being sorted and the process don't gain any advantage from it being sorted.

data issues1;
do request=1 to 1e5;
  do area = 1 to 7;
    do issueNum = 1 to 1e2;
      issue_count = floor(rand('Uniform')*7);
      resolved_count = floor(rand('Uniform')*issue_count);
      output;
    end;
  end;
end;
run;

data issues2;
set issues1 end=done;
array ra_issue[1100000] _temporary_;
array ra_resolved[1100000] _temporary_;
*array index = request||area, so request 9549 area 6 = 95496.;
ra_issue[input(cats(request,area),best7.)] + issue_count;
ra_resolved[input(cats(request,area),best7.)] + resolved_count;
if done then do;
  do _t = 1 to dim(ra_issue);
    if not missing(ra_issue[_t]) then do;
      request = floor(_t/10);
      area    = mod(_t,10);
      issue_count=ra_issue[_t];
      resolved_count=ra_resolved[_t];
      output;
      keep request area issue_count resolved_count;
    end;
  end;
end;
run;

That performed comparably to PROC MEANS with CLASS, given the simple data I started it with. If you can't trivially generate a key from a combination of area and request (if they're character variables, for example), you would have to store another array of name-to-key relationships which would make it quite a lot slower if there are a lot of combinations (though if there are relatively few combinations, it's not necessarily all that bad). If for some reason you were doing this in production, I would first create a table of unique request+area combinations, create a Format and an Informat to convert back and forth from a unique key (which should be very fast AND give you a reliable index), and then do this using that format/informat rather than the cats / division-modulus that I do here.

Joe
  • 62,789
  • 6
  • 49
  • 67
  • I can see how that would be faster then using hash objects (though it does require you to know more about your data). – Leo Apr 16 '14 at 22:11
  • 1
    It's not ultimately a good solution for _most_ cases, of course, but it's technically possible. I think a 'hash of hashes' technique might be faster for the hash solution, actually; the problem with the hash solution is it takes too long to search the hash every single time. Build a hash for each area containing all of the requests in that area and you might have much better performance there. – Joe Apr 16 '14 at 22:18