So, you're right in one sense that DS/2 may be helpful here. However, I suspect it's a bit more complicated.
DS/2 will happily thread data steps, but what is going to be more challenging is writing to several different datasets. That's because there's not a great way to structure the output dataset name without using the macro language, which won't play with the threading very well as far as I can tell (though I'm no expert here).
Here's an example of it using threading:
    PROC DS2;
      thread in_thread/overwrite=yes;
        dcl bigint count;
        drop count;
        method init();
          count=0;
        end;
        method run();
          set in_data;
          count+1;
          output;
        end;
        method term();
          put 'Thread' _threadid_ ' processed' count 'observations.';
        end;
      endthread;
      run;

      data out_data/overwrite=yes;
        dcl thread in_thread t_in; /* instance of the thread */
        method run();
          set from t_in threads=4;
          output;
        end;
      enddata;
      run;
    quit;
But this just writes one dataset out, and if you change threads=4 to threads=1, it doesn't actually take any longer. Both are okay speed-wise, though actually slower than the regular data step (about 1.8x the run time for me). DS/2 uses a much, much slower method to access data under the hood than SAS's base data step when accessing SAS datasets; DS/2's speed gains really come into play when you're working in RDBMSs via SQL or similar.
However, there's no good way to drive the output in parallel. Here's the version of the above changed to write 4 datasets. Notice that the actual selection of where to output happens in the main, non-threaded data step:
    PROC DS2;
      thread in_thread/overwrite=yes;
        dcl bigint count;
        dcl bigint thisThread;
        drop count;
        method init();
          count=0;
        end;
        method run();
          set in_data;
          count+1;
          thisThread = _threadid_;
          output;
        end;
        method term();
          put 'Thread' _threadid_ ' processed' count 'observations.';
        end;
      endthread;
      run;

      data a b c d/overwrite=yes;
        dcl thread in_thread t_in; /* instance of the thread */
        method run();
          set from t_in threads=4;
          select(thisThread);
            when (1) output a;
            when (2) output b;
            when (3) output c;
            when (4) output d;
            otherwise;
          end;
        end;
      enddata;
      run;
    quit;
So it's actually quite a lot slower than in the non-threaded version. Oops!
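For comparison, a non-threaded base data step that splits the input four ways could look roughly like the sketch below. The mod(_n_, 4) rule is just my assumption for illustration (it is not how the threads divide up the rows); any rule that sends each row to exactly one dataset would do.

```sas
/* Non-threaded baseline: one pass over in_data, four output datasets. */
/* The mod(_n_, 4) split is an illustrative assumption, not a          */
/* reproduction of how DS/2 distributes rows across threads.           */
data a b c d;
  set in_data;
  select (mod(_n_, 4));
    when (1) output a;
    when (2) output b;
    when (3) output c;
    otherwise output d;  /* mod(_n_, 4) = 0 */
  end;
run;
```

This reads the input once and writes each row once, so it is a fair yardstick for timing the threaded version against.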
Really, your issue here is that disk i/o is the main problem, not CPU. Your CPU does virtually no work here. DS/2 might be able to help in some edge cases where you have a really fast SAN that allows tons of simultaneous writes, but ultimately it takes X amount of time to read those million records and the same X amount of time to write a million records, based on your i/o constraint, and odds are parallelizing that won't help.
Hash tables will help a lot more, I suspect, and could certainly be used here with DS/2; see my new answer on the other question linked in OP for the data step version. DS/2 probably won't make that solution any faster, and more likely will make it slower; but you could implement roughly the same thing in DS/2 if you wanted, and then the sub-thread would be able to output on its own without involving the master thread.
Where DS/2 would be helpful would be if you're doing this in Teradata or something, where you can use SAS's in-database accelerator to execute this code database-side. That would make things a lot more efficient. Then you could use something similar to my code above, or better yet a hash solution.