
I'm getting a "The remote Process is out of memory" error in SAS DIS (Data Integration Studio).

Since it is possible that my approach is wrong, I'll explain the problem I'm working on and the solution I've decided on:

I have a large list of customer names which need cleaning. To achieve this, I use a .csv file containing regular expression patterns and their corresponding replacements. (I use this approach because it is easier to add new patterns to the file and upload it to the server for the deployed job to read, rather than hardcoding new rules and redeploying the job.)
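
For reference, the rules file is just a two-column CSV (pattern and replacement). Below is a minimal sketch of how it might be read into WORK.CLEANING_RULES; the file path, delimiter, and lengths are assumptions, but the column names match the variables used in the data step further down.

    /* Hypothetical load of the rules CSV (path, delimiter, and lengths are assumptions). */
    DATA WORK.CLEANING_RULES;
        INFILE '/sasdata/rules/cleaning_rules.csv' DSD DLM=',' FIRSTOBS=2 TRUNCOVER;
        LENGTH rule_string_match rule_string_replace $200;
        INPUT rule_string_match $ rule_string_replace $;
    RUN;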

To make use of the rules in the file, my data step loads the patterns and their replacements into a temporary array on its first iteration and then applies them to each name. Something like:

DATA &_OUTPUT;
    /* Compile each cleaning rule once, on the first pass through the data step */
    ARRAY rule_nums{1:&NOBS} _temporary_;
    IF (_n_ = 1) THEN
        DO i = 1 TO &NOBS;
            SET WORK.CLEANING_RULES;
            rule_nums{i} = PRXPARSE(CATS('s/', rule_string_match, '/', rule_string_replace, '/i'));
        END;

    /* Apply every compiled rule to each incoming customer name */
    SET WORK.CUST_NAMES;
    customer_name_clean = customer_name;
    DO i = 1 TO &NOBS;
        customer_name_clean = PRXCHANGE(rule_nums{i}, 1, customer_name_clean);
    END;
RUN;
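
(For completeness: &NOBS holds the number of rules. If it isn't already supplied by the job, one way it could be populated beforehand is sketched below; only the macro variable name is taken from the step above.)

    /* Count the cleaning rules into &NOBS before the cleaning step runs. */
    PROC SQL NOPRINT;
        SELECT COUNT(*) INTO :NOBS TRIMMED
        FROM WORK.CLEANING_RULES;
    QUIT;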

When I run this on ~10K rows or fewer, it always completes and finishes extremely quickly. If I try ~15K rows, it runs for a very long time and eventually throws an "Out of memory" error.

To try and deal with this I built a loop (using the SAS DIS loop transformation) wherein I number the rows of my dataset first, then apply the preceding logic in batches of 10000 names at a time. After a very long time I got the same out of memory error, but when I checked my target table (Teradata) I noticed that it ran and loaded the data for all but the last iteration. When I switched the loop size from 10000 to 1000 I saw exactly the same behaviour.
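
(Purely as a hypothetical sketch of what one iteration's input filter looks like: row_num and the &BATCH_START / &BATCH_END macro variables are assumptions standing in for whatever the DIS loop transformation actually supplies.)

    /* One batch: keep only the rows whose sequence number falls in this iteration's range. */
    DATA WORK.CUST_NAMES_BATCH;
        SET WORK.CUST_NAMES_NUMBERED;
        WHERE row_num BETWEEN &BATCH_START AND &BATCH_END;
    RUN;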

For testing purposes I've been working with only ~500K rows, but I will soon have to handle millions and am worried about how this is going to work. For reference, the set of cleaning rules I'm applying is currently 20 rows but may grow to a few hundred.

  • Is it significantly less efficient to use a file of rules rather than hardcoding the regular expressions directly in my data step?
  • Is there any way to achieve this without having to loop?
  • Since my dataset gets overwritten on every loop iteration, how can there be an out of memory error for datasets that are 1000 rows long (and like 3 columns)?
  • Ultimately, how do I solve this out of memory error?

Thanks!

Rookatu
  • You're using a temporary array and loading a data set into memory. You must have a very low limit if 20 rows are causing issues. Can you preprocess the data (PRXPARSE) and only load the data in this step, and make sure to limit the length of the string so it's not storing it with a 200 length or something. – Reeza Dec 29 '16 at 21:35
  • Did you partition your list of names to clean or the list of rules to apply? It would seem to me that the list of rules is what is going to take memory. – Tom Dec 29 '16 at 21:43
  • @Tom the list of rules is only 20 rows, should it really take a lot of memory? Not sure what you mean by "partition"; I did break the dataset of names to be cleaned (originally ~500K) into partitions of size 1000 or 10000. Neither worked. Is this what you mean? – Rookatu Dec 29 '16 at 22:06
  • Check your settings, it may not be appropriate for what you're working with. To run out of memory with the specs you've mentioned doesn't make sense. Are you perhaps working in a test env that is underpowered compared to a production environment? 20 rows and 1 million should be a trivial process in a data step. – Reeza Dec 30 '16 at 01:02
  • Is it just the LOG that is too large? Can you run with a small enough set of rules or data so that you can retrieve the log and see why it is generating such a large log file? – Tom Jan 01 '17 at 18:17

1 Answer


The issue turned out to be that the log that the job was generating was too large. The possible solutions are to disable logging or to redirect the log to a location which can be periodically purged and/or has enough space.
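
A minimal sketch of both options (the log path is a placeholder; point it at a location with enough space):

    /* Option 1: cut down how much the step writes to the log. */
    OPTIONS NONOTES NOSOURCE NOSOURCE2;

    /* Option 2: redirect the log to a file that can be purged periodically. */
    PROC PRINTTO LOG='/sasdata/logs/clean_customer_names.log' NEW;
    RUN;

    /* ... run the cleaning job here ... */

    /* Restore normal logging afterwards. */
    PROC PRINTTO;
    RUN;
    OPTIONS NOTES SOURCE SOURCE2;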

Rookatu