I'm working with client data in SAS that contains sensitive customer identification information. The challenge is to mask the field in such a way that it remains numeric/alphabetic/alphanumeric. I found a way using the bitwise functions in SAS (BXOR, BOR, BAND), but the output is full of special characters that SAS can't handle, sort, merge, etc.

I also thought of scrambling the field itself, based on a key, but haven't been able to see it through. Following are the challenges:

1) It HAS to be key-based.
2) It HAS to be reversible.
3) The masked/scrambled field has to be numeric/alphabetic/alphanumeric only so it can be used in SAS.
4) The field to be masked contains both letters and numbers, has varying lengths, and spans millions of observations.

Any tips on how to achieve this masking/scrambling would be greatly appreciated :(

Hadi
Karan
  • How about using a lookup table or a format? – Robert Penridge May 16 '13 at 13:53
  • What does "numeric/alphabetic/alphanumeric only" mean? What else could it be?? Or are you saying it has to retain the same type [so a numeric variable has to stay numeric and a character variable has to stay character]? – Joe May 16 '13 at 13:55
  • Also, what sort of security are we looking for? Is this "keep someone from reading the data on the screen" secure, or "bank records that someone might try to hack" secure? – Joe May 16 '13 at 14:26
  • Rob - The client rejected a single lookup, saying that cipher is easy to break if someone wanted to. I finally developed code that uses multiple lookups, which hopefully will suffice. Joe - The output should not have any special characters. I tried converting the input to ASCII and then to binary, performing operations on it, and converting it back, but the output was too garbled for SAS to read. The variable is a unique identifier of a customer, so the client wants to hide the value. – Karan May 28 '13 at 07:26

2 Answers

Here is a simple key-based solution. I present the data step solution first and the FCMP version further down. I keep everything in the range of 48 to 127 (digits, letters, and common characters such as @ > < etc.); that's not quite alphanumeric, but I can't imagine why it would matter in this case. You could reduce it further to only truly alphanumeric characters using this same method, but it would make the key much weaker (only 62 values) and be clunky to work with (as you have 3 noncontiguous ranges).

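/* Build a 1500-character key from the seed; each character is in ASCII 48-127. */
data construct_key;
length keystr $1500;
do _t = 1 to 1500;
  _rannum = ceil(ranuni(7)*80);   /* random integer from 1 to 80 */
  *if _rannum=12 then _rannum=-15;
  substr(keystr,_t,1)=byte(47+_rannum);
end;
call symput('keystr',keystr);   /* store the key in a macro variable */
run;
%put %bquote(&keystr);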



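data encrypted;
set sashelp.class;
retain key "&keystr";
length name_encrypt $30;
do _t = 1 to length(name);
  /* add the offsets of the data and key characters, wrapping modulo 80 */
  substr(name_encrypt,_t,1) = byte(mod(rank(substr(name,_t,1)) + rank(substr(key,1,1))-94,80)+47);
  key = substr(key,2);   /* consume one key character per data character */
end;
keep name:;
run;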

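data unencrypted;
set encrypted;
retain key "&keystr";
length name_unenc $30;
do _t = 1 to length(name_encrypt);
  /* subtract the key character again; adding 80 keeps the MOD argument non-negative */
  substr(name_unenc,_t,1) = byte(mod(80+rank(substr(name_encrypt,_t,1)) - rank(substr(key,1,1)),80)+47);
  key = substr(key,2);   /* advance through the key exactly as the encryption step did */
end;
run;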

In this solution, there is a medium level of encryption: a key with 80 possible values per character is not strong enough to deter a truly sophisticated hacker, but is strong enough for most purposes. To unencrypt, you need to pass either the key itself or the seed used to generate it; if you use this multiple times, make sure to pick a new seed each time (and not something related to the data). If you seed with zero (or any nonpositive integer), you will effectively guarantee a new key each time, but you will then have to pass the key itself rather than the seed, which may present some data security issues (obviously, the key itself can be obtained by a malicious user, and would have to be stored in a different location than the data). Passing the key by way of the seed is probably better, as you could pass that verbally over the telephone or through some sort of prearranged list of seeds.
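As a rough sketch of the alphanumeric-only reduction mentioned earlier, the same shift-and-wrap idea can be applied to a 62-character alphabet by looking up positions with INDEX instead of RANK. The macro variable names (alphabet, alnum_key), the 16-character key length, the seed, and the dataset names masked/unmasked are assumptions made for illustration, not part of the original solution. Unlike the code above, this sketch restarts the key for every value, so equal inputs always mask to the same output; that helps with sorting and merging, but it also makes the scheme easier to analyse.

```sas
/* Assumed 62-character alphabet: digits, upper case, lower case. */
%let alphabet = 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz;

/* Build a 16-character key drawn only from the alphabet (seed 7 is arbitrary). */
data _null_;
  length keystr $16;
  keystr = ' ';
  do _t = 1 to 16;
    substr(keystr, _t, 1) = substr("&alphabet", ceil(ranuni(7) * 62), 1);
  end;
  call symputx('alnum_key', keystr);
run;

/* Mask: shift each character's alphabet position by the key character's position. */
data masked;
  set sashelp.class;
  length name_masked $8;
  name_masked = ' ';
  do _t = 1 to length(name);
    _p = index("&alphabet", substr(name, _t, 1));
    _k = index("&alphabet", substr("&alnum_key", mod(_t - 1, 16) + 1, 1));
    if _p > 0 then
      substr(name_masked, _t, 1) = substr("&alphabet", mod(_p - 1 + _k, 62) + 1, 1);
    else
      substr(name_masked, _t, 1) = substr(name, _t, 1);  /* not in alphabet: pass through */
  end;
  drop _:;
run;

/* Unmask: subtract the same shift; adding 62 keeps the MOD argument non-negative. */
data unmasked;
  set masked;
  length name_restored $8;
  name_restored = ' ';
  do _t = 1 to length(name_masked);
    _q = index("&alphabet", substr(name_masked, _t, 1));
    _k = index("&alphabet", substr("&alnum_key", mod(_t - 1, 16) + 1, 1));
    if _q > 0 then
      substr(name_restored, _t, 1) = substr("&alphabet", mod(_q - 1 - _k + 62, 62) + 1, 1);
    else
      substr(name_restored, _t, 1) = substr(name_masked, _t, 1);
  end;
  drop _:;
run;
```

Characters outside the alphabet are left unchanged here, which keeps the transformation reversible but leaks them as-is; whether that is acceptable depends on the data.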

I'm not sure I recommend this sort of approach in general; a superior approach may well be to simply encrypt the entire SAS dataset using a superior encryption method (PGP, for example). Your exact solution may vary, but if you have for example some customer information that isn't actually necessary for most steps of your process, you may be better off separating that information from the rest of the (non-sensitive) data and only incorporating that when it's needed.

For example, I have a process whereby I pull sample for a client for a healthcare survey. I select valid records from a dataset that has no information for the customer except a numeric unique identifier; once I have narrowed the sample down to the valid records, then I attach the customer information from a separate dataset and create the mailing files (which are stored in an encrypted directory). That keeps the data nonsensitive for as long as possible. It's not perfect - the unique numeric identifier still means there is a tie back, even if it's not to anything someone would know outside of the project - but it keeps things safe as long as possible on our end.
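The separation described above might look roughly like the following, using sashelp.class as a stand-in in which name plays the role of the sensitive field. The dataset names (crosswalk, analysis, selected, mailing), the surrogate_id variable, and the age filter are all hypothetical; in practice the crosswalk would live in a restricted or encrypted library rather than in WORK.

```sas
/* Split the source into a crosswalk (sensitive) and an analysis table (non-sensitive). */
data work.crosswalk (keep=surrogate_id name)
     work.analysis  (drop=name);
  set sashelp.class;
  surrogate_id = _n_;   /* numeric identifier with no meaning outside the project */
run;

/* Do the selection work on the non-sensitive table only. */
data work.selected;
  set work.analysis;
  where age >= 14;      /* placeholder for the real validity rules */
run;

/* Re-attach the sensitive field only at the final step. */
proc sql;
  create table work.mailing as
  select c.name, s.*
  from work.selected as s
       inner join work.crosswalk as c
       on s.surrogate_id = c.surrogate_id;
quit;
```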

Here is the FCMP version:

%let keylength=5;
%let seed=15;

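proc fcmp outlib=work.funcs.test;
/* Encrypt VALUE in place; KEY is rotated by one character per data character. */
subroutine encrypt(value $,key $);
  length key $&keylength.;
  outargs value,key;
  do _t = 1 to lengthc(value);
    /* add the offsets of the data and key characters, wrapping modulo 96 */
    substr(value,_t,1) = byte(mod(rank(substr(value,_t,1)) + rank(substr(key,1,1))-62,96)+31);
    key = substr(key,2)||substr(key,1,1);
  end;
endsub;

/* Reverse of encrypt; must start from the same key state. */
subroutine unencrypt(value $,key $);
  length key $&keylength.;
  outargs value,key;
  do _t = 1 to lengthc(value);
    /* subtract the key character; adding 96 keeps the MOD argument non-negative */
    substr(value,_t,1) = byte(mod(96+rank(substr(value,_t,1)) - rank(substr(key,1,1)),96)+31);
    key = substr(key,2)||substr(key,1,1);
  end;
endsub;

/* Generate a key of &keylength. characters (ASCII 48-127) from the seed. */
subroutine gen_key(seed,keystr $);
  outargs keystr;
  length keystr $&keylength.;
  do _t = 1 to &keylength.;
    _rannum = ceil(ranuni(seed)*80);
    substr(keystr,_t,1)=byte(47+_rannum);
  end;
endsub;
quit;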

options cmplib=work.funcs;



data encrypted;
set sashelp.class;
length key $&keylength.;
retain key ' '; *the missing is to avoid the uninitialized variable warning;
if _n_ = 1 then call gen_key(&seed,key);
call encrypt(name,key);
drop key;
run;

data unencrypted;
set encrypted;
length key $&keylength.;
retain key ' ';
if _n_ = 1 then call gen_key(&seed,key);
call unencrypt(name,key);
run;

This is somewhat more robust; it allows characters from 32 to 127 rather than from 48, meaning it handles spaces successfully. (A tab will still not decode properly: ASCII 9 falls outside the range and would come back as an 'i'.) You pass the seed to call gen_key, and it then uses the resulting key for the remainder of the process.

It goes without saying that this is not guaranteed to function for your purposes and/or to be a secure solution and you should consult with a security professional if you have substantial security needs. This post is not warranted for any purpose and any and all liability arising from its use is disclaimed by the poster.

Joe

SAS have an article on their website on how to encrypt specific variables. Hopefully this will help you.

link

Longfish
  • This is what I finally used :) – Karan May 28 '13 at 07:29
  • The link is dead now. Please try to give the most important information in the answer itself so that users in the future may not be frustrated because the link rotted away. – Secespitus Sep 29 '17 at 08:33