In a SAS data step, when you create a character variable you have to choose the right length in advance. The following data step returns a wrong result when var1='case2': var2 is truncated to 2 characters and equals 'ab', which is obviously not what we want. The same happens if var2='  ' is replaced with length var2 $2. This kind of procedure is quite prone to errors.

data b;
    set a;
    var2 = '  ';
    if var1 = 'case1' then var2='xy';
    if var1 = 'case2' then var2='abcdefg';
run;

I was unable to find a way to just define 'var2' as a character variable without having to care about its length (side note: if left unspecified, the length is 8). Do you know if it is possible?

If not, can you perhaps suggest a more robust workaround, something similar to an SQL "case", "decode", etc., to assign different values to a new string variable without suffering from this length issue?

Giuseppe
  • That's the default behaviour in SAS for any new character variable. You can assign the length ahead of usage as you've already noted. – Reeza Apr 16 '20 at 19:04
  • I don't understand what your issue is. SAS is flexible in not forcing you to define your variables, but then you have to live with the rules for how it defines them when you start referencing undefined variables. – Tom Apr 16 '20 at 19:41

1 Answer

SAS data step code is very flexible compared to most computer languages (and certainly compared to other languages created in the early 1970s) in that you are not forced to define variables before you start using them. The data step compiler waits to define a variable until it needs to. But like any computer program, it follows rules. When it cannot tell anything about the variable, it defines it as numeric. If it sees that the variable should be character, it chooses the length from the information available at the first reference.

So if the first place you use the variable in your code assigns it a string constant that is 2 bytes long, then the variable has a length of 2. If it is the result of a character function where the length is unknown, then the default length is 200. If the reference uses a format or informat, then the length is set to match the width of the format/informat. If there is no additional information, then the length is 8.
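
The rules above can be checked with a small test step. This is only a sketch and not part of the original answer: the variable names (s_const, s_func, s_fmt, s_input and the len_ variables) are made up for illustration, and VLENGTH is used to report the length the compiler assigned to each variable.

```sas
data _null_;
  /* List input with no other information: rule says length 8 */
  input s_input $;

  /* First reference is a 2-byte string constant: rule says length 2 */
  s_const = 'ab';

  /* Result of a character function (REPEAT): rule says length 200 */
  s_func = repeat('x', 3);

  /* First reference uses a format of width 5: rule says length 5 */
  s_fmt = put(123, 5.);

  /* VLENGTH returns the length fixed at compile time */
  len_const = vlength(s_const);
  len_func  = vlength(s_func);
  len_fmt   = vlength(s_fmt);
  len_input = vlength(s_input);

  /* Write the four lengths to the log */
  put len_const= len_func= len_fmt= len_input=;
datalines;
hello
;
```

Which functions fall under the 200-byte default depends on the function and the SAS release, so it is worth running a check like this on your own installation.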

You can also use PROC SQL code if you want. In that case the rules of ANSI SQL apply for how variable types are determined.
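
As an illustration of the PROC SQL route, here is a sketch using a CASE expression. The sample dataset a is invented for the example. To my understanding, PROC SQL sizes the result of a CASE expression to fit the longest branch, so var2 should not be cut to 2 characters here, but that is an assumption worth confirming with PROC CONTENTS on your own data.

```sas
/* Hypothetical sample input, only so the example is self-contained */
data a;
  length var1 $5;
  input var1 $;
datalines;
case1
case2
;

proc sql;
  create table b as
  select a.*,
         case
           when var1 = 'case1' then 'xy'
           when var1 = 'case2' then 'abcdefg'
           else ' '
         end as var2
  from a;
quit;

/* Inspect the length that was assigned to var2 */
proc contents data=b;
run;
```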

In your particular example the assignment of blanks to the variable is not needed, since all newly created variables are set to missing (all blanks in the case of character variables) when the data step iteration starts. Note that if VAR2 is not new (i.e., it is already defined in dataset A), then you cannot change its length at that point in the step anyway, because the SET statement has already defined it.

So just replace the assignment statement with a length statement.

data b;
  set a; 
  length var2 $20;
  if var1 = 'case1' then var2='ab';
  if var1 = 'case2' then var2='abcdefg';
run;
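
If the worry is silent truncation after later edits, one option is to check each value before storing it. The following is only a sketch under assumptions, not a built-in SAS feature: _new is a hypothetical scratch variable, $200 is an arbitrary "wide enough" length, and the sample dataset a is invented. It writes a line starting with ERROR: to the log when a value is longer than var2 can hold.

```sas
/* Hypothetical sample input, only so the example is self-contained */
data a;
  length var1 $5;
  input var1 $;
datalines;
case1
case2
;

data b;
  set a;
  length var2 $5 _new $200;   /* var2 deliberately too short to show the check */

  if var1 = 'case1' then _new = 'ab';
  else if var1 = 'case2' then _new = 'abcdefg';

  /* Compare the actual value length with the declared length of var2 */
  if lengthn(_new) > vlength(var2) then
    put 'ERROR: value too long for var2: ' _new=;

  var2 = _new;   /* still truncates, but the log now reports it */
  drop _new;
run;
```

The same check could be wrapped in a macro if it is needed for many variables; the point is only that the comparison makes the truncation visible in the log.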

SAS is not going to change the language at this point; they have too many users with existing code bases. Perhaps they will make a new language at some point in the future.

Tom
  • As I stated in the question, the "length var2 $##" does not solve the issue since I have to decide in advance the length - which I don't want to because it may be too short in following edits of the code and crop the var2 values. – Giuseppe Apr 17 '20 at 13:07
  • Your requested feature does not actually solve the problems caused by not defining your variables. It just punts it down the line to the next step in your program. – Tom Apr 17 '20 at 13:12
  • I am not following. If, for example, in SQL you create a table with a character variable, then you have to comply with the chosen length, and it throws an error if you try to insert a row which has a column with too many characters. SAS instead does not issue an error; it truncates the subsequent values of the variable. In 'case2', var2 will still be equal to 'ab' because SAS truncates it. This is an issue because I may inadvertently do it. It's human error that I want to prevent. – Giuseppe Apr 17 '20 at 14:55
  • Just saying that you are better off thinking ahead and defining your variables because not doing so will cause you more problems than just this trivial issue you are raising with the way that the data steps try to help you by defining variables for you. – Tom Apr 17 '20 at 14:59
  • Sure, but planning ahead is not sufficient when a program is shared with other people who may, possibly in haste, modify the program and not notice length constraints. I guess the only alternative is to choose a (wastefully) big length, which is unlikely to be exceeded. Ideally, I would have preferred SAS to either adjust the length based on the input, or to issue an error if it is too long. But truncating without a warning... just invites human mistakes. – Giuseppe Apr 17 '20 at 15:06