2

I would like to identify the max number of consecutive consonants and vowels in an e-mail address, using SAS SQL (proc sql). The output should look like the one below in columns Max of consecutive consonants and max of consecutive vowels (I listed characters in first row for illustrative purposes only).

A few things to note:

  • treat special and numeric characters as a count terminator (e.g. 3rd email is a good example where you've got 3 consonants (hf) then numbers (98) and then again 2 consonants (jl). The output should be just 2 (hf).

  • I am only interested in the first part of the email (before @).

How do I achieve this, dear community?

E-mail                        Max of consecutive consonants Max of consecutive vowels
asifhajhtysiofh@gmail.com     5 (jhtys)                     2 (io)
chris.nashfield@hotmail.com   3                             2 
ahf98jla@gmail.com            2                             1 
Blake
  • 21
  • 3
  • Hi blake, try to do it and come back at least with a first attempt. Just to put you on the right path: with a data step it's easier than with a proc sql, use the scan() function to extract the part you want (before the @ or before numbers and special characters), length() to measure the number of characters, substr() to extract a character at a time and put it inside a do loop to handle a character at a time and compare it with the previous one. Good luck! :) – DaBigNikoladze Jun 23 '16 at 15:59

3 Answers3

4

There is a routine called prxnext that proves very handy here.

Generate sample data

data emails;
  input email $32.;
  datalines;
asifhajhtysiofh@gmail.com
chris.nashfield@hotmail.com
ahf98jla@gmail.com
;

Do the counting

data checkEmails(keep = email maxCons maxVow);
  set emails;

  * Consonants;
  re = prxparse("/[bcdfghjklmnpqrstvwxyz]+/");
  start = 1;
  stop = index(email,"@");
  do until (pos = 0);
    call prxnext(re,start,stop,email,pos,len);
    maxCons = max(maxCons, len);
  end;

  * Vowels;
  re = prxparse("/[aeiouy]+/");
  start = 1;
  stop = index(email,"@");
  do until (pos = 0);
    call prxnext(re,start,stop,email,pos,len);
    maxVow = max(maxVow, len);
  end;
run;

Results

Email                    MaxCons  MaxVow   
asifhajhtysiofh@gmail.com      5       2
chris.nashfield@hotmail.com    3       2
ahf98jla@gmail.com             2       1
Dominic Comtois
  • 10,230
  • 1
  • 39
  • 61
0

This was much trickier than I expected it to be, but I have a solution using macro loops that roughly follows the logic in @DaBigNikoladze's comment:

data temp;
   input email $40.;
   datalines;
    asifhajhtysiofh@gmail.com
    chris.nashfield@hotmail.com
    ahf98jla@gmail.com
    ;
run;

proc sql noprint;
    select max(length(email)) into: max_email_length from temp;
quit;

%let vowels = "a" "e" "i" "o" "u";
%let consonants = "q" "w" "r" "t" "y" "p" "s" "d" "f" "g" "h" "j" "k" "l" "z" "x" "c" "v" "b" "n" "m";

%macro counter;

data temp_count;
    set temp;

/*  limit email to just the part before the @*/
    email_short = substrn(email, 0, find(email, "@"));
    email_vowels_only = email_short;
    email_consonants_only = email_short;

/*  keep only the vowels or consonants, respectively*/
    %do i = 1 %to &max_email_length.;
        if substr(email_vowels_only, &i., 1) notin(&vowels.) then substr(email_vowels_only, &i., 1) = " ";
        if substr(email_consonants_only, &i., 1) notin(&consonants.) then substr(email_consonants_only, &i., 1) = " ";
    %end;
run;

/*  determine the max number of strings we have to scan through*/
proc sql noprint;
    select max(max(countw(email_vowels_only)), max(countw(email_consonants_only))) into: loops from temp_count;
quit;

/*  separate each string out into its own variable, find the max length of those variables, and drop those variables*/
proc sql;
    create table temp_count_expand (drop = vowel_word: consonant_word:) as select
        *,
        %do j = 1 %to &loops.; scan(email_vowels_only, &j.) as vowel_word&j., %end;
        %do k = 1 %to &loops.; scan(email_consonants_only, &k.) as consonant_word&k., %end;
        max(%do j = 1 %to &loops.; length(calculated vowel_word&j.), %end; .) as max_vowels,
        max(%do k = 1 %to &loops.; length(calculated consonant_word&k.), %end; .) as max_consonants
        from temp_count;
quit;

%mend counter;

%counter;
Sean
  • 1,120
  • 1
  • 8
  • 14
0

I'm not sure why you specify proc sql for this task. A data step is much more suitable as you can loop through the email, treating everything that is either a non-consonant or a non-vowel as a delimiter. I've used a regular expression (prxchange) to remove the @ portion of the email, although substr works just as well.

data have;
input Email $50.;
datalines;
asifhajhtysiofh@gmail.com
chris.nashfield@hotmail.com
ahf98jla@gmail.com
;
run;

data want;
set have;
length _w1 _w2 $50;
_short_email=prxchange('s/@.+//',-1,email); /* remove everything from @ onwards */
do _i = 1 by 1 until (_w1=''); /* loop through email, using everything other than consonants as the delimiter */
    _w1 = scan(_short_email,_i,'bcdfghjklmnpqrstvwxyz','ki');
    consonant = max(consonant,ifn(missing(_w1),0,length(_w1))); /* keep longest value */
end;
do _j = 1 by 1 until (_w2=''); /* loop through email, using everything other than vowels as the delimiter */
    _w2 = scan(_short_email,_j,'aeiou','ki');
    vowel = max(vowel,ifn(missing(_w2),0,length(_w2))); /* keep longest value */
end;
drop _: ; /* drop temprorary variables */
run;
Longfish
  • 7,582
  • 13
  • 19