how can i create a variable depends on other observations

Question

I have a college data, problem is finding if the same id, same course1 and course2 have next term or not. I have the table as below first 4 variables: id, term, course1 and course2. I am trying to create 5th variable 'nextterm'. Terms are like this: 201010-201020-201030-201110-201120-201130-201210,... so on. So id=21, course1 and course 2 are same MAT 51 for 201010 AND 201020. So 201010 is going to be Yes, 201020 is going to be No.

  id    term    course1    course2                nextterm
  21    201010  MAT        41                        No
  21    201010  MAT        51                        Yes
  21    201020  MAT        51                        No
  21    201020  SPC        13                        No
  29    201130  pos        94                        Yes
  29    201210  pos        94                        No

Haikuo Bian · Accepted Answer · 2015-06-19T14:49:08.150

0

Ok, this is the updated answer, it requires some understanding of SQL at certain level, it works if your data volume is not too large. Please note, I have converted the term into number in the incoming dataset, but it can also be done on the fly. If you feel it runs too slow, then I suggest 1. resort your data by decending term within id, 2. DOW or Hash should be sufficient.

    data have;
    input id    term    (course1    course2          ) (:$8.);
    cards;
  21    201010  MAT        41                  
  21    201010  MAT        51                  
  21    201020  MAT        51                  
  21    201020  SPC        13                  
  29    201120  pos        94                  
  29    201130  pos        94                  
;

   PROC SQL;
CREATE TABLE WANT AS
SELECT *, CASE WHEN EXISTS(SELECT * FROM HAVE WHERE ID=A.ID AND 
(INT(TERM/100) = INT(A.TERM/100) AND MOD(TERM,100)=MOD(A.TERM,100)+10 
 OR INT(TERM/100) = INT(A.TERM/100)+1 AND MOD(TERM,100)=MOD(A.TERM,100)-20) 
 AND CATS(A.COURSE1,A.COURSE2) = CATS(COURSE1,COURSE2)) 
                  THEN 'Yes' ELSE 'No' END AS NEXTTERM
FROM HAVE A;
QUIT;

Here is a better solution. It sets up a Hash object in the first DOW to store all of the term, course information within the same ID, then in the second DOW to check if your condition meets. It does not require a sort if all of IDs stay together (clustered). For details, please refer to SAS Hash docs.

data have;
    input (id term course1 course2) (:$8.);
    cards;
21    201010  MAT        41                  
21    201010  MAT        51                  
21    201020  SPC        13 
21    201030  MAT        51 
21    201030  SPC        13 
29    201120  pos        94                  
29    201130  pos        94                  
;
run;



   data want;
if _n_=1 then do;
    dcl hash h();
    h.definekey('term','course1','course2');
    h.definedone();
end;
    do until (last.id);
        set have;
        by id notsorted;
        rc=h.add();
    end;

    length nextterm $3;

    do until (last.id);
        set have;
        by id notsorted;

        if h.check(key:cats(substr(term,1,4),input(substr(term,5),2.)+10),key:course1, key:course2) = 0 or 
            h.check(key:cats(input(substr(term,1,4),4.)+1,input(substr(term,5),2.)-20),key:course1, key:course2) =0 then
            nextterm='Yes';
        else nextterm = 'No';
        output;
    end;
    h.clear();
run;

edited Jun 19 '15 at 14:49

answered Jun 18 '15 at 15:25

Haikuo Bian

906
6
7

That's mostly right. But maybe I didn't mention right. The term line is like 201120-201130-201210. When the data pass from 201130 to 201210 it doesn't show that? How can I show that? Thanks for respond. I didn't know its look-ahead technique, couldn't find a name for call it. – joys Jun 18 '15 at 15:40
My apology. I didn't read it right, but then it brings more questions from my side: 1. How is your term numbers constructed? is it always 4 digits of year plus a number increased by 10? such as 201120-201130, and how many terms you have each year? 2. How does your data look like. Do you have occasionally 201110 jumping to 201130, meaning the student skips the whole term? And if that happens, do you consider 201130 to be the next term for 201110? Answers to those questions are critical to determine programming approaches. – Haikuo Bian Jun 18 '15 at 16:21
Each year has 3 terms. year + 10 or 20 or 30. So as following, 201010- 201020-201030-201110-201120-201130-201210-201220-,....so on. All terms are like this. No jumping, 201110 then 201120 – joys Jun 18 '15 at 17:13
Sorry but still not working. If there is two terms like 201030 then 201110, doesn't say 'Yes' – joys Jun 18 '15 at 18:03
I updated variables in the question. You can see the id = 29. The first one should be 'Yes'. Because there is next term with same id and courses. – joys Jun 18 '15 at 18:26
Can i put this (input id term (course1 course2 ) (:$8.);) part in to the proc sql? – joys Jun 19 '15 at 12:19
No, you can't. Are you looking for a way to convert your var type? You need to reveal more about your data. To start, what is the type of variable 'term'. Is it number or char? – Haikuo Bian Jun 19 '15 at 12:33
Also, I would suggest in the future your incoming data is presented as the way in the data step, so no guessing or mind reading is needed, and time will be spent more efficiently on your core issue. – Haikuo Bian Jun 19 '15 at 12:36
When I run it give me some errors: ERROR: Expression using subtraction (-) requires numeric types. – joys Jun 19 '15 at 12:48
Meaning your variable 'term' is character. I Could keep working on the proc sql approach, but it is getting difficult to read. Stay tuned, I am working on a Hash approach for you. – Haikuo Bian Jun 19 '15 at 13:11
Hash solution has been added. Please let me know how it works for you. – Haikuo Bian Jun 19 '15 at 13:58
No didn't work.ERROR: Hash object added 0 items when memory failure occurred. FATAL: Insufficient memory to execute DATA step program. Aborted during the EXECUTION phase. ERROR: The SAS System stopped processing this step because of insufficient memory. – joys Jun 19 '15 at 14:10
Not sure what happened, the code supposed to reconstruct a Hash object and release memory for every ID. I have the code updated to clear the memory explicitily. Run it again and let me know. Unless you have a very small memory and an ID with a very large group of obs, I can't see why it happened. What is your SAS version? – Haikuo Bian Jun 19 '15 at 14:52

score 0 · Answer 2 · answered Jun 18 '15 at 20:03

0

Here's another approach you can use - sort the dataset into reverse order, check to see whether each course carries over between terms, then sort it back into the original order:

data have;
    input id term (course1 course2) (:$8.);
    cards;
21    201010  MAT        41                  
21    201010  MAT        51                  
21    201020  MAT        51                  
21    201020  SPC        13                  
29    201120  pos        94                  
29    201130  pos        94                  
;
run;

proc sort data = have;
    by id course1 course2 descending term;
run;

data want;
    set have;
    by id course1 course2;
    length nextterm $3;
    nextterm = ifc(first.id or first.course2, 'No','Yes');
run;

proc sort data = want;
    by id term course1 course2;
run;

answered Jun 18 '15 at 20:03

user667489

9,501
2
24
35

it is a slick approach, however, it wouldn't work on the same courses that are skipping terms. 21 201010 MAT 51 21 201020 SPC 13 21 201030 MAT 51 – Haikuo Bian Jun 18 '15 at 21:50
For a more robust approach, I would suggest to setup an array or Hash object where you can temporally store the course information, then do a look up to confirm. – Haikuo Bian Jun 18 '15 at 21:55
The asker mentioned that they never expected to see skipping in one of their comments on your answer, so that shouldn't be a problem. – user667489 Jun 18 '15 at 22:19
The asker did confirm no student aka Id would skip a term, not for course. – Haikuo Bian Jun 18 '15 at 23:22
If indeed no courses can be skipped, then a simpler - not sorted approach may be available. Ultimately it is up to the OP. And I agree with the first commentator that OP's post is left much to improve, such as trying to exhaust all of the scenarios in the sample data. – Haikuo Bian Jun 18 '15 at 23:31

how can i create a variable depends on other observations

2 Answers2