0

I'm working in SAS with an oracle table with a hierarchical codelist column, a COUNT column and a hierarchy indicator column and need to update COUNT column for the parent nodes based on the count of the childs under the node.

ID   level  code  count
1    1      1     (null)
2    2      15    (null)
3    3      156   1
4    2      18    (null)
5    3      181   1
6    3      182   1
7    1      2     (null)
8    2      20    (null)
9    3      205   1
10   3      206   1

NOTE:

  • the parent nodes should not be counted, only the lowest childs (SO code=1 is count=3, code=2 is count=2)

  • the "lowest" child nodes have a preset count of one, I need to update the NULL values.

  • the codelist is 1 to 10 characters long. So 9 hierarchical levels of parents.
  • I would prefer a PROC SQL workaround

I tried to create a temporary SAS dataset and UPDATE - SET = (SELECT COUNT) WHERE with a SUBSTR function in the select, but couldn't figure out how to make it work for every hierarchy level.

Martin
  • 27
  • 1
  • 7
  • Mention the desired output. – G.Arima Apr 26 '17 at 12:09
  • Hello. Thanks for replying. The desired output is the same table with updated values in the "count" column. The (null) values updated to the real counts in the table. ID=1 count=3, ID=2 count =1, ID=4 count=2 and so on. – Martin Apr 26 '17 at 12:33
  • I do not see how you know which node is the parent and which is the child. Perhaps by matching on the prefix of the CODE value? Is CODE a number or a character variable? – Tom Apr 26 '17 at 12:55
  • If you update the null values how will you then know which are the terminal nodes? – Tom Apr 26 '17 at 13:02
  • @Tom Exactly. Code is a character variable. And every elements parent node is the element with the length of (n-1) and all characters matching. code="1" is parent to code="11" and code="12" etc. while code="12" is parent to code="121" and code="122". But this is not a typical hierarchy, because the parents with the length of 1 to 9 should not be counted as childs in the count column. The real childs are only with the length of 10. So matematically the SUM of counts in one level = SUM of counts in other levels. Not sure if I explained it clearly though :/ – Martin Apr 26 '17 at 13:03

1 Answers1

0

Just do a self join. This example uses length of 3 as a terminal node, but if your real data uses 10 then just change the 3's to 10's.

proc sql ;
 create table want as 
 select a.id,a.level,a.code
      , (a.level=3) as Terminal_Node
      , sum(b.level=3) as Count_Terminal_Nodes
 from have a full join have b
 on a.code = substr(b.code,1,a.level)
 group by a.id,a.level,a.code
 order by 1
;
quit;

enter image description here

Tom
  • 47,574
  • 2
  • 16
  • 29
  • Great solution. Thank you! – Martin Apr 26 '17 at 14:11
  • If you are pushing into external database you might want to convert the boolean expressions into CASE statement to make them portable. – Tom Apr 26 '17 at 14:18
  • Thanks, but unfortunately I do not have privileges to create directly in the DB, so I will just do all the processing in EG and then do an update on the DB table. – Martin Apr 27 '17 at 09:24