0

I have a SQL Group By statement where I want to find the distinct substationcode and substationname with record count.

With the correct Group By, I should be able to see records that have count for distinct substationcode + substationname combination.
For example:

Source table:
substationcode substationname
ANDY           SUB:ANDY LAU
ANDY           SUB:CONS ANDY LAU
ACHM           SUB:ACHM
MIA            SUB:MIA LEONG
JON            SUB:JON LEE

Here are my codes:

proc sql;
create table twolayers as
select substationcode
,
substationname
,count(substationname) as cnt
from onlyscadadomsdistinct
group by substationcode, substationname
having cnt >1;
quit;

The result that I am hoping to get is that Andy will have cnt = 2. However, I see that ACHM has record cnt of 4. I don't get it. Which part of my group by statement is wrong?

I then filter substationcode "ACHM" to see the distinct substationname of "ACHM".
Only 1 record found which is SUB:ACHM

Where did ACHM CNT=4 comes from?

GSerg
  • 76,472
  • 17
  • 159
  • 346
B.Dick
  • 305
  • 2
  • 11
  • `I then filter substationcode "ACHM" to see the distinct substationname of "ACHM". Only 1 record found which is SUB:ACHM` - if you are querying `distinct substationname where substationcode = 'ACHM'`, then surely it will collapse all the `'SUB:ACHM'` records into one. Try removing the `distinct`. – GSerg Jan 07 '19 at 13:59
  • @GSerg The reason why i filter ACHM in where clause is because in the database, ACHM will have more than 1000 records. I want to see how many substationname it has. – B.Dick Jan 07 '19 at 14:09
  • Each combo of stationname and substationname in your example has a count of one (e.g. the combo "ANDY"/"SUB:ANDY LAU"). I suspect you have taken a shortcut when providing your sample data. Don't do that. It prevents people from understanding your use case. Provide data LITERALLY as it appears in your db. – Erwin Smout Jan 07 '19 at 14:15
  • @ErwinSmout i wish to post the data here but the db is having 80mil+ data. For substationcode ACHM specifically has more than 5000. I want to find how many different substationname it has. Is my group by correct if i were to find out how many different substationname each substationcode has? – B.Dick Jan 07 '19 at 14:31
  • 1
    @B.Dick No, it's not. It counts how many rows with non-null `substationname` each combination of `substationcode + substationname` has. Apparently you want `count(distinct substationname) as cnt`. – GSerg Jan 07 '19 at 14:36
  • your expectation is completely wrong as per the query you have written, Since you are grouping on stationcode and stationname your count will always be against the combination of these 2. As per the above sample set there will be no 2 against Andy. It will be Andy SUB:ANDY LAU =1 and Andy SUB:CONS ANDY LAU =1. Hope I clearead your doubt – Vishal Gupta Jan 07 '19 at 14:47
  • @GSerg I tried using count(distinct sustationame) as cnt However, this will make each and every record to have cnt = 1. – B.Dick Jan 07 '19 at 14:48
  • @GSerg and Vishal Instead, I am getting same substationcode but different substationname in different rows with cnt = 1. What if i want my result to show how many substationname does each substationcode has? – B.Dick Jan 07 '19 at 14:50
  • 1
    @B.Dick Of course it gives 1 for each pair because you have grouped by `substationcode` and `substationname`. Only group by `substationcode` and query `count(distinct substationname)`. – GSerg Jan 07 '19 at 14:57

1 Answers1

0

you should select and group by substationcode only as follows:

select substationcode

,count(substationcode) as cnt from onlyscadadomsdistinct group by substationcode having cnt >1;