0

I'm receiving more ID's then possible after performing a left outer join statement on two tables.

Context: I've narrowed down two tables from larger data sets. Below are the queries found I've used to obtain IDs possible after selecting from the larger data set. I am trying to JOIN two data sets together based upon the personID. I am preforming a LEFT JOIN because I want to retain all data in the smallerset table while pairing it with data from narrowed data table.

Upon running this query on a subset of a larger data set I receive the following IDs:

select s.personID from st.smallerset as s;

41486
41490
41493
41496
 ...

Now, upon running the second query, I receive the following:

select r.* from st.secondlargedataset as r,
st.condition as h where r.personID=h.personID group by r.personID;

I receive the following queries:

41544
41577
41595
41605
 ...

So far, so good. Everything is as expected. However, when I run this next subquery as I try to join these two subsets I get unexpected results.

select s.personID
from st.smallerset as s left join
(select r.* from st.secondlargedataset as r,
 st.condition as h where r.personID=h.personID group by r.personID) as v on 
s.personID=v.personID;

41485
41486
41489
41490
41493
41496
41510
41512
41513
41516

Included in this result is data that is not include in either prior query. Specifically IDs 41485, 41489, 41496 et cetera. I'm clueless as to why these IDs are appearing since when I break the queries into separate queries I get the IDs I need.

Andriy M
  • 76,112
  • 17
  • 94
  • 154
coatless
  • 20,011
  • 13
  • 69
  • 84
  • Why are you mixing join styles? The subquery is using the lazy style... – Marc B Apr 20 '12 at 02:55
  • st is a sas libname reference Forgive me, I'm learning PROC SQL and I've broken up what I need to do into individual queries that when combined via subqueries I *THOUGHT* should yield the right result. h.personID is a separate data table that has a list of people that need to be found in secondlargedataset. – coatless Apr 20 '12 at 03:24
  • There is no reasonable way we can help solve this problem without sample data..from both data sets to be joined – Jay Corbett Apr 20 '12 at 13:54
  • 2
    Are you positive the IDs 41485, 41489, 41496 were absent from the previous result sets? None of your queries posted here contains an `ORDER BY` clause, so the "mystical" values could be e.g. somewhere at the end in one of the earlier result sets, you may have overlooked them. – Andriy M Apr 20 '12 at 18:12

2 Answers2

1

I think that your problem may be with this query:

select r.*
from st.secondlargedataset as r, st.condition as h
where r.personID=h.personID
group by r.personID;

The group by statement would be illegal in most SQL implementation, but SAS allows it (even though it's meaningless), and produces unpredictable results.

Try removing it.

stevepastelan
  • 1,264
  • 7
  • 11
  • 1
    You can also try shortening it further to `select r.personID from st.secondlargedataset as r, st.condition as h where r.personID=h.personID`, since it looks like you're discarding the rest of the fields anyway in your main select. – stevepastelan Apr 20 '12 at 14:34
0

try the following

SELECT s.personID FROM st.smallerset AS s WHERE s.personID IN ( SELECT r.personID FROM st.secondlargedataset AS r, st.condition AS h WHERE r.personID = h.personID GROUP BY r.personID )

OR

SELECT s.personID FROM st.smallerset AS s WHERE s.personID IN ( SELECT r.personID FROM st.secondlargedataset AS r INNER JOIN st.condition AS h ON r.personID = h.personID GROUP BY r.personID )

SarjanWebDev
  • 523
  • 3
  • 11
  • This does not yield the expected results since the table of st.smallerset is not kept in tack. In fact, there are less observations than the initial query I constructed. I'm trying something else. – coatless Apr 20 '12 at 04:03
  • The result of that query is: 41577 41595 41605 41607 41636 41666 41672 41727 41755 – coatless Apr 20 '12 at 05:15