3

I'm trying to find a biological "find core genome of a given set of organisms" problem. Essentially, given a list of organisms, find all genes that are common to them. To abstract away biology, you can think of find all colours that are favourite amongst a list of people (one person can have multiple favourite colours).

Database table would look like:

name | fav_colour
john | red
john | blue
john | green
jason | red
jason | blue
matt | red
matt | teal

User can specify a set of names like [john, jason] to get out [red, blue], or [john] to get [red, blue, green], or [john, jason, matt] to get [red].

I'm trying to solve this by doing n number of self joins where n is the number of names supplied.

Is there any way for me to do n number of self joins of tables to solve this problem for any number of names supplied? I tried to look for a way to do this via Postgres functions but can't figure out the n number of self joins part... any help or pointers towards the correct direction would be appreciated.

And no, unfortunately I can't change the schema to do these type of queries easier.

player87
  • 1,781
  • 1
  • 14
  • 21
  • 1
    The join approach is not what speaks to me....My first thought on this is to either use aggregation, counting how many organisms have each gene using group by and count(*) and then using having count = number of organisms. The other idea that comes to mind is stringing together queries for each organism and using INTERSECT to get the common elements. – joshp May 28 '17 at 02:07
  • *Exactly how* does a user "specify a set of names"? Can we assume they're in a table with one column, name? – philipxy Jun 04 '17 at 08:03
  • @philipxy I was hoping for a stored procedure that accepts an array of values which get parsed and counted. The count of items in array would be n which would be the number of self joins required. – player87 Jun 04 '17 at 20:33
  • Please edit that into your question. How else could it be answered? Also more about what you are trying to capture by "self-join". Eg another comment's "i.e. name columns are unique in the n tables". In relational algebra with renaming that's self-join *if the table rows are the same*, which they won't be, unless they're *constant hence trivial*. (Self-join just isn't relevant.) (See my edited answer.) Try to write an expression and/or example for what you are thinking of. PS WIth all the arrays, looping & counting you are apparently not thinking relationally. – philipxy Jun 04 '17 at 21:47
  • You want "relational division", dividing the name-colour table by a table of names. – philipxy Jun 10 '17 at 11:12

1 Answers1

4

I don't think you need self-joins for this. You can use aggregation and a HAVING clause:

with t(name, fav_colour) as (
      values ('john', 'red'),
             ('john', 'blue'),
             ('john', 'green'),
             ('jason', 'red'),
             ('jason', 'blue'),
             ('matt', 'red'),
             ('matt', 'teal')
     )
select fav_colour
from t
where name in ('john', 'jason')
group by fav_colour
having count(*) = 2;

The value "2" is the number of names in the list. Changing the IN list and the count are the only changes you need.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • How would this work if there are other columns as well? For example, find the favourite colour based on value in fav_colour column but also return age (two different ages in case of the John, Jason example) (if we can return age of one specific person that works too, i.e. return gene ID of a defined species). For my data, the common genes are found based on description column, but different species' genes have different IDs which can be used for further querying. – player87 May 28 '17 at 02:24
  • 2
    @player87 . . . If you have another question, then ask it as *another* question. This question is specifically about two columns, and changing the question could invalidate this answer and attract downvotes. But, if you just want ages from the matching rows, you can use `array_agg()`. – Gordon Linoff May 28 '17 at 02:26
  • Actually the array_agg() works for what I want to do. However, just a follow up to the original question, is it actually possible to do self join n times on a given condition (i.e. name columns are unique in the n tables)? – player87 May 28 '17 at 04:24
  • 1
    @player87 - Of course, formally you cannot do unknown number of self joins. In the cases you could need this, the solution would be to use a standard or custom aggregation function or a recursive query. See [this answer](https://stackoverflow.com/a/44283223/1995738) to see an example of such a *self-joining*. – klin Jun 10 '17 at 10:04