
I have a groups.txt file that contains ortholog groups, with the species and gene ID of every member of each group. It looks like this:

OG_117996: R_baltica_p|32476565 V_spinosum_v|497645257
OG_117997: R_baltica_p|32476942 S_pleomorpha_s|374317197
OG_117998: R_baltica_p|32477405 V_bacterium_v|198258541

I wrote a function that builds a list of every species in the whole file (66 in total), called listOfAllSpecies. I need a function that gives me all the groups containing 1 species from these 66, then all the groups containing 2 species from these 66, etc.

To simplify:

OG_1: A|1 A|3 B|1 C|2
OG_2: A|4 B|6
OG_3: C|8 B|9 A|10

and I need to get in this example:

(species) A,B (are in groups) OG_1, OG_2, OG_3
(species) A,C (are in groups) OG_1, OG_3
(species) B,C (are in groups) OG_1, OG_3
(species) A,B,C (are in groups) OG_1, OG_3
(species) B (is in groups) OG_1, OG_2, OG_3

I thought to try

for species in range(start, end=None):         
    if end == None:           
        start = 0
        end = start + 1

to get the first species in my listOfAllSpecies and then tell me which OG_XXXX groups contain it. Then take the first and second species, etc., until all 66 species are included. How do I modify the range within the for loop, or is there a different way to do this?

Here is my actual code, with the functions I have so far but without the part I am asking about:

import sys

if len(sys.argv) != 2:
    print("Error, file name to open is missing")
    sys.exit(1)

def readGroupFile(groupFileName):
    # Maps group name -> {species name -> list of gene IDs}
    dict_gene_taxonomy = {}
    fh = open(groupFileName, "r")

    for line in fh:
        liste = line.split(": ")
        groupName = liste[0]
        genesAsString = liste[1]
        dict_taxon = {}
        liste_gene = genesAsString.split()

        for item in liste_gene:
            taxonomy_gene = item.split("|")
            taxonomy = taxonomy_gene[0]
            geneId   = taxonomy_gene[1]

            if taxonomy not in dict_taxon:
                dict_taxon[taxonomy] = []

            dict_taxon[taxonomy].append(geneId)

        dict_gene_taxonomy[groupName] = dict_taxon
    fh.close()
    return dict_gene_taxonomy


def showListOfAllSpecies(dictio):
    # Collects every species name once, in order of first appearance
    listAllSpecies = []
    for groupName in dictio:
        dictio_in_dictio = dictio[groupName]
        for speciesName in dictio_in_dictio:
            if speciesName not in listAllSpecies:
                listAllSpecies.append(speciesName)
    return listAllSpecies

dico = readGroupFile(sys.argv[1])
listAllSpecies = showListOfAllSpecies(dico)
  • So to clarify, `OG_117996` this is the group, these are the species: `R_baltica_p`, `V_spinosum_v` and these are the species id's `32476565`, `497645257`? – kylieCatt Jun 05 '15 at 13:15
  • I'm not sure what your question is, but maybe you need `itertools.combinations`. – Kevin Jun 05 '15 at 13:16
  • to IanAuld : Hi, Yes exactly. but i don't think that the geneID or as you called it "species id's" is really needed to reach my aim. I only need species and groups name. – Arnaud 'KaRn1zC' Jun 05 '15 at 13:18
  • He was joking. You can't mark a question as more important than any other question. Unless you count bounties, but you don't have the reputation necessary to make one. – Kevin Jun 05 '15 at 13:19
  • It's equally trivial to include them so you might as well. Now for the million dollar question, what have you tried and what results are you getting? – kylieCatt Jun 05 '15 at 13:19
  • to Kevin : Hi, i'm gonna try something with the itertools.combinations you told me. I didn't think it was a joke for the "urgent" thing, so you did well by telling me it was ^^ – Arnaud 'KaRn1zC' Jun 05 '15 at 13:22

3 Answers

Not sure if this is exactly what you want, but it's a start :)

from itertools import combinations

# Assume input is a list of strings called input_list
input_list = ['OG_1: A|1 A|3 B|1 C|2','OG_2: A|4 B|6','OG_3: C|8 B|9 A|10']

# Create a dict to store relationships and a set to store species
rels = {}
species = set()

# Populate the dict
for item in input_list:
    params = item.split(': ')
    og = params[0]
    raw_species = params[1].split()
    s = [rs.split('|')[0] for rs in raw_species]
    rels[og] = s

    for name in s:
        species.add(name)

# Get the possible combinations of species:
combos = [c for limit in range(1, len(species) + 1) for c in combinations(species, limit)]

def combo_in_og(combo, og):
    for item in combo:
        if item not in rels[og]:
            return False
    return True

# Loop over the combinations and print
for combo in combos:
    valid_ogs = []
    for og in rels:
        if combo_in_og(combo, og):
            valid_ogs.append(og)
    print('(species) ' + ','.join(combo) + ' (are in groups) ' + ', '.join(valid_ogs))

Produces:

(species) C (are in groups) OG_1, OG_3
(species) A (are in groups) OG_1, OG_2, OG_3
(species) B (are in groups) OG_1, OG_2, OG_3
(species) C,A (are in groups) OG_1, OG_3
(species) C,B (are in groups) OG_1, OG_3
(species) A,B (are in groups) OG_1, OG_2, OG_3
(species) C,A,B (are in groups) OG_1, OG_3

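Just a warning: what you're trying to do will start to take forever with a large enough number of species, because a set of N species has 2^N - 1 non-empty subsets. With 66 species that is about 7.4 × 10^19 combinations. You can't get around it if you really need every combination (that's what the problem demands), but it's there.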
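
If only the combinations that actually occur in the file matter, one way to sidestep the full enumeration is to key each group by the exact set of species it contains. The following is a minimal sketch under that assumption, using the same line format as `input_list` above; `groups_by_species_set` is a hypothetical helper name, not something from the question. Note that it answers a slightly different question: which groups have exactly a given set of species, rather than at least that set.

```python
from collections import defaultdict

def groups_by_species_set(lines):
    """Map each distinct species set to the groups that have exactly that set."""
    result = defaultdict(list)
    for line in lines:
        og, genes = line.split(': ')
        # Keep only the species part of each "species|geneID" token
        species_set = frozenset(g.split('|')[0] for g in genes.split())
        result[species_set].append(og)
    return dict(result)

input_list = ['OG_1: A|1 A|3 B|1 C|2', 'OG_2: A|4 B|6', 'OG_3: C|8 B|9 A|10']
for species_set, ogs in groups_by_species_set(input_list).items():
    print(','.join(sorted(species_set)), '->', ', '.join(ogs))
```

The work here grows with the size of the file rather than with the number of possible subsets, which is why it stays usable with 66 species.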

Rob Grant
  • Hi, thank you so much for your answer. I'm gonna try it right now and tell you what happen – Arnaud 'KaRn1zC' Jun 05 '15 at 14:11
  • Well, i think you gave the answer to my problem. But the thing is, I've never seen (for now) more than half terms that you used, so my internship's Master will not accept that as a result. However it's a good way to teach me new things in Python that I'll need in my future, so I'm very thankful for that. – Arnaud 'KaRn1zC' Jun 05 '15 at 15:49
  • @Arnaud'KaRn1zC' glad it helped. Try looking up [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) and [itertools.combinations](https://docs.python.org/3/library/itertools.html#itertools.combinations). And happy learning :) – Rob Grant Jun 05 '15 at 16:46

What about using a while loop to control the range() parameters?

end = 0
start = 0
while end < 1000:
    for species in range(start, end):         
        pass  # ...do something with species

    end += 1
alec_djinn
  • Hi, i tried what you said but it doesn't seem to work, it says : end += 1 ^ TabError: inconsistent use of tabs and spaces in indentation thank you for answering by the way. – Arnaud 'KaRn1zC' Jun 05 '15 at 13:44
  • It means that the indentation in your code is not consistent. If you use tabs to indent your code, Python will raise an error when it suddenly finds a block indented with spaces. You must be consistent; this often happens when you copy and paste code. Check whether spaces and tabs are mixed in your code; if so, just correct it and it should work. – alec_djinn Jun 05 '15 at 13:49
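
To track down where the mix-up is, a small check along these lines may help. It is only a sketch, and it assumes the file is meant to be indented with spaces, so any tab in the leading whitespace is reported; `lines_indented_with_tabs` is a hypothetical helper name. The standard library's `tabnanny` module (`python -m tabnanny yourfile.py`) is another option.

```python
def lines_indented_with_tabs(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip would remove
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if '\t' in indent:
            bad.append(number)
    return bad

sample = "end = 0\nwhile end < 1000:\n    pass\n\tend += 1\n"
print(lines_indented_with_tabs(sample))
```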

The number of non-empty subsets of a set of N items (your set of all species) is 2^N - 1.

That's because it is just like a binary number of N bits, where each bit can be 1 (take that species in the subset) or 0 (exclude that species from the subset.) The -1 excludes the empty set (all bits 0)

Therefore you can enumerate all the subsets of species with a simple integer loop:

# sample data
listOfAllSpecies = ['A', 'B', 'C']

# enumerate all subsets of listOfAllSpecies, 0 excluded (the empty set)
for bits in range(1, 2**len(listOfAllSpecies)):

    # build the subset
    subset = []
    for n in range(len(listOfAllSpecies)):
        # test if the current subset includes bit n
        if bits & 2**n:
            subset.append(listOfAllSpecies[n])

    # see which groups contain the given subset
    print("species", ",".join(subset), "are in groups TODO")

Result:

species A are in groups TODO
species B are in groups TODO
species A,B are in groups TODO
species C are in groups TODO
species A,C are in groups TODO
species B,C are in groups TODO
species A,B,C are in groups TODO

If you also need the code to test if a group contains a subset, you need to specify how the groups are stored in your program.
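
For illustration, assuming the groups are stored as the nested dict that `readGroupFile` in the question returns (group name mapped to a dict of species mapped to gene IDs), the containment test could be sketched as follows. `groups_containing` is a hypothetical helper name, and the sample `dico` is hand-built from the simplified example rather than read from a file.

```python
def groups_containing(subset, dico):
    """Return the names of groups whose species include every species in subset."""
    wanted = set(subset)
    # Iterating a dict yields its keys, so set(taxa) is the group's species set
    return [name for name, taxa in dico.items() if wanted <= set(taxa)]

dico = {
    'OG_1': {'A': ['1', '3'], 'B': ['1'], 'C': ['2']},
    'OG_2': {'A': ['4'], 'B': ['6']},
    'OG_3': {'C': ['8'], 'B': ['9'], 'A': ['10']},
}
print(groups_containing(['A', 'B'], dico))
print(groups_containing(['B', 'C'], dico))
```

Calling this in place of the "TODO" string, with `subset` as the first argument, would fill in the group names for each subset.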

If this post answers your question, you should click the green checkmark ✔ on the top left corner.

Tobia
  • Hi, first of all thank you for your answer. I tried to use what you wrote, the problem is that it seem to create an infinite loop. Moreover, it doesn't compare my list of species to every groups in my groups.txt file and it also doesn't send me back the name of the groups which contain the A species, then the A,B species, then the A,B,C species, etc. because It always write "TODO". Thank you by the way. – Arnaud 'KaRn1zC' Jun 05 '15 at 13:36
  • It cannot create an infinite loop, because it only contains two finite `for` loops. It's probably just a very long loop. As I wrote above, if you need the code to test if a group contains a subset, you need to specify how the groups are stored in your program. Parsing a text file is beyond the scope of this question, so you should already have the groups as a data structure in your program (a list, a dict, or such.) – Tobia Jun 05 '15 at 15:18
  • Thank you for your answers. I edited my post to add my code with functions that i made for dict and list( = listAllSpecies). I'll keep thinking about the way to solve my problems with the tips that you gave me. – Arnaud 'KaRn1zC' Jun 05 '15 at 15:52