How to parse KEGG module (md:) flat files into Pandas DataFrame in Python? (i.e., parse text by section reading line-by-line only once)

Question

I'm trying to write a parser for this type of file, but it has proven to be a bit trickier than I thought. A few years ago I would have just read the file multiple times, but now I think about efficiency, and that is no bueno; there must be a better way.

How can I parse each section while reading the file line-by-line only once?

My attempt sets a variable that changes depending on the section. This started working, but as you can see in the output at the bottom, it gets messed up at the second grouping: the REACTION header line ends up in kegg_pathways. I feel like while loops might be the answer, but I am not sure how to implement that in this context.

from io import StringIO
f = StringIO(
"""
ENTRY       M00001            Pathway   Module
NAME        Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
DEFINITION  (K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K21071,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)
ORTHOLOGY   K00844,K12407,K00845  hexokinase/glucokinase [EC:2.7.1.1 2.7.1.2] [RN:R01786]
            K00886  polyphosphate glucokinase [EC:2.7.1.63] [RN:R02189]
            K08074,K00918  ADP-dependent glucokinase [EC:2.7.1.147] [RN:R09085]
            K01810,K06859,K13810,K15916  glucose-6-phosphate isomerase [EC:5.3.1.9] [RN:R02740]
            K00850,K16370,K21071  6-phosphofructokinase [EC:2.7.1.11] [RN:R04779]
            K00918  ADP-dependent phosphofructokinase [EC:2.7.1.146] [RN:R09084]
            K01623,K01624,K11645,K16305,K16306  fructose-bisphosphate aldolase [EC:4.1.2.13] [RN:R01070]
            K01803  triosephosphate isomerase [EC:5.3.1.1] [RN:R01015]
            K00134,K00150  glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12 1.2.1.59] [RN:R01061 R01063]
            K00927  phosphoglycerate kinase [EC:2.7.2.3] [RN:R01512]
            K11389  glyceraldehyde-3-phosphate dehydrogenase (ferredoxin) [EC:1.2.7.6] [RN:R07159]
            K01834,K15633,K15634,K15635  phosphoglycerate mutase [EC:5.4.2.11 5.4.2.12] [RN:R01518]
            K01689  enolase [EC:4.2.1.11] [RN:R00658]
            K00873,K12406  pyruvate kinase [EC:2.7.1.40] [RN:R00200]
CLASS       Pathway modules; Carbohydrate metabolism; Central carbohydrate metabolism
PATHWAY     map00010  Glycolysis / Gluconeogenesis
            map01200  Carbon metabolism
            map01100  Metabolic pathways
REACTION    R01786,R02189,R09085  C00267 -> C00668
            R02740  C00668 -> C05345
            R04779,R09084  C05345 -> C05378
            R01070  C05378 -> C00111 + C00118
            R01015  C00111 -> C00118
            R01061,R01063  C00118 -> C00236
            R01512  C00236 -> C00197
            R07159  C00118 -> C00197
            R01518  C00197 -> C00631
            R00658  C00631 -> C00074
            R00200  C00074 -> C00022
COMPOUND    C00267  alpha-D-Glucose
            C00668  alpha-D-Glucose 6-phosphate
            C05345  beta-D-Fructose 6-phosphate
            C05378  beta-D-Fructose 1,6-bisphosphate
            C00111  Glycerone phosphate
            C00118  D-Glyceraldehyde 3-phosphate
            C00236  3-Phospho-D-glyceroyl phosphate
            C00197  3-Phospho-D-glycerate
            C00631  2-Phospho-D-glycerate
            C00074  Phosphoenolpyruvate
            C00022  Pyruvate
///
"""
)


kegg_definition = None
kegg_orthology = list()
kegg_ortholog_set = set()
kegg_class = None
kegg_pathways = list()
kegg_pathway_set = set()
kegg_reactions = list()
kegg_reaction_set = set()
kegg_compounds = list()
kegg_compound_set = set()

# Read KEGG module text
parsing = None
for line_level_1 in f:
    line_level_1 = line_level_1.strip()
    if not line_level_1.startswith("/"):
        # Get orthologs
        if line_level_1.startswith("DEFINITION"):
            kegg_definition = line_level_1.replace("DEFINITION","").strip()
            kegg_ortholog_set = str(kegg_definition)
            for character in list("(+ -)"):
                kegg_ortholog_set = kegg_ortholog_set.replace(character, ",")  # str.replace returns a new string
            kegg_ortholog_set = set(filter(bool, kegg_ortholog_set.split(",")))
            parsing = "ORTHOLOGY"
        if parsing == "ORTHOLOGY":
            ko_annot = line_level_1.replace("DEFINITION","").strip()
            kegg_orthology.append(ko_annot)
        # Get class
        if line_level_1.startswith("CLASS"):
    #         parsing = None
            kegg_class = line_level_1.replace("CLASS","").strip().split("; ")
        # Get pathways
        if line_level_1.startswith("PATHWAY"):
            parsing = "PATHWAY"
        if parsing == "PATHWAY":
            kegg_pathway = line_level_1.replace("PATHWAY","").strip().split("; ")
            kegg_pathways.append(kegg_pathway)
        # Get reactions
        if line_level_1.startswith("REACTION"):
            parsing = "REACTION"
        if parsing == "REACTION":
            kegg_reaction = line_level_1.replace("REACTION","").strip().split("; ")
            kegg_reactions.append(kegg_reaction)
        # Get compounds
        if line_level_1.startswith("COMPOUND"):
            parsing = "COMPOUND"
        if parsing == "COMPOUND":
            kegg_compound = line_level_1.replace("COMPOUND","").strip().split("; ")
            kegg_compounds.append(kegg_compound)
            
kegg_pathways
# [['map00010  Glycolysis / Gluconeogenesis'],
#  ['map01200  Carbon metabolism'],
#  ['map01100  Metabolic pathways'],
#  ['REACTION    R01786,R02189,R09085  C00267 -> C00668']]

For example pyparsing: https://stackoverflow.com/questions/1776185/advice-on-python-parser-generators — mkrieger1, Mar 18 '21 at 20:41
Concerning your current issue: you wrote "As you can see it gets messed up [in] REACTION". How should we see that? What did you expect to happen and what happens instead? — mkrieger1, Mar 18 '21 at 20:47
It's always frustrating to answer questions of the form "I want to parse a file like this", where "like this" is explained with a single example. "Like this" says nothing about the other possible options: are there more keywords/sections/attributes (or whatever those are)? Are some of them optional? Is the order fixed? Is whitespace strict (are 2 spaces different from 1 space), somewhat significant, or does it only matter to human readers? And a whole bunch more... If you have a specific question about parsing, provide enough details so that the precise file format is not needed. ... — rici, Mar 18 '21 at 21:29
If you do have a precise specification, it should cost little to include a link in the question, in case some detail in the spec happens to be important. (This isn't an offer to handle out-of-scope questions like write-my-project (too broad) or gimme-a-tool (too opinionated).) — rici, Mar 18 '21 at 21:33

score 1 · Answer 1 · answered Mar 18 '21 at 20:46

It's because you don't check for REACTION until after you've already gathered the line into PATHWAY. Do all of your "next section?" checking before you start using the "parsing" value:

# Read KEGG module text
parsing = None
for line_level_1 in f:
    line_level_1 = line_level_1.strip()
    if not line_level_1.startswith("/"):
        if line_level_1.startswith("DEFINITION"):
            kegg_definition = line_level_1.replace("DEFINITION","").strip()
            kegg_ortholog_set = str(kegg_definition)
            for character in list("(+ -)"):
                kegg_ortholog_set = kegg_ortholog_set.replace(character, ",")  # str.replace returns a new string
            kegg_ortholog_set = set(filter(bool, kegg_ortholog_set.split(",")))
        elif line_level_1.startswith("CLASS"):
            kegg_class = line_level_1.replace("CLASS","").strip().split("; ")
            parsing = None  # leave the ORTHOLOGY section so this line is not appended there
        elif line_level_1.startswith("ORTHOLOGY"):
            parsing = "ORTHOLOGY"
        elif line_level_1.startswith("PATHWAY"):
            parsing = "PATHWAY"
        elif line_level_1.startswith("REACTION"):
            parsing = "REACTION"
        elif line_level_1.startswith("COMPOUND"):
            parsing = "COMPOUND"
        
        # Get orthology annotations
        if parsing == "ORTHOLOGY":
            ko_annot = line_level_1.replace("ORTHOLOGY","").strip()
            kegg_orthology.append(ko_annot)
        # Get pathways
        elif parsing == "PATHWAY":
            kegg_pathway = line_level_1.replace("PATHWAY","").strip().split("; ")
            kegg_pathways.append(kegg_pathway)
        # Get reactions
        elif parsing == "REACTION":
            kegg_reaction = line_level_1.replace("REACTION","").strip().split("; ")
            kegg_reactions.append(kegg_reaction)
        # Get compounds
        elif parsing == "COMPOUND":
            kegg_compound = line_level_1.replace("COMPOUND","").strip().split("; ")
            kegg_compounds.append(kegg_compound)
            
print(kegg_pathways)
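
The same idea (decide which section a line belongs to first, then store it) can also be written without one branch per keyword. The following is only a sketch, not part of the code above: `parse_sections` is a hypothetical helper name, and it assumes that section keywords start in the first column, that continuation lines are indented, and that `///` ends the record, as in the example file.

```python
from collections import defaultdict
from io import StringIO


def parse_sections(handle):
    """Group the lines of one KEGG-style record by section keyword (single pass)."""
    sections = defaultdict(list)
    current = None
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("///"):
            break  # end-of-record marker
        if not line[0].isspace():
            # Header line: switch section *before* storing anything
            current, _, value = line.partition(" ")
        else:
            # Indented continuation line: stays in the current section
            value = line
        value = value.strip()
        if current is not None and value:
            sections[current].append(value)
    return dict(sections)


sample = StringIO(
    "PATHWAY     map00010  Glycolysis / Gluconeogenesis\n"
    "            map01200  Carbon metabolism\n"
    "REACTION    R01786,R02189,R09085  C00267 -> C00668\n"
    "            R02740  C00668 -> C05345\n"
    "///\n"
)
parsed = parse_sections(sample)
print(parsed["PATHWAY"])
```

Because the section switch happens before the line is stored, the REACTION header can no longer leak into the PATHWAY list. Splitting each stored entry into an ID and a description is left to the caller.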

O.rka · Accepted Answer · 2021-03-23T20:33:24.910

Just in case this helps anyone in the future, here's the parser I made. However, if you want to calculate module completion ratios from a group of orthologs (e.g., a metagenome-assembled genome or bin), you need to take into account the logical operators in the KEGG definition and the fact that some modules are bifurcated. In that case you should probably use some of the code in MicrobeAnnotator.

import datetime
from collections import OrderedDict

import pandas as pd
from Bio.KEGG.REST import kegg_list, kegg_get

# Parse KEGG Module
def parse_kegg_module(module_file):
    """
    
    Parse one KEGG module flat file (any iterable of text lines) into a pd.Series.

    Example of a KEGG REST module file:
    
    ENTRY       M00001            Pathway   Module
    NAME        Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
    DEFINITION  (K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K21071,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)
    ORTHOLOGY   K00844,K12407,K00845  hexokinase/glucokinase [EC:2.7.1.1 2.7.1.2] [RN:R01786]
                K00886  polyphosphate glucokinase [EC:2.7.1.63] [RN:R02189]
                K08074,K00918  ADP-dependent glucokinase [EC:2.7.1.147] [RN:R09085]
                K01810,K06859,K13810,K15916  glucose-6-phosphate isomerase [EC:5.3.1.9] [RN:R02740]
                K00850,K16370,K21071  6-phosphofructokinase [EC:2.7.1.11] [RN:R04779]
                K00918  ADP-dependent phosphofructokinase [EC:2.7.1.146] [RN:R09084]
                K01623,K01624,K11645,K16305,K16306  fructose-bisphosphate aldolase [EC:4.1.2.13] [RN:R01070]
                K01803  triosephosphate isomerase [EC:5.3.1.1] [RN:R01015]
                K00134,K00150  glyceraldehyde 3-phosphate dehydrogenase [EC:1.2.1.12 1.2.1.59] [RN:R01061 R01063]
                K00927  phosphoglycerate kinase [EC:2.7.2.3] [RN:R01512]
                K11389  glyceraldehyde-3-phosphate dehydrogenase (ferredoxin) [EC:1.2.7.6] [RN:R07159]
                K01834,K15633,K15634,K15635  phosphoglycerate mutase [EC:5.4.2.11 5.4.2.12] [RN:R01518]
                K01689  enolase [EC:4.2.1.11] [RN:R00658]
                K00873,K12406  pyruvate kinase [EC:2.7.1.40] [RN:R00200]
    CLASS       Pathway modules; Carbohydrate metabolism; Central carbohydrate metabolism
    PATHWAY     map00010  Glycolysis / Gluconeogenesis
                map01200  Carbon metabolism
                map01100  Metabolic pathways
    REACTION    R01786,R02189,R09085  C00267 -> C00668
                R02740  C00668 -> C05345
                R04779,R09084  C05345 -> C05378
                R01070  C05378 -> C00111 + C00118
                R01015  C00111 -> C00118
                R01061,R01063  C00118 -> C00236
                R01512  C00236 -> C00197
                R07159  C00118 -> C00197
                R01518  C00197 -> C00631
                R00658  C00631 -> C00074
                R00200  C00074 -> C00022
    COMPOUND    C00267  alpha-D-Glucose
                C00668  alpha-D-Glucose 6-phosphate
                C05345  beta-D-Fructose 6-phosphate
                C05378  beta-D-Fructose 1,6-bisphosphate
                C00111  Glycerone phosphate
                C00118  D-Glyceraldehyde 3-phosphate
                C00236  3-Phospho-D-glyceroyl phosphate
                C00197  3-Phospho-D-glycerate
                C00631  2-Phospho-D-glycerate
                C00074  Phosphoenolpyruvate
                C00022  Pyruvate
    ///
    """

    kegg_module = None
    kegg_name = None
    kegg_definition = None
    kegg_orthology = list()
    kegg_ortholog_set = set()
    kegg_classes = list()
    kegg_pathways = list()
    kegg_pathway_set = set()
    kegg_reactions = list()
    kegg_reaction_set = set()
    kegg_compounds = list()
    kegg_compound_set = set()

    # Read KEGG module text
    parsing = None
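    for line in module_file:
        # Section keywords start in column 0 and continuation lines are indented,
        # so the indentation has to be checked on the raw line, before stripping.
        is_header = not line.startswith(" ")
        line = line.strip()
        if not line.startswith("/"):
            if is_header:
                first_word = line.split(" ")[0]
                if first_word.isupper() and first_word.isalpha():
                    parsing = first_word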
            if parsing == "ENTRY":
                kegg_module = list(filter(bool, line.split(" ")))[1]
            if parsing == "NAME":
                kegg_name = line.replace(parsing, "").strip()
                parsing = None
            if parsing == "DEFINITION":
                kegg_definition = line.replace(parsing,"").strip()
                kegg_ortholog_set = str(kegg_definition)
                for character in list("(+ -)"):
                    kegg_ortholog_set = kegg_ortholog_set.replace(character, ",")
                kegg_ortholog_set = set(filter(bool, kegg_ortholog_set.split(",")))
            if parsing == "ORTHOLOGY":
                kegg_orthology.append(line.replace(parsing,"").strip())
            if parsing == "CLASS":
                kegg_classes = line.replace(parsing,"").strip().split("; ")
            if parsing == "PATHWAY":
                kegg_pathway = line.replace(parsing,"").strip()
                kegg_pathways.append(kegg_pathway)
                id_pathway = kegg_pathway.split(" ")[0]
                kegg_pathway_set.add(id_pathway)
            if parsing == "REACTION":
                kegg_reaction = line.replace(parsing,"").strip()
                kegg_reactions.append(kegg_reaction)
                for id_reaction in kegg_reaction.split(" ")[0].split(","):
                    kegg_reaction_set.add(id_reaction)
            if parsing == "COMPOUND":
                kegg_compound = line.replace(parsing,"").strip()
                id_compound = kegg_compound.split(" ")[0]
                kegg_compounds.append(kegg_compound)
                kegg_compound_set.add(id_compound)

    module_info = pd.Series(
        data = OrderedDict([
            ("NAME",kegg_name),
            ("DEFINITION",kegg_definition),
            ("ORTHOLOGY",kegg_orthology),
            ("ORTHOLOGY_SET",kegg_ortholog_set),
            ("CLASS",kegg_classes),
            ("PATHWAY",kegg_pathways),
            ("PATHWAY_SET",kegg_pathway_set),
            ("REACTION",kegg_reactions),
            ("REACTION_SET",kegg_reaction_set),
            ("COMPOUND",kegg_compounds),
            ("COMPOUND_SET",kegg_compound_set),
        ]),
        name=kegg_module,
    )
            
    return module_info


def get_kegg_modules(expand_nested_modules=True):
    results = list()
    for line in list(kegg_list("module")):
        line = line.strip()
        module, name = line.split("\t")
        prefix, id_module = module.split(":")
        module_file = kegg_get(module)
        module_info = parse_kegg_module(module_file)
        results.append(module_info)
    df = pd.DataFrame(results)
    df.index.name = datetime.datetime.now().strftime("Accessed: %Y-%m-%d @ %H:%M")
    
    # Expand nested modules
    if expand_nested_modules:
        for id_module, row in df.iterrows():
            kegg_orthology_set = row["ORTHOLOGY_SET"]
            expanded = set()
            for x in kegg_orthology_set:
                if x.startswith("K"):
                    expanded.add(x)
                if x.startswith("M"):
                    for id_ko in df.loc[x,"ORTHOLOGY_SET"]:
                        expanded.add(id_ko)
            # .at addresses a single cell, so the whole set is stored as one object
            df.at[id_module, "ORTHOLOGY_SET"] = expanded
    return df
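
On the point about the logical operators in the DEFINITION line: the sketch below shows one way a completion ratio could be estimated from a set of KO IDs. It is a simplified illustration, not MicrobeAnnotator's algorithm, and all function names are hypothetical. It assumes that a space means "all required" (AND), a comma means "alternatives" (OR), parentheses group, and a comma binds more loosely than a space inside a group. It does not handle the `+` and `-` operators, nested module IDs, or bifurcated modules, and it treats each top-level, space-separated term as one step.

```python
def split_top_level(expr, sep):
    """Split expr on sep, ignoring separators nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p for p in parts if p]


def strip_outer_parens(expr):
    """Remove enclosing parentheses only when they wrap the whole expression."""
    while expr.startswith("(") and expr.endswith(")"):
        depth = 0
        for i, ch in enumerate(expr):
            depth += (ch == "(") - (ch == ")")
            if depth == 0 and i < len(expr) - 1:
                return expr  # first "(" closes early, e.g. "(a) (b)"
        expr = expr[1:-1]
    return expr


def is_satisfied(expr, present):
    """True if the KO IDs in `present` fulfil the expression (assumed semantics)."""
    expr = strip_outer_parens(expr.strip())
    alternatives = split_top_level(expr, ",")
    if len(alternatives) > 1:  # comma: any alternative is enough
        return any(is_satisfied(a, present) for a in alternatives)
    required = split_top_level(expr, " ")
    if len(required) > 1:  # space: every part is needed
        return all(is_satisfied(r, present) for r in required)
    return expr in present  # single KO ID


def completion_ratio(definition, present):
    """Fraction of top-level steps of the definition that are satisfied."""
    steps = split_top_level(definition.strip(), " ")
    return sum(is_satisfied(s, present) for s in steps) / len(steps)


# DEFINITION line of M00001 from the example file
definition = (
    "(K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) "
    "(K00850,K16370,K21071,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 "
    "((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 "
    "(K00873,K12406)"
)
present = {"K00844", "K01810", "K00134", "K00927"}
ratio = completion_ratio(definition, present)
print(f"{ratio:.2f}")
```

With these four KOs, three of the nine top-level steps count as satisfied. The sixth step illustrates the grouping: it is fulfilled either by `K11389` alone or by `K00927` together with one of `K00134`/`K00150`.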
