Parse only selected records from empty-line separated file

Question

I have a file with the following structure:

SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz

Records (i.e., blocks) are separated by an empty line. Each line in a block starts with the SE tag. The text tag always occurs on the first line of each block.

I wonder how to properly extract only blocks with a relation tag, which is not necessarily present in each block. My attempt is pasted below:

from itertools import groupby
with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block() ## ?

The desired output is a JSON dump:

{
    "result": [
        {
            "text": "Baz", 
            "relation": ["Bla","Foo"]
        },
        {
            "text": "Zoo", 
            "relation": ["Bla","Baz"]
        }

    ]
}

@sushanth I don't know exactly what do you mean. Could you please elaborate? — Andrej, Sep 13 '20 at 11:04
``relation`` key is duplicated, which is not allowed.. are you looking for ``'relation' : ['Bla', 'foo']..`` ? — sushanth, Sep 13 '20 at 11:06

Patrick Artner · Answer 1 · 2020-09-13T11:21:58.470

You cannot store the same key twice in a dictionary, as mentioned in the comments. You can read your file, split it at '\n\n' into blocks, split the blocks into lines at '\n', and split the lines into fields at '|'.

You can then put the fields into a suitable data structure and serialize it to a string using the json module:

Create data file:

with open("f.txt", "w") as f:
    f.write('''SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz''')

Read data and process it:

with open("f.txt") as f:
    all_text = f.read()
    as_blocks = all_text.split("\n\n")
    # drop the leading SE field and keep only blocks that contain |relation|
    with_relation = [[k.split("|")[1:]
                      for k in b.split("\n")]
                     for b in as_blocks if "|relation|" in b]

    print(with_relation)

Create a suitable data structure, grouping repeated keys into a list:

result = []
for inner in with_relation:
    result.append({})
    for k,v in inner:
        # add as simple key
        if k not in result[-1]:
            result[-1][k] = v

        # got key a 2nd time, convert the stored value to a list
        elif not isinstance(result[-1][k], list):
            result[-1][k] = [result[-1][k], v]

        # got it a 3rd+ time, add to list
        else:
            result[-1][k].append(v)

print(result)

Create json from data structure:

import json

print(json.dumps({"result": result}, indent=4))

Output:

# with_relation
[[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']], 
 [['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]

# result
[{'text': 'Baz', 'entity': 'Bla', 'relation': ['Bla', 'Foo']}, 
 {'text': 'Zoo', 'relation': ['Bla', 'Baz']}]

# json string
{
    "result": [
        {
            "text": "Baz",
            "entity": "Bla",
            "relation": [
                "Bla",
                "Foo"
            ]
        },
        {
            "text": "Zoo",
            "relation": [
                "Bla",
                "Baz"
            ]
        }
    ]
}
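As a side note, the groupby attempt from the question can also be made to work without reading the whole file into memory. The catch is that lines read from a file keep their trailing newline, so bool(line) is True even for an "empty" line and groupby never sees a separator. Below is a minimal sketch that groups on the stripped line instead. The helper names iter_blocks and blocks_with_relation are made up for illustration, and the sketch assumes every non-empty line has exactly three |-separated fields:

```python
import json
from itertools import groupby

# Sample input, identical to the data in the question.
SAMPLE = """SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""


def iter_blocks(lines):
    """Yield each block as a list of stripped, non-empty lines."""
    # Group on the stripped line: a blank line is "\n", which is truthy
    # as-is but falsy once stripped.
    for has_content, group in groupby(lines, key=lambda line: bool(line.strip())):
        if has_content:
            yield [line.strip() for line in group]


def blocks_with_relation(lines):
    """Collect text and relation values of blocks that have a relation."""
    result = []
    for block in iter_blocks(lines):
        text, relations = None, []
        for line in block:
            # Assumes three fields per line; a malformed line raises ValueError.
            _, tag, value = line.split("|", 2)
            if tag == "text" and text is None:
                text = value
            elif tag == "relation":
                relations.append(value)
        if relations:
            result.append({"text": text, "relation": relations})
    return {"result": result}


# splitlines(keepends=True) mimics iterating over an open file object.
output = blocks_with_relation(SAMPLE.splitlines(keepends=True))
print(json.dumps(output, indent=4))
```

With a real file you would pass the open file object itself, e.g. blocks_with_relation(f) inside the with block.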

Jakob Guldberg Aaes · Accepted Answer · 2020-09-13T17:47:55.117

I have a proposed solution in pure Python that keeps a block if any of its lines carries the relation tag. This could most likely be done more elegantly in a proper framework like pandas.

from pprint import pprint

fname = 'ex.txt'

# extract blocks
with open(fname, 'r') as f:
    blocks = [[]]
    for line in f:
        if not line.strip():
            # blank line: start a new block
            blocks.append([])
        else:
            blocks[-1].append(line.strip().split('|'))

# remove blocks that don't contain 'relation'
blocks = [block for block in blocks
          if any('relation' == x[1] for x in block)]

pprint(blocks)
# [[['SE', 'text', 'Baz'],
#   ['SE', 'entity', 'Bla'],
#   ['SE', 'relation', 'Bla'],
#   ['SE', 'relation', 'Foo']],
#  [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]


# To export to JSON, the following can be done
import pandas as pd
import json
results = []
for block in blocks:
    df = pd.DataFrame(block)
    json_dict = {}
    json_dict['text'] = list(df[2][df[1] == 'text'])
    json_dict['relation'] = list(df[2][df[1] == 'relation'])
    results.append(json_dict)
print(json.dumps(results))
# [{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]

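Let's go through it:

Read the file line by line, starting a new block at every blank line and splitting each line into columns at the | character.
Go through each block in the list and filter out any that do not contain a relation line.
Print the remaining blocks, then collect the text and relation values of each block and dump them as JSON.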
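If you would rather not pull in pandas just for the export, the same step can probably be done with plain list comprehensions. The following is only a sketch: it hard-codes the filtered blocks list printed above so that it runs on its own, block_to_record is a hypothetical helper name, and it stores text as a single string (as in the desired output of the question) rather than as a one-element list:

```python
import json

# Same shape as the filtered `blocks` list printed above.
blocks = [
    [['SE', 'text', 'Baz'], ['SE', 'entity', 'Bla'],
     ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Foo']],
    [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']],
]


def block_to_record(block):
    """Turn one block (a list of [SE, tag, value] rows) into a dict."""
    # First 'text' value in the block, or None if the block has no text row.
    text = next((row[2] for row in block if row[1] == 'text'), None)
    # All 'relation' values, in file order.
    relations = [row[2] for row in block if row[1] == 'relation']
    return {'text': text, 'relation': relations}


payload = {'result': [block_to_record(block) for block in blocks]}
print(json.dumps(payload, indent=4))
```

Wrapping the list in a dict under the result key matches the layout asked for in the question.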

Jan · Answer 3 · 2020-09-13T12:58:48.793

In my opinion this is a very good case for a small parser.
This solution uses a PEG parser called parsimonious but you could totally use another one:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import json

data = """
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""


class TagVisitor(NodeVisitor):
    grammar = Grammar(r"""
        content = (ws / block)+

        block   = line+
        line    = ~".+" nl?
        nl      = ~"[\n\r]"
        ws      = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        filtered = [child[0] for child in visited_children if isinstance(child[0], dict)]
        return {"result": filtered}

    def visit_block(self, node, visited_children):
        text, relations = None, []
        for child in visited_children:
            if child[1] == "text" and not text:
                text = child[2].strip()
            elif child[1] == "relation":
                relations.append(child[2])

        if relations:
            return {"text": text, "relation": relations}

    def visit_line(self, node, visited_children):
        tag1, tag2, text = node.text.split("|")
        return tag1, tag2, text.strip()


tv = TagVisitor()
result = tv.parse(data)

print(json.dumps(result))

This yields

{"result": 
    [{"text": "Baz", "relation": ["Bla", "Foo"]}, 
     {"text": "Zoo", "relation": ["Bla", "Baz"]}]
}

The idea is to define a grammar, parse the input into a syntax tree with it, and let the visitor return each block's content in a suitable data format.
