This one is tricky. There are a number of questions and answers on how to traverse trees, but I could not adapt any of the proposed solutions to my special case. My problem is pretty close to "Python: How can I filter a n-nested dict of dicts by leaf value?".

I have JSON data with a special nested structure (subterms, synonyms, name, id) which can be of arbitrary depth.

tree=[{'id': 20, 'name': 'education', 'subterms': [
               {'id': 21, 'name': 'schools', 'synonyms': []},
               {'id': 22, 'name': 'schoolbooks', 'synonyms': ['literature']},
               {'id': 23, 'name': 'higher education', 'synonyms': ['university']},
               {'id': 25, 'name': 'conference', 'synonyms': ['lecture']}]},
 {'id': 26, 'name': 'health', 'subterms': [
               {'id': 27, 'name': 'health issues', 'synonyms': []},
               {'id': 28, 'name': 'nutrition', 'synonyms': []},
               {'id': 29, 'name': 'medicine', 'synonyms': []}]},
 {'id': 1, 'name': 'business', 'subterms': [{'id': 2,
                'name': 'industry',
                'subterms': [{'id': 21, 'name': 'service', 'synonyms': []},
                             {'id': 21, 'name': 'agriculture', 'synonyms': []}],
                'synonyms': []},
               {'id': 3, 'name': 'professions', 'synonyms': ['jobs']}]}]

My aim is to filter this tree by matches for 'name' and 'synonyms'. The branch hierarchy of a matching term has to be preserved: a matching subterm on level 3 means that its parent terms on levels 1 and 2 are also kept, but the non-matching subterms of those parents, and of the matching term itself, are dropped.

For example, filterterms=['literature', 'agriculture'] should result in the following filtered tree:

[{'id': 20, 'name': 'education', 'subterms': [
               {'id': 22,'name': 'schoolbooks', 'synonyms': ['literature']}]},
 {'id': 1, 'name': 'business', 'subterms': [{'id': 2, 'name': 'industry',
                'subterms': [{'id': 21, 'name': 'agriculture', 'synonyms': []}],
                'synonyms': []}]}]

All my attempts to traverse the tree on n-levels and preserve the branch hierarchy for the matching terms have so far failed badly... Any help on how I can solve this task?

boadescriptor

1 Answer

I think this does what you want.

tree=[{'id': 20, 'name': 'education', 'subterms': [
               {'id': 21, 'name': 'schools', 'synonyms': []},
               {'id': 22, 'name': 'schoolbooks', 'synonyms': ['literature']},
               {'id': 23, 'name': 'higher education', 'synonyms': ['university']},
               {'id': 25, 'name': 'conference', 'synonyms': ['lecture']}]},
 {'id': 26, 'name': 'health', 'subterms': [
               {'id': 27, 'name': 'health issues', 'synonyms': []},
               {'id': 28, 'name': 'nutrition', 'synonyms': []},
               {'id': 29, 'name': 'medicine', 'synonyms': []}]},
 {'id': 1, 'name': 'business', 'subterms': [{'id': 2,
                'name': 'industry',
                'subterms': [{'id': 21, 'name': 'service', 'synonyms': []},
                             {'id': 21, 'name': 'agriculture', 'synonyms': []}],
                'synonyms': []},
               {'id': 3, 'name': 'professions', 'synonyms': ['jobs']}]}]

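def filter_by_name(node, names):
    # A list of nodes: filter each one and drop the ones that did not survive.
    # A real list is built here because filter() returns a lazy iterator in
    # Python 3, which is always truthy and would break the check below.
    if isinstance(node, list):
        return [y for y in (filter_by_name(x, names) for x in node if x) if y]
    # Recurse first so that matching descendants are kept.
    subterms = filter_by_name(node.get('subterms', []), names)
    # Keep the node if it matches by name or synonym, or if a descendant matched.
    if set([node['name']] + node.get('synonyms', [])).intersection(names) or subterms:
        return dict(node, subterms=subterms)
    return None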


from pprint import pprint
pprint(filter_by_name(tree, ['business']))
pprint(filter_by_name(tree, ['literature']))
pprint(filter_by_name(tree, ['literature', 'agriculture']))

Result:

[{'id': 1, 'name': 'business', 'subterms': []}]
[{'id': 20,
  'name': 'education',
  'subterms': [{'id': 22,
                'name': 'schoolbooks',
                'subterms': [],
                'synonyms': ['literature']}]}]
[{'id': 20,
  'name': 'education',
  'subterms': [{'id': 22,
                'name': 'schoolbooks',
                'subterms': [],
                'synonyms': ['literature']}]},
 {'id': 1,
  'name': 'business',
  'subterms': [{'id': 2,
                'name': 'industry',
                'subterms': [{'id': 21,
                              'name': 'agriculture',
                              'subterms': [],
                              'synonyms': []}],
                'synonyms': []}]}]
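
Note that the results above carry an extra 'subterms': [] on every leaf, which the expected output in the question does not have. If that matters, a variant along the following lines might work. This is only a sketch: `prune` and `sample` are names made up for this example, and `sample` is a shortened copy of the tree from the question. It keeps a term when its name or one of its synonyms matches, or when any descendant matches, and it adds 'subterms' back only when something survived below the term.

```python
def prune(nodes, names):
    """Return a filtered copy of a list of term dicts (hypothetical helper)."""
    names = set(names)
    kept = []
    for node in nodes:
        # Filter the children first so matching descendants keep their parents.
        children = prune(node.get('subterms', []), names)
        matches = node['name'] in names or names.intersection(node.get('synonyms', []))
        if not (matches or children):
            continue
        # Shallow copy without 'subterms'; re-add it only when non-empty.
        copy = {k: v for k, v in node.items() if k != 'subterms'}
        if children:
            copy['subterms'] = children
        kept.append(copy)
    return kept


# Shortened version of the tree from the question (illustrative data only).
sample = [
    {'id': 20, 'name': 'education', 'subterms': [
        {'id': 21, 'name': 'schools', 'synonyms': []},
        {'id': 22, 'name': 'schoolbooks', 'synonyms': ['literature']}]},
    {'id': 1, 'name': 'business', 'subterms': [
        {'id': 2, 'name': 'industry', 'synonyms': [], 'subterms': [
            {'id': 21, 'name': 'service', 'synonyms': []},
            {'id': 21, 'name': 'agriculture', 'synonyms': []}]},
        {'id': 3, 'name': 'professions', 'synonyms': ['jobs']}]},
]

filtered = prune(sample, ['literature', 'agriculture'])
```

The input is not modified, since each kept term is a new dict, but the 'synonyms' lists are shared with the original tree. A term that matches by itself and has no matching descendants comes back without a 'subterms' key at all.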
Robᵩ
  • Great code. There's a lot to learn from that little function. There is a little glitch, though: when you submit a parent term (e.g. 'education'), it also selects all the child terms. How can this be avoided? Basically, it should select all terms in the hierarchy down through the submitted term, but not beyond. – boadescriptor Apr 10 '15 at 19:43
  • I have edited my answer. Perhaps this version does what you want. – Robᵩ Apr 11 '15 at 15:34