Description
We downloaded a dataset for our research that contains hierarchical data. However, its creators were not consistent at all. Sometimes an entry gives the full path, for example:

term1:term2:term3:term4

whereas in other cases we only have the last term:

term4

Example data
As an example, let's look at this dataset:

data = [['root', 'test', 'coffee'],
        ['root', 'test', 'gains'],
        ['root', 'gains', 'coffee'],
        ['root', 'milk', 'bread']]

Now I want to write code that reconstructs the complete hierarchy (or as much of it as possible) from this data and prints each branch from the root up to its end point:

root:test:gains:coffee
root:milk:bread

I'm pretty sure there is a simple trick to do this, but I haven't found one yet. What I tried is:

  • Starting with the longest branch (it doesn't matter which one in this case) and then adding a new branch whenever I encountered terms that couldn't be fitted into the starting branch.
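For reference, here is a minimal sketch of one way the expected output above could be produced. It is not a confirmed solution. It assumes that a term appearing before another term in any row is an ancestor of it, and that the rows never contradict each other. The names `build_branches`, `after`, `reach` and `children` are made up for this illustration. The idea is to collect every "comes before" relation, drop the relations that are already implied by a longer chain (a transitive reduction), and then walk from each root to each end point:

```python
def build_branches(paths, sep=":"):
    # after[t] = every term seen somewhere behind t in the same row.
    after = {}
    for path in paths:
        for i, term in enumerate(path):
            after.setdefault(term, set()).update(path[i + 1:])
    order = {term: i for i, term in enumerate(after)}  # first-seen order

    # reach[t] = every term reachable from t; a cycle means the rows contradict.
    reach, active = {}, set()

    def visit(term):
        if term in reach:
            return reach[term]
        if term in active:
            raise ValueError(f"contradictory ordering involving {term!r}")
        active.add(term)
        found = set()
        for nxt in after[term]:
            found.add(nxt)
            found |= visit(nxt)
        active.discard(term)
        reach[term] = found
        return found

    for term in after:
        visit(term)

    # Keep only direct children: drop c if another successor already leads to c.
    children = {
        term: sorted(
            (c for c in succ
             if not any(c in reach[b] for b in succ if b != c)),
            key=order.get,
        )
        for term, succ in after.items()
    }

    # Roots are terms that never appear behind another term.
    has_parent = set().union(*after.values()) if after else set()
    roots = [t for t in after if t not in has_parent]

    def walk(term, prefix):
        prefix = prefix + [term]
        if not children[term]:
            yield sep.join(prefix)
        for child in children[term]:
            yield from walk(child, prefix)

    return [branch for root in roots for branch in walk(root, [])]


data = [['root', 'test', 'coffee'],
        ['root', 'test', 'gains'],
        ['root', 'gains', 'coffee'],
        ['root', 'milk', 'bread']]

for branch in build_branches(data):
    print(branch)
# root:test:gains:coffee
# root:milk:bread
```

A row that holds only a single term, such as `['coffee']`, adds no ordering information, so it simply ends up wherever the other rows place that term. The recursion is fine for a sketch but may need to be made iterative for very deep hierarchies.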
CodeNoob
  • As it stands, the rules are not clearly defined. For example, what should it do if it encounters both `a:b:c` and `a:c:b`? Should it just abort, saying it is not possible? – zvone Aug 30 '18 at 17:31
  • My dataset is too large to know that beforehand, but let's assume (and hope) that this does not occur @zvone – CodeNoob Aug 30 '18 at 17:37
  • I think I would try to solve it using [C3 Linearization](https://en.wikipedia.org/wiki/C3_linearization) - the same mechanism which is used by Python for the MRO. It looks like the same type of problem. – zvone Aug 30 '18 at 19:12
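To make the C3 suggestion concrete, here is a rough sketch of only the merge step of C3 linearization applied to the rows of the example. The function name `c3_merge` is hypothetical and this is not Python's internal MRO code. The merge repeatedly picks the first head that does not appear in the tail of any remaining row, and it fails when no such head exists, which is exactly the `a:b:c` versus `a:c:b` case:

```python
def c3_merge(sequences):
    # Work on copies and ignore empty rows.
    seqs = [list(s) for s in sequences if s]
    result = []
    while seqs:
        for seq in seqs:
            head = seq[0]
            # A head is acceptable if no row still has it after its first item.
            if not any(head in other[1:] for other in seqs):
                break
        else:
            raise ValueError("inconsistent ordering; no valid merge exists")
        result.append(head)
        # Remove the chosen head from the front of every row that starts with it.
        seqs = [s[1:] if s[0] == head else s for s in seqs]
        seqs = [s for s in seqs if s]
    return result


data = [['root', 'test', 'coffee'],
        ['root', 'test', 'gains'],
        ['root', 'gains', 'coffee'],
        ['root', 'milk', 'bread']]

print(c3_merge(data))
# ['root', 'test', 'gains', 'coffee', 'milk', 'bread']
```

Note that this gives a single flat order that respects every row, not separate branches, so it would still need a step that splits the result into `root:test:gains:coffee` and `root:milk:bread`.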
