0

I have a list of strings for which I need to find ALL common unique substrings (actually paths) with minimal length in them. Example:

/a/b/c

/a/b

/a

/d/e/f

/d/e

/g/h

For this input, I need the following result:

/a

/d/e

/g/h

As you see, I need the paths (or substrings) with the minimal length that have a unique prefix. /a is the minimal substring for all paths starting with /a. /d/e is the minimal substring for all paths starting with /d/e. The same goes for /g/h.

A practical application of this is to find all roots of the path trees that have a certain file in it to analyze them further. Consider this example:

/a/b/c/index.html

/a/b/index.html

/a/index.html

/d/e/f/index.html

/d/e/index.html

/g/h/index.html

Let's say I want to have the topmost (in terms of the root) paths that contain an index.html file. As a result, I want "/a/index.html", "/d/e/index.html" and "/g/h/index.html".

Any ideas? There is a lot of theory and examples for the "simple" longest common substring problem, but I have not yet found a solution for finding ALL the common longest substrings efficiently.

A solution with pseudo code would be highly appreciated.

  • 2
    Is the longest substring always going to be a member of the group? And, what language are you using? (You can add a tag for that.) And, I don't see why `/g/h` would be part of the output. – Gordon Linoff Nov 06 '15 at 16:39
  • From your example I assume you need something like the smallest set of substrings that covers your entire list. Is this correct? – biziclop Nov 06 '15 at 16:43
  • 1
    There is no common substring if you consider all the items of your input, are you talking about 3 different inputs and their corresponding outputs? – Aaron Nov 06 '15 at 16:45
  • Dear upvoter, please do tell us, what the task is... – Karoly Horvath Nov 06 '15 at 16:47
  • 1
    @KarolyHorvath No, I mean that `a` covers the first three entries (`abc`, `ab` and `a`), `de` covers `de` and `def` and `gh` covers `gh`. So there is no element in the source set which hasn't got a substring in the result set. That's my guess :) – biziclop Nov 06 '15 at 16:49
  • Yes. and no trivial substrings, like the empty string. – biziclop Nov 06 '15 at 16:54
  • I think it's still ambiguous, but I'm too lazy to produce an example because the OP doesn't give a .... – Karoly Horvath Nov 06 '15 at 16:55
  • I agree with @KarolyHorvath, why is `/a/b` not one of the answers? – maraca Nov 06 '15 at 17:10
  • Please excuse my late response. I made the task description clearer now. @maraca: /a/b is not one of the answers because it is not a minimal solution in terms of length. – Peter Tarragona Nov 07 '15 at 11:13
  • Your Definition is ambiguous. If minimality is the goal, then / is the solution. If you specifically exclude /, then you just separate by the first path component and so in. Nothing algorithmically interesting – Niklas B. Nov 07 '15 at 12:10
  • Or in other words: There is a tradeoff to be had between minimizing the number of solution strings and the size of prefixes that each string has in the solution. You did not define how you want that tradeoff to be resolved – Niklas B. Nov 07 '15 at 12:20

1 Answers1

0

Now with the improved description I think the following algorithm will do:

  1. Split the list of strings into a list of segments (list of array of string)
  2. Start with i = 1 and increase it every iteration doing the following (step 3 and 4) until there are no more items in the list of segments:
  3. Add all segment arrays with length i to the list (if not already in there) of current solutions and the corresponding paths to the final solution.
  4. Remove all items from the list of segments for which the first i items are the same as for one of the items in the current solution (then reset the current solution).
maraca
  • 8,468
  • 3
  • 23
  • 45