Given a set of strings, I would like to automatically compress each string into a minimum length regular expression. The regular expression for two different strings should only be the same if these strings are identical.

For example:

String 1: ABCCCCCCCCABCCCCCCCCCCBBC = (AB[C]{8}){2}CCBBC

String 2: ABCCCCABCCCCCCCCCCBBC = (AB[C]{4}){2}C{6}BBC

*This is an example of the compression I mean even though it may not be the shortest way of doing it.
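Since expressions of this kind contain no alternation and no open-ended quantifiers, each one denotes exactly one string, so it can be expanded back to check the round trip. A minimal sketch, assuming the pattern uses only literal letters, single-letter classes such as `[C]`, parenthesised groups and `{n}` counts (the function name `expand` is hypothetical):

```javascript
// Expand a pattern made only of literals, [X] classes, (groups) and {n} counts.
function expand(pattern) {
  let pos = 0;
  function sequence() {
    let out = "";
    while (pos < pattern.length && pattern[pos] !== ")") {
      let atom;
      if (pattern[pos] === "(") {
        pos++;                      // skip "("
        atom = sequence();          // expand the group body
        pos++;                      // skip ")"
      } else if (pattern[pos] === "[") {
        atom = pattern[pos + 1];    // single-letter class such as [C]
        pos += 3;                   // skip "[", the letter and "]"
      } else {
        atom = pattern[pos++];      // plain literal
      }
      let count = 1;
      if (pattern[pos] === "{") {   // optional {n} after the atom
        const close = pattern.indexOf("}", pos);
        count = Number(pattern.slice(pos + 1, close));
        pos = close + 1;
      }
      out += atom.repeat(count);
    }
    return out;
  }
  return sequence();
}

console.log(expand("(AB[C]{8}){2}CCBBC") === "ABCCCCCCCCABCCCCCCCCCCBBC"); // true
console.log(expand("(AB[C]{4}){2}C{6}BBC") === "ABCCCCABCCCCCCCCCCBBC");   // true
```

Two different strings can then never share an expression, because expansion is deterministic.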

Note that string length matters: There is no need to use B{2} to represent string BB as this takes up more characters.
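To make the length rule concrete, here is a small sketch. It assumes a repetition is written `X{n}` for a single character and `(X){n}` for a longer unit; the helper names `encodedLength` and `isProfitable` are hypothetical:

```javascript
// Length of the quantified form: X{n} for one character, (X){n} otherwise.
function encodedLength(unit, count) {
  const parens = unit.length > 1 ? 2 : 0;                  // "(" and ")"
  return unit.length + parens + 2 + String(count).length;  // "{", digits, "}"
}

// A repetition is worth collapsing only if it is strictly shorter than the literal.
function isProfitable(unit, count) {
  return encodedLength(unit, count) < unit.length * count;
}

console.log(isProfitable("B", 2)); // false: "B{2}" is 4 characters, "BB" is 2
console.log(isProfitable("C", 8)); // true: "C{8}" is 4 characters, the literal is 8
```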

Is there an established method for doing this?

A good answer would point to academic work on this problem, with an explanation, and/or offer a solution, whether theoretical or as an implementation. For an implementation, I would prefer Java.

Michael Anslow
  • Do you really need the regular expression for this? If compression and fast searching is what you want you can achieve this with suffix trees and tries. – smichak May 14 '15 at 12:27
  • I can't make sense of what you are trying to do here. Are you trying to compress `(list|of|words)` into a minimal regex? And what is the scope of "regular expression" you are speaking of? Theoretical, or the one in programming languages? – nhahtdh May 14 '15 at 12:28
  • I'm not sure what you're aiming for here. For many strings, the shortest regular expression that matches them may be their current sequence of characters - with any characters with regex meanings escaped and with `^` and `$` added - making for longer strings. – Damien_The_Unbeliever May 14 '15 at 12:28
  • I will elaborate on the strings that I expect a little more, though really, it shouldn't matter if the solution is general. I expect the strings to be something like ABCCCCCCCCABCCCCCCCCCCBBC. I'm updating my question now. – Michael Anslow May 14 '15 at 12:41
  • Only bored programmers on [codegolf](http://codegolf.stackexchange.com/) would do this, I believe. – YOU May 14 '15 at 12:54
  • You seem to be interested in obtaining the shortest regex that matches exactly the input string and just that string. So it must be a regex that matches only one string. Is that accurate? What is your goal? Data compression? LZ methods are superior to this, it seems. – usr May 14 '15 at 12:56
  • @usr yes, that is accurate. Imagine that the strings are representations of paths from two graph databases. A path is represented as a string by concatenating the labels of the relations in the path in order, e.g., relation1relation2relation3relation4 where some of these relations have the same labels. The graph databases are on different machines and it is desirable to reduce the communication overhead involved in communicating paths between machines. Both graph databases can interpret regular expressions across relations and so compression using regular expressions is a natural choice. – Michael Anslow May 14 '15 at 13:11
  • Seems like you can do it like this: Collapse repeating sequences (CCC => C{3}, ABAB => (AB){2}). Do this in a loop until there are no further changes. As a final pass un-collapse all non-profitable repetitions. Could even be optimal. – usr May 14 '15 at 13:14
  • Yes I think you could be right. – Michael Anslow May 14 '15 at 14:48
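The collapsing idea from the comments can be sketched roughly as follows. This is a greedy variation, not the exact loop described above: it picks the repetition that saves the most characters, skips unprofitable ones up front instead of un-collapsing them later, and recurses on the pieces. It is not guaranteed to be minimal, and it assumes the input contains no regex metacharacters (for example, only A-Z labels). The function name `compress` is hypothetical.

```javascript
// Greedy heuristic: collapse the most profitable repetition, then recurse.
function compress(s) {
  let best = null;
  for (let i = 0; i < s.length; i++) {
    for (let len = 1; i + 2 * len <= s.length; len++) {
      const unit = s.slice(i, i + len);
      let count = 1;
      while (s.startsWith(unit, i + count * len)) count++;
      if (count < 2) continue;
      // Length of unit{count}, plus two parentheses for multi-character units.
      const encoded = len + (len > 1 ? 2 : 0) + 2 + String(count).length;
      const saving = len * count - encoded;
      if (saving > 0 && (best === null || saving > best.saving)) {
        best = { i, len, count, saving };
      }
    }
  }
  if (best === null) return s; // nothing profitable: keep the literal text
  const { i, len, count } = best;
  const inner = compress(s.slice(i, i + len));
  const body = len > 1 ? "(" + inner + ")" : inner;
  return compress(s.slice(0, i)) + body + "{" + count + "}" +
         compress(s.slice(i + len * count));
}

console.log(compress("ABCCCCCCCCABCCCCCCCCCCBBC")); // (ABC{8}){2}CCBBC
console.log(compress("ABCCCCABCCCCCCCCCCBBC"));     // ABCCCCABC{10}BBC
```

Anchoring the result with `^` and `$` gives a regex that matches only the original string. The nested loops make this quadratic or worse in the string length, so it suits short path strings rather than large inputs.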

1 Answer

This is not the same as your example, and not minimal in size, but here is one approach: collapse each run of four or more identical letters into X{n}.

// Collapse each run of 4 or more identical uppercase letters into X{n}.
function collapseRuns(s) {
  return s.replace(/([A-Z])\1{3,}/g, (run, ch) => ch + "{" + run.length + "}");
}

collapseRuns("ABCCCCCCCCABCCCCCCCCCCBBC") // "ABC{8}ABC{10}BBC"
collapseRuns("ABCCCCABCCCCCCCCCCBBC")     // "ABC{4}ABC{10}BBC"
YOU