How to learn regular expressions

Question

Given a list of words, I want to construct a simple regular expression that matches at least all of those words (and possibly more).

I want an algorithm for that: its input is a list of words and its output is a regular expression. Obviously, there will be some restrictions. If the target language is infinite and I only supply a finite number of words, then either the regular expression will always have to match more than the given words, or I will need some more compact representation of the input. Alternatively, I am thinking about supplying an existing regular expression plus a list of additional words and getting back a regular expression that matches all of them together (and maybe more). In any case, the algorithm should try to construct a regular expression that is as simple as possible.

What techniques are available that can do that?

I was quite misunderstood. I know the general principles behind regular expressions and what they are, and in most cases I can quite easily come up with a regular expression for a given language by hand. But I am looking for algorithms that do that.

Formulated a bit differently:

Let L be a regular language. Let M_n be a finite subset of L with n elements. Let M_n be a subset of M_(n+1).

I want to have an algorithm LRE which gets a finite set of words and outputs a regular expression. And I want to have the property:

lim_n->infinity | diff( LRE(M_n), L ) | = 0
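
One way to write this condition out, as a sketch only: I am assuming that `diff` means the symmetric difference of two languages and that the regular expression is compared through the language it denotes. The symbols \mathcal{L}(\cdot) and \triangle are notation introduced here, not part of the original statement.

```latex
\[
  M_1 \subseteq M_2 \subseteq \dots \subseteq L,
  \qquad
  \lim_{n \to \infty}
  \bigl|\, \mathcal{L}\bigl(\mathrm{LRE}(M_n)\bigr) \,\triangle\, L \,\bigr| = 0,
  \qquad
  A \,\triangle\, B = (A \setminus B) \cup (B \setminus A).
\]
```

Under that reading, the cardinality is a non-negative integer or infinite, so a limit of 0 means the output denotes exactly L for all sufficiently large n.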

@Keng: Usually, the task to generalize from a bunch of samples is called "learning" in computer science. — Albert, Dec 10 '10 at 19:28
Oh, you want the algorithm to continuously get better and better at building the RE statements.... — Keng, Dec 10 '10 at 19:52
@Keng: No, by giving it more and more words I want to continuously increase the probability that the constructed RE matches the regular language of these words. — Albert, Dec 10 '10 at 20:07
@Albert -- on SO, the `learning` tag is more associated with a human trying to learn something; I updated it to `machine-learning` — Michael Paulukonis, Dec 10 '10 at 21:27
possible duplicate of http://stackoverflow.com/questions/895425/automatic-regex-builder — Michael Paulukonis, Dec 10 '10 at 21:38
@Michael: Ah, that indeed seems to be related (a bit too specific though). Btw., they refer to this as *language learning* or *language inference*. — Albert, Dec 11 '10 at 15:37

FrustratedWithFormsDesigner · Answer 1 · 2010-12-10T16:50:57.843

2

See this website to learn the general principles: http://www.regular-expressions.info/

If all you have is a list of words such as dog, cat, cow, mouse, the simplest regex to match any of these would be: dog|cat|cow|mouse, but note that it will also match doggone, scatological, etc. It may or may not match DOGGONE, COWPATTY, etc., depending on whether or not you are doing case-sensitive matching. Better patterns can be given if you provide more particulars about your problem.
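
A minimal sketch of that behaviour, assuming Python's `re` module (other engines expose the same distinction under different names). The variable names are just for illustration.

```python
import re

words = ["dog", "cat", "cow", "mouse"]

# Escape each word so any regex metacharacters are taken literally,
# then join the words into one alternation.
pattern = "|".join(re.escape(w) for w in words)

# search() looks for the pattern anywhere in the string,
# so longer words containing a listed word match too.
print(bool(re.search(pattern, "doggone")))        # True
print(bool(re.search(pattern, "scatological")))   # True

# fullmatch() requires the whole string to be one of the words.
print(bool(re.fullmatch(pattern, "doggone")))     # False
print(bool(re.fullmatch(pattern, "dog")))         # True

# Case sensitivity is controlled by a flag, not by the pattern text.
print(bool(re.search(pattern, "DOGGONE")))                 # False
print(bool(re.search(pattern, "DOGGONE", re.IGNORECASE)))  # True
```

Whether "doggone" counts as a match therefore depends on whether the tool searches for the pattern or matches the whole string against it.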

It's also a good idea to get a regex testing tool. I like Expresso, which is good for .NET patterns. Since regex capabilities may vary between platforms, make sure your tool supports your platform.

edited Dec 10 '10 at 16:50

answered Dec 10 '10 at 16:45

FrustratedWithFormsDesigner

`doggone` is not in `L(dog)`. Only `dogΣ*` would also match `doggone`; `dog` matches only a prefix of `doggone`. But maybe some practical tools also accept it in that case; I don't know exactly. – Albert Dec 10 '10 at 19:02
Hm, where exactly does the website tell me how to construct a regular expression from a list of words? Can you post a direct link to the algorithm? – Albert Dec 10 '10 at 19:03
@Albert: I'm a bit confused, are you trying to build a regular expression engine for a particular subset of regexes? – FrustratedWithFormsDesigner Dec 10 '10 at 20:12
No, I want to have an algorithm which tries to "guess" a good regular expression based on a list of words (i.e. which tries to find regular patterns in it). – Albert Dec 11 '10 at 15:31

score 1 · Accepted Answer · answered Dec 21 '10 at 17:06

This problem has been studied over the last decade. You might want to search for "DFA learning" and download a couple of papers to get a sense of the state of the art.

Once you have the DFA, generating a regular expression from it is straightforward. To avoid the problems @FrustratedWithDesign mentions, some conditions are introduced, such as generating the DFA with the fewest states. From a machine learning point of view, this is similar to having a regularization condition that favors the simplest hypothesis.
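
As a rough illustration of the "fewest states" idea, here is a minimal sketch, not a full DFA learner. It builds a prefix tree acceptor from the sample words and merges states that accept exactly the same continuations. The function names are hypothetical. The result accepts exactly the sample and nothing more; state-merging learners such as RPNI start from the same kind of tree but merge more aggressively, guided by negative examples.

```python
from typing import Dict, List, Tuple

Transitions = List[Dict[str, int]]

def build_pta(words: List[str]) -> Tuple[Transitions, List[bool]]:
    """Prefix tree acceptor: state 0 is the start, one state per distinct prefix."""
    trans: Transitions = [{}]
    accept: List[bool] = [False]
    for w in words:
        s = 0
        for ch in w:
            if ch not in trans[s]:
                trans.append({})
                accept.append(False)
                trans[s][ch] = len(trans) - 1
            s = trans[s][ch]
        accept[s] = True
    return trans, accept

def minimize(trans: Transitions, accept: List[bool]) -> Tuple[Transitions, List[bool], int]:
    """Merge states with identical right languages (works because the PTA is a tree)."""
    ids: Dict[tuple, int] = {}
    new_trans: Transitions = []
    new_accept: List[bool] = []

    def visit(s: int) -> int:
        # A state's signature is its accept flag plus its already-merged children.
        sig = (accept[s], tuple(sorted((ch, visit(t)) for ch, t in trans[s].items())))
        if sig not in ids:
            ids[sig] = len(new_trans)
            new_trans.append(dict(sig[1]))
            new_accept.append(accept[s])
        return ids[sig]

    start = visit(0)
    return new_trans, new_accept, start

def accepts(trans: Transitions, accept: List[bool], start: int, word: str) -> bool:
    s = start
    for ch in word:
        if ch not in trans[s]:
            return False
        s = trans[s][ch]
    return accept[s]

words = ["dog", "dig", "log", "lag"]
pta_trans, pta_accept = build_pta(words)
min_trans, min_accept, min_start = minimize(pta_trans, pta_accept)
print(len(pta_trans), "states in the prefix tree")
print(len(min_trans), "states after merging equivalent states")
print(accepts(min_trans, min_accept, min_start, "dig"))
```

Generalizing beyond the sample requires some extra bias, for example negative examples or a bound on the number of states, which is where the regularization analogy comes in.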

score 0 · Answer 3 · answered Dec 10 '10 at 16:46

Use this site to learn the basics and use Rubular for live testing.

meder omuraliev

You misunderstood me. I want to construct it algorithmically. I reformulated my question a bit. – Albert Dec 10 '10 at 19:11

score 0 · Answer 4 · edited May 23 '17 at 11:47

If you have a list of distinct words that you want to match, it doesn't sound like you're matching on something that a regular expression is best at.

As FrustratedWithFormsDesigner pointed out, in the worst case your regex will simply enumerate the items in the list; in the best case you can factor out common prefixes. And if you automate the regex construction, why bother with the regex? What is the use case?
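
The common-prefix case can be sketched as follows, assuming Python's `re` syntax. The words are stored in a trie and the trie is turned back into a pattern that shares prefixes. The helper names and the `END` marker are hypothetical, and the result matches exactly the given words without generalizing.

```python
import re
from typing import Dict, List

END = ""  # hypothetical marker key meaning "a word ends at this node"

def build_trie(words: List[str]) -> Dict:
    root: Dict = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def trie_to_regex(node: Dict) -> str:
    """Return a regex matching exactly the suffixes stored below `node`."""
    can_end = END in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != END]
    if not branches:
        return ""
    if len(branches) == 1 and not can_end:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word may also end here, everything that follows is optional.
    return body + "?" if can_end else body

words = ["dog", "dot", "do", "cat"]
pattern = trie_to_regex(build_trie(words))
print(pattern)  # (?:cat|do(?:g|t)?)
print(all(re.fullmatch(pattern, w) for w in words))  # True
print(bool(re.fullmatch(pattern, "dove")))           # False
```

For a plain membership test a set lookup is simpler. A pattern like this is mainly useful when the words have to be found inside larger text.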

But if your list is beyond a trivial size, you'd probably be better off looping through it.

answered Dec 10 '10 at 16:50

Michael Paulukonis

You misunderstood me. I want to construct it algorithmically. I reformulated my question a bit. – Albert Dec 10 '10 at 19:12

score 0 · Answer 5 · answered Dec 10 '10 at 17:51

http://www.regular-expressions.info is a fantastic site for Regex Reference.

When building a complex regex, I typically use Expresso (http://www.ultrapico.com/Expresso.htm), a free app that helps you build regular expressions. It breaks them down into a tree view so that it is easy to see what each part is doing. It is made to work with .NET languages, but there are plenty of similar tools available for other languages.

To build my Regex, I'll usually start with an acceptable value and start replacing characters with Regex syntax.

For example, if I were trying to match a URL, I would start with:

http://www.mydomain.com

I would then escape anything that needs escaping:

http://www\.mydomain\.com

Then I would start replacing characters:

http://www\.\w+\.\w+\.\w+

Obviously this expression needs some more work, but you get the idea.
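
The same steps can be checked quickly in code. This is a sketch using Python's `re` module rather than .NET. The `revised` pattern is only one guess at the "more work" that is needed, not the pattern the answer intends.

```python
import re

sample = "http://www.mydomain.com"

# Step 1: escape the literal sample so "." no longer means "any character".
escaped = re.escape(sample)
print(bool(re.fullmatch(escaped, sample)))      # True

# Step 2: the generalized pattern from the answer, copied verbatim.
from_answer = r"http://www\.\w+\.\w+\.\w+"

# It demands three dot-separated parts after "www",
# so the sample URL it was derived from no longer matches.
print(bool(re.fullmatch(from_answer, sample)))  # False

# One possible revision (an assumption about the intent):
# allow one or more ".label" parts after "www".
revised = r"http://www(?:\.\w+)+"
print(bool(re.fullmatch(revised, sample)))                      # True
print(bool(re.fullmatch(revised, "http://www.example.co.uk")))  # True
```

Testing each generalization step against the original sample catches this kind of slip early.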

You misunderstood me. I want to do that algorithmically. I reformulated my question a bit. — Albert, Dec 10 '10 at 19:10

score 0 · Answer 6 · answered Dec 10 '10 at 18:50

Here is a site for Perl regex:

http://perldoc.perl.org/perlre.html

You misunderstood me. I want to do that algorithmically. I reformulated my question a bit. – Albert Dec 10 '10 at 19:10
