For a more general regular expression, another option would be to recursively match the greedy regular expression against the previous match, discarding the first and last characters in turn to ensure that you're matching only a substring of the previous match. After matching "Marketing and Cricket on the Internet", we test both "arketing and Cricket on the Internet" and "Marketing and Cricket on the Interne" for submatches.
It goes something like this in C#...
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static IEnumerable<Match> SubMatches(Regex r, string input)
    {
        var result = new List<Match>();
        var matches = r.Matches(input);
        foreach (Match m in matches)
        {
            result.Add(m);
            if (m.Value.Length > 1)
            {
                // Drop the last character and look for shorter matches.
                string prefix = m.Value.Substring(0, m.Value.Length - 1);
                result.AddRange(SubMatches(r, prefix));
                // Drop the first character and look for later-starting matches.
                string suffix = m.Value.Substring(1);
                result.AddRange(SubMatches(r, suffix));
            }
        }
        return result;
    }
This version can, however, end up returning the same submatch several times. For example, it would find "Marmoset" twice in "Marketing and Marmosets on the Internet": first as a submatch of "Marketing and Marmosets on the Internet", then as a submatch of "Marmosets on the Internet".
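One way to avoid the repeated submatches is to remember which spans of the original input have already been reported. The following is only a sketch of that idea, not a definitive implementation. The names `SubMatchDemo`, `DistinctSubMatches` and `Collect` are made up for illustration. It returns (index, value) pairs rather than `Match` objects, because the `Index` of a `Match` found in a recursive call is relative to the substring that was searched, not to the original input, so the offset has to be carried along by hand.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubMatchDemo
{
    // Returns every distinct matched span once, with Index relative to the original input.
    public static List<(int Index, string Value)> DistinctSubMatches(Regex r, string input)
    {
        var seen = new HashSet<(int, int)>();
        var result = new List<(int Index, string Value)>();
        Collect(r, input, 0, seen, result);
        return result;
    }

    // 'offset' is the position of 'text' within the original input.
    private static void Collect(Regex r, string text, int offset,
        HashSet<(int, int)> seen, List<(int Index, string Value)> result)
    {
        foreach (Match m in r.Matches(text))
        {
            int start = offset + m.Index;

            // HashSet.Add returns false when this (start, length) span was already
            // recorded; its submatches were explored then, so skip it entirely.
            if (!seen.Add((start, m.Length)))
                continue;

            result.Add((start, m.Value));

            if (m.Length > 1)
            {
                // Drop the last character: the substring still starts at 'start'.
                Collect(r, m.Value.Substring(0, m.Length - 1), start, seen, result);
                // Drop the first character: the substring starts one position later.
                Collect(r, m.Value.Substring(1), start + 1, seen, result);
            }
        }
    }
}
```

Assuming the pattern is something like `Mar.*et` (the actual pattern is not shown here), calling `SubMatchDemo.DistinctSubMatches(new Regex("Mar.*et"), "Marketing and Marmosets on the Internet")` reports "Marmoset" only once. Skipping spans that were already seen also stops the recursion from exploring the same substring again.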