0

I'm trying to remove the first occurrence of a pattern from a string in Java.

Source string: DUMMY01012016DUMMY01012016

Format is 1-8 alpha-numeric characters followed by a date MMddyyyy followed by any number of alpha-numerics.

What I'm trying to achieve is to remove all leading characters up to and including the first date occurrence. So in the example above I would be left with DUMMY01012016.

Here is a simplified version of what I have tried: ".*\\d{4}(2016|2017|2015)"

That works well until the pattern is matched more than once. So in the example matcher.replaceFirst("") will replace the entire source string and not just the first occurrence.

Any thoughts would be greatly appreciated.

Thanks. Stephan

Guru Prasad
sduel
  • How is `.\d{4}(2016|2017|2015).?` matching your pattern? It does not follow your stated format of _1-8 alpha-numeric characters followed by a date MMddyyyy followed by any number of alpha-numerics._ –  Apr 14 '16 at 18:50
  • Possible duplicate of [Get the index of a pattern in a string using regex](http://stackoverflow.com/questions/8938498/get-the-index-of-a-pattern-in-a-string-using-regex) – flakes Apr 14 '16 at 18:54
  • By "alpha-numeric characters" do you actually mean *alphabetic* characters, i.e. letters? "Alphanumeric" includes digits. – John Bollinger Apr 14 '16 at 18:57

3 Answers

1

Your issue is that the * quantifier is greedy. It will cause the preceding sub-pattern to match as many times as possible without causing the overall match to fail (if a match is possible at all). Thus the tail of your pattern .*\d{4}(2016|2017|2015) will match the last occurrence of a date in the string, whereas you want it to match the first.

You can solve this problem by switching to a "reluctant" quantifier instead:

myString.replaceFirst(".*?\\d{4}(2016|2017|2015)", "");

There, *? is a reluctant quantifier: it matches zero or more instances of the preceding sub-pattern, as few as possible to enable an overall match (if an overall match is possible).
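To see the difference side by side, here is a minimal sketch that runs both quantifiers against the string from the question. The class and method names are made up for illustration; only String.replaceFirst and the two patterns come from the discussion above.

```java
// Hypothetical demo class; the names are illustrative, not from the question.
class GreedyVsReluctant {
    static final String DATE = "\\d{4}(2016|2017|2015)";

    // Greedy: .* takes as much as it can, so the match ends at the last date.
    static String stripGreedy(String s) {
        return s.replaceFirst(".*" + DATE, "");
    }

    // Reluctant: .*? takes as little as it can, so the match ends at the first date.
    static String stripReluctant(String s) {
        return s.replaceFirst(".*?" + DATE, "");
    }

    public static void main(String[] args) {
        String source = "DUMMY01012016DUMMY01012016";
        System.out.println("greedy:    \"" + stripGreedy(source) + "\"");
        System.out.println("reluctant: \"" + stripReluctant(source) + "\"");
    }
}
```

Note that .*? does not enforce the 1-8 character limit on the prefix; if that limit matters, a bounded prefix such as \w{1,8} may be a better fit.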

John Bollinger
0

This regex should work:

(\w{1,8}?\d{8})(?:\1)
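
The answer does not say how this pattern is meant to be applied. One possible reading, offered here as an assumption, is to replace the whole match with the first captured group via "$1". The sketch below does that; the class and method names are hypothetical. Keep in mind that the backreference \1 only matches when the second block is an exact repeat of the first.

```java
// Hypothetical demo class; the "$1" replacement is an assumed usage, not stated in the answer.
class BackreferenceDemo {
    // Pattern from the answer: a prefix-plus-date block followed by an exact repeat of itself.
    static final String REGEX = "(\\w{1,8}?\\d{8})(?:\\1)";

    // Keeps one copy of the repeated block by replacing the whole match with group 1.
    static String collapseRepeat(String s) {
        return s.replaceFirst(REGEX, "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeat("DUMMY01012016DUMMY01012016"));
        // A string whose second block differs is left unchanged, because \1 cannot match.
        System.out.println(collapseRepeat("DUMMY01012016OTHER02022017"));
    }
}
```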
Andy
Natecat
0

One of your problems is that .* is greedy: it first matches as much as it can. The regex engine then steps back character by character until a full match is found.

So, roughly:

Step 1) .* matches the whole DUMMY01012016DUMMY01012016

Step 2) The engine steps back symbol by symbol trying to match the remaining part: DUMMY01012016DUMMY0101201 -> DUMMY01012016DUMMY010120 -> DUMMY01012016DUMMY01012 -> .. -> DUMMY01012016DUMMY

Step 3) A complete match is found -> DUMMY01012016DUMMY01012016
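
The steps above can be observed directly by wrapping .* in a capturing group and inspecting what it holds once the match succeeds. This is only a sketch: the extra group and the class and method names are additions for illustration and are not part of the original pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the capturing group around .* is added only for inspection.
class BacktrackingDemo {
    static final Pattern GREEDY = Pattern.compile("(.*)\\d{4}(2016|2017|2015)");

    // Returns what .* holds after backtracking, or null if there is no match.
    static String greedyPrefix(String s) {
        Matcher m = GREEDY.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    // Returns the end index of the overall match, or -1 if there is no match.
    static int matchEnd(String s) {
        Matcher m = GREEDY.matcher(s);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String source = "DUMMY01012016DUMMY01012016";
        System.out.println("prefix kept by .*: " + greedyPrefix(source));
        System.out.println("match ends at:     " + matchEnd(source));
    }
}
```

For the string from the question, the prefix is DUMMY01012016DUMMY and the match ends at index 26, the end of the string, which is why replaceFirst removes everything.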

You can try something like this:

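import org.junit.Assert;
import org.junit.Test;

public class ReplaceFirstTest
{
    @Test
    public void testReplace()
    {
        String string = "DUMMY01012016DUMMY01012016";

        // The bounded \w{1,8} prefix keeps the match from reaching the second date.
        String replaced = string.replaceFirst("\\w{1,8}\\d{4}(2016|2017|2015)", "");

        Assert.assertEquals("DUMMY01012016", replaced);
    }
}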

To understand the difference between lazy and greedy matching, you can experiment by making the asterisk lazy with a trailing question mark, e.g. .*?\d{4}(2016|2017|2015). Then the engine does the opposite: it matches as little as possible at first and steps forward character by character.

Lachezar Balev