17

In Lua, I'm trying to pattern match and capture:

+384 Critical Strike (Reforged from Parry Chance)

as

(+384) (Critical Strike)

where the suffix (Reforged from %s) is optional.

Long version

I'm trying to match a string in Lua using patterns (i.e. strfind)

Note: In Lua they don't call them regular expressions, they call them patterns because they're not regular.

Example strings:

+384 Critical Strike
+1128 Hit

This is broken down into two parts that I want to capture:

enter image description here

  • The number, with the leading positive or negative indicator; int his case is +384
  • The string, in this case is Critical Strike.

I can capture these using a fairly simple pattern:

enter image description here

And this pattern in lua works:

local text = "+384 Critical Strike";
local pattern = "([%+%-]%d+) (.+)";
local _, _, value, stat = strfind(text, pattern);
  • value = +384
  • stat = Critical Strike

The Tricky Part

Now I need to expand that regular expression pattern to include an optional suffix:

+384 Critical Strike (Reforged from Parry Chance)

Which is broken down into:

enter image description here

Note: I don't particularly care about the optional trailing suffix; meaning that I have no requirement to capture it, Although capturing it would be handy.

This is where I start to get into issues with greedy capturing. Right away the pattern I already have does what I don't want it to:

  • pattern = ([%+%-]%d+) (.+)
  • value = +384
  • stat = Critical Strike (Reforged from Parry Chance)

But let's try to include the suffix in the pattern:

enter image description here

with the pattern:

pattern = "([%+%-]%d+) (.+)( %(Reforged from .+%))?"

And I'm using the ? operator to indicate 0 or 1 appearances of the suffix but that matches nothing.

I blindly tried changing the optional suffix group from parenthesis ( to brackets [:

pattern = "([%+%-]%d+) (.+)[ %(Reforged from .+%)]?"

But now the match is greedy again:

  • value = +384
  • stat = Critical Strike (Reforged from Parry Chance)

Based on the Lua pattern reference):

  • x: (where x is not one of the magic characters ^$()%.[]*+-?) represents the character x itself.
  • .: (a dot) represents all characters.
  • %a: represents all letters.
  • %c: represents all control characters.
  • %d: represents all digits.
  • %l: represents all lowercase letters.
  • %p: represents all punctuation characters.
  • %s: represents all space characters.
  • %u: represents all uppercase letters.
  • %w: represents all alphanumeric characters.
  • %x: represents all hexadecimal digits.
  • %z: represents the character with representation 0.
  • %x: (where x is any non-alphanumeric character) represents the character x. This is the standard way to escape the magic characters. Any punctuation character (even the non-magic) can be preceded by a '%' when used to represent itself in a pattern.
  • [set]: represents the class which is the union of all characters in set. A range of characters can be specified by separating the end characters of the range with a '-'. All classes %x described above can also be used as components in set. All other characters in set represent themselves. For example, [%w_] (or [_%w]) represents all alphanumeric characters plus the underscore, [0-7] represents the octal digits, and [0-7%l%-] represents the octal digits plus the lowercase letters plus the '-' character. The interaction between ranges and classes is not defined. Therefore, patterns like [%a-z] or [a-%%] have no meaning.
  • [^set]: represents the complement of set, where set is interpreted as above.

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.

The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.

and the magic matchers:

  • *, which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
  • +, which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
  • -, which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
  • ?, which matches 0 or 1 occurrence of a character in the class;

I noticed that there's a greedy *, and a non-greedy - modifier. Since my middle string matcher:

(%d) (%s) (%s)

seems to be absorbing text until the end, perhaps i should try to make it non-greedy, by changing the * to a -:

oldPattern = "([%+%-]%d+) (.*)[ %(Reforged from .+%)]?"
newPattern = "([%+%-]%d+) (.-)[ %(Reforged from .+%)]?"

Except now it fail to match:

  • value = +384
  • stat = nil

Rather than the middle group capturing "any" character (i.e. .), I tried a set that contains everything except (:

pattern = "([%+%-]%d+) ([^%(]*)( %(Reforged from .+%))?"

and from there the wheels came off the wagon:

local pattern = "([%+%-]%d+) ([^%(]*)( %(Reforged from .+%))?"
local pattern = "([%+%-]%d+) ((^%()*)( %(Reforged from .+%))?"
local pattern = "([%+%-]%d+) (%a )+)[ %(Reforged from .+%)]?"

I thought that I was close with:

local pattern = "([%+%-]%d+) ([%a ]+)[ %(Reforged from .+%)]?"

which captures

- value = "+385"
- stat = "Critical Strike "  (notice the trailing space)

So this is where I bang my head against the pillow and go to sleep; I can't believe I've spent four hours on this regex....pattern.


@NicolBolas The set of all possible strings, defined using a pseudo-regular expression language, are:

+%d %s (Reforged from %s)

where

  • + represents either the Plus Sign (+) or the "Minus Sign" (-)
  • %d represents any latin digit character (e.g. 0..9)
  • %s represents any latin uppercase or lowercase letters, or embedded spaces (e.g. A-Za-z)
  • the remaining characters are literals.

If i had to write a regular expression that obviously tries to do what i want:

\+\-\d+ [\w\s]+( \(Reforged from [\w\s]+\))?

But I can give you near practically complete list of all values I'm likely to encounter in the wild if I didn't explain it well enough.

  • +123 Parry positive number, single word
  • +123 Critical Strike positive number, two words
  • -123 Parry negative number, single word
  • -123 Critical Strike negative number, two words
  • +123 Parry (Reforged from Dodge) positive number, single word, optional suffix present with single word
  • +123 Critical Strike (Reforged from Dodge) positive number, two words, optional suffix present with two words
  • -123 Parry (Reforged from Hit Chance) negative number, single word, optional suffix present with two words
  • -123 Critical Strike (Reforged from Hit Chance) negative number, two words, optional suffix present with two words

There are bonus patterns it would seem obvious that the patterns would also match:

  • +1234 Critical Strike Chance four digit number, three words
  • +12345 Mount and run speed increase five digit number, five words
  • +123456 Mount and run speed increase six digit number, five words
  • -1 MoUnT aNd RuN sPeEd InCrEaSe one digit number, five words
  • -1 HiT (Reforged from CrItIcAl StRiKe ChAnCe) negative one digit number, one word, optional suffix present with 3 words

And while the ideal pattern should match the above bonus entries, it does not have to.

Localization

In reality all "numbers" i am attempting to parse out will be localized, e.g.:

  • +123,456 in English (en-US)
  • +123.456 in Germany (de-DE)
  • +123'456 in French (fr-CA)
  • +123 456 in Estonian (et-EE)
  • +1,23,456 in Assamese (as-IN)

Any answer must not attempt to account for these localization issues. You do not know the locale a number will be presented from, that is why the number localization has been removed from the question. You must strictly assume that numbers contain plus sign, hyphen minus, and latin digits 0 through 9. I already know how to parse localized numbers. This question is about trying to match the optional suffix with a greedy pattern parser.

Edit: You really didn't have to try to handle localized number. At some level trying to handle them, without knowing the locale, is wrong. For example, I didn't include all possible localizations of numbers. For another: I don't know what future localizations might exist in the future.

Ian Boyd
  • 246,734
  • 253
  • 869
  • 1,219
  • It's difficult to provide an answer, as you don't seem to describe what all of the possible strings could be. – Nicol Bolas Nov 29 '12 at 05:54
  • The function `strfind` is from legacy version Lua 4.0. Are you really using that old release? – prapin Nov 29 '12 at 06:23
  • @prapin Judging from the text content, this is Lua in World of Warcraft, which is Lua 5.1. `strfind` is probably just `strfind = string.find`. – Mud Nov 29 '12 at 07:14

4 Answers4

8

Hmm I don't have Lua4 installed but this pattern works under Lua5. I would expect it to work for Lua4 as well.

Update 1: Since additional requirements have been specified (localization) I've adapted the pattern and the tests to reflect these.

Update 2: Updated the pattern and tests to deal with an additional class of text containing a number as mentioned by @IanBoyd in the comments. Added an explanation of the string pattern.

Update 3: Added variation for the case where the localized number is dealt with separately as mentioned in the last update to the question.

Try:

"(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"

or (no attempt to validate number localization tokens) - just take anything which is not a letter with a digit sentinel at the end of the pattern:

"(([%+%-][^%a]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"

Neither of the patterns above are meant to deal with numbers in scientific notation (e.g: 1.23e+10)

Lua5 test (Edited to clean up - tests getting cluttered):

function test(tab, pattern)
   for i,v in ipairs(tab) do
     local f1, f2, f3, f4 = v:match(pattern)
     print(string.format("Test{%d} - Whole:{%s}\nFirst:{%s}\nSecond:{%s}\nThird:{%s}\n",i, f1, f2, f3, f4))
   end
 end

 local pattern = "(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
 local testing = {"+123 Parry",
   "+123 Critical Strike",
   "-123 Parry",
   "-123 Critical Strike",
   "+123 Parry (Reforged from Dodge)",
   "+123 Critical Strike (Reforged from Dodge)",
   "-123 Parry (Reforged from Hit Chance)",
   "-123 Critical Strike (Reforged from Hit Chance)",
   "+122384    Critical    Strike      (Reforged from parry chance)",
   "+384 Critical Strike ",
   "+384Critical Strike (Reforged from parry chance)",
   "+1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+12345 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+123456 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
   "-1 MoUnT aNd RuN sPeEd InCrEaSe (Reforged from CrItIcAl StRiKe ChAnCe)",
   "-1 HiT (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+123,456 +1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+123.456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+123'456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+123 456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+1,23,456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
   "+9 mana every 5 sec",
   "-9 mana every 20 min (Does not occurr in data but gets captured if there)"}
 test(testing, pattern)

Here's a breakdown of the pattern:

local explainPattern =  
   "(" -- start whole string capture
   ..
   --[[
   capture localized number with sign - 
   take at first as few digits and separators as you can 
   ensuring the capture ends with at least 1 digit
   (the last digit is our sentinel enforcing the boundary)]]
   "([%+%-][',%.%d%s]-[%d]+)" 
   ..
   --[[
   gobble as much space as you can]]
   "%s*"
   ..
   --[[
   capture start with letters, followed by anything which is not a bracket 
   ending with at least 1 letter]]
   "([%a]+[^%(^%)]+[%a]+)"
   ..
   --[[
   gobble as much space as you can]]
   "%s*"
   ..
   --[[
   capture an optional bracket
   followed by 0 or more letters and spaces
   ending with an optional bracket]]
   "(%(?[%a%s]*%)?)"
   .. 
   ")" -- end whole string capture
Anthill
  • 1,219
  • 10
  • 20
  • For "+384 Critical Strike" your pattern captures "+384", "Critical Strik", and "e". – Mud Nov 29 '12 at 07:20
  • The second capture is including the trailing space (the space preceding the `(` when the suffix is present). Don't think there's any getting around that. – Mud Nov 29 '12 at 07:56
  • Actually there is - it's a tricky one but updated the example. – Anthill Nov 29 '12 at 08:29
  • i used your previous incarnation of the pattern, as i have a library that knows how to match numbers that are (only) in the current locale. i did have to make one tweak to the answer in order to satisfy all my unit tests. i forgot about one class of text that i have to parse `+9 Mana every 5 sec.`. In that case i would want to capture `+9` and `Mana every 5 sec`. This class of input text would never (in practice) be followed by the optional suffix. i'll leave it as an exercise to figure out the adjustment to handle that case. – Ian Boyd Nov 30 '12 at 00:04
  • And good work; i never would have figured it out. Also because i have no idea how it is working now! – Ian Boyd Nov 30 '12 at 00:05
  • You're welcome - and thanks for the feedback. I've updated my answer with the additional use case you mentioned and added an explanation of the pattern. – Anthill Nov 30 '12 at 07:44
1

Why parse this in one pattern when you can use several?

First, get the number:

local num, rest = string.match(test_string, "([%+%-]?%d+)%S*(.+)")

Then make a table that enumerates the possibilities for the hit type.

local hitTypes =
{
  "Hit",
  "Critical Strike",
  -- Insert more.
}

Now, iterate over the list, testing against each one.

local hitIndex = nil
local reforge = nil

for i, htype in ipairs(hitTypes) do
  local final = string.match(rest, htype .. "%S*(.*)")
  if(final) then
    hitIndex = i
    reforge = string.match(final, "%(Reforged from (.+)%)")
  end
end

Lua patterns are limited, so it's best to use actual code to avoid their limitations.

Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982
  • "Why parse this in one pattern when you can use several?" This looks like combat log parsing in WoW. You could be doing a *lot* of these every frame and you don't want to kill your framerate. Much better to take 1 pass through the string than N, if you can help it. – Mud Nov 29 '12 at 07:15
  • @Mud: Then only do 5 of them every frame; buffer the rest and pull from the buffer each frame. – Nicol Bolas Nov 29 '12 at 08:22
  • The buffer will grow indefinitely as you fall further and further behind. Moot, really; I'm sure your approach is fast enough. I was just answer the question you posed: you take on pass through the data rather than several because it's faster ([probably](http://programmer.97things.oreilly.com/wiki/index.php/Measure_Don't_Guess)). In any case, Anthill posted a pattern that captures what the OP wanted in one pass. – Mud Nov 29 '12 at 08:49
1

It's much simpler, instead of just matching the pattern, you can look directly for the short output to the string you need, you can use it string.gsub

Example:

local testing = {"+123 Parry",
"+123 Critical Strike",
"-123 Parry",
"-123 Critical Strike",
"+123 Parry (Reforged from Dodge)",
"+123 Critical Strike (Reforged from Dodge)",
"-123 Parry (Reforged from Hit Chance)",
"-123 Critical Strike (Reforged from Hit Chance)",
"+122384    Critical    Strike      (Reforged from parry chance)",
"+384 Critical Strike ",
"+384Critical Strike (Reforged from parry chance)",
"+1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+12345 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123456 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
"-1 MoUnT aNd RuN sPeEd InCrEaSe (Reforged from CrItIcAl StRiKe ChAnCe)",
"-1 HiT (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123,456 +1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123.456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123'456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+123 456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+1,23,456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
"+9 mana every 5 sec",
"-9 mana every 20 min (Does not occurr in data but gets captured if there)"}

for k,v in ipairs(testing) do
  local result = string.gsub(v, "([%+%-][',%.%d%s]-[%+%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?)", '(%1) (%2) %3')
  print(result)
end

Output

(+123) (Parry) 
(+123) (Critical Strike) 
(-123) (Parry) 
(-123) (Critical Strike) 
(+123) (Parry) (Reforged from Dodge)
(+123) (Critical Strike) (Reforged from Dodge)
(-123) (Parry) (Reforged from Hit Chance)
(-123) (Critical Strike) (Reforged from Hit Chance)
(+122384) (Critical    Strike) (Reforged from parry chance)
(+384) (Critical Strike) 
(+384) (Critical Strike) (Reforged from parry chance)
(+1234) (Critical Strike Chance) (Reforged from CrItIcAl StRiKe ChAnCe)
(+12345) (Mount and run speed increase) (Reforged from CrItIcAl StRiKe ChAnCe)
(+123456) (Mount and run speed increase) (Reforged from CrItIcAl StRiKe ChAnCe)
(-1) (MoUnT aNd RuN sPeEd InCrEaSe) (Reforged from CrItIcAl StRiKe ChAnCe)
(-1) (HiT) (Reforged from CrItIcAl StRiKe ChAnCe)
(+123,456 +1234) (Critical Strike Chance) (Reforged from CrItIcAl StRiKe ChAnCe)
(+123.456) (Critical Strike Chance) (Reforged from CrItIcAl StRiKe ChAnCe)
(+123'456) (Critical Strike Chance) (Reforged from CrItIcAl StRiKe ChAnCe)
(+123 456) (Critical Strike Chance) (Reforged from CrItIcAl StRiKe ChAnCe)
(+1,23,456) (Critical Strike Chance) (Reforged from CrItIcAl StRiKe ChAnCe)
(+9) (mana every 5 sec) 
(-9) (mana every 20 min) (Does not occurr in data but gets captured if there)
Webrom
  • 23
  • 4
0
function match_some_stat_thing(str)
    local sign, amount, label, note = string.match(str.."()", "^([%+%-])(%d+) ([%a ]-) ?(%b())")
    return sign == "+" and amount or -amount, label, string.match(note, "%((.*)%)")
end
print(string.format("%d %q %q", match_some_stat_thing("+384 Critical Strike (Reforged from Parry Chance)")))
print(string.format("%d %q %q", match_some_stat_thing("+384 Critical Strike")))
print(string.format("%d %q %q", match_some_stat_thing("+384 Critical Strike ")))

Not a single pattern, but it works.

Deco
  • 5,112
  • 1
  • 16
  • 17