I'm having a hard time porting a POSIX regex to Lua string patterns.

I'm dealing with an HTML response from which I would like to filter the checkboxes that are checked. In particular, I'm interested in the value and name fields of each checked checkbox.

Here are examples of checkboxes I'm interested in:

<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">

<input class="rid-3 form-checkbox real-checkbox" id="edit-3-administer-comments" name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">

By contrast, I'm not interested in this one (an unchecked checkbox):

<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">

Using POSIX regex, I used the following pattern in Python: pattern=r'name="(.*)" value="(.*)" checked="checked"' and it just worked.

My first approach in Lua was simply to use this: pattern = 'name="(.-)" value="(.-)" checked="checked"' but it gave strange results (the first capture was as expected, but the second one returned lots of unneeded HTML).

I've also tried the following pattern: pattern = 'name="(%d?%[.-%])" value="(.-)"%s?(c?).-="?c.-"%s?type="checkbox"'

This time the second capture returned the content of value, but all checkboxes were matched (not only those with the checked="checked" attribute).

For completeness, here's the Lua code (a snippet from my Nmap NSE script) that attempts this pattern matching:

  pattern = 'name="(.-)" value="(.-)" checked="checked"' 
  data = {}
  for name, value in string.gmatch(res.body, pattern) do
    stdnse.debug(1, string.format("%s %s", name, value))
  end
Yu Hao
mzet

2 Answers

Using POSIX regex, I used the following pattern in Python: pattern=r'name="(.*)" value="(.*)" checked="checked"' and it just worked.

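Python's re module is not POSIX-compliant: by default, . matches any character except a newline there, whereas in POSIX regex and in Lua patterns . matches any character, including a newline. That is why .- in the Lua pattern can run past the end of a line and into the following tags.

If you want to match a string that has the three attributes above one after another, you should use something like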
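To make the difference concrete, here is a minimal sketch on a made-up two-line input (the html string below is a hypothetical, shortened stand-in for the real response). It shows that . matches a newline in Lua and that the lazy .- in the second capture keeps expanding until it reaches the first checked="checked", even if that is in a later tag:

```lua
-- Hypothetical input: two tags on separate lines, only the second is checked.
local html = '<input name="a" value="1" type="checkbox">\n'
          .. '<input name="b" value="2" checked="checked" type="checkbox">'

-- In Lua patterns "." also matches "\n".
local dot_matches_newline = string.match("x\ny", "x.y") ~= nil

-- The original pattern from the question: the match starts in the first
-- (unchecked) tag and the second capture runs into the next tag.
local name, value = string.match(html, 'name="(.-)" value="(.-)" checked="checked"')

print(dot_matches_newline)
print(name)
print(value)
```

Here name comes back as a, while value swallows the rest of the first tag, the newline, and the start of the second tag, which matches the "lots of unneeded HTML" symptom described in the question.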

local pattern = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'

Why not [^\r\n]-? Because it only stops at line breaks. If two or more tags sit on one line, a match can start at name="..." in one tag and end at checked="checked" in a later tag: [^\r\n] also matches < and >, so it can "overfire" across tags.

Note that [^"]*, a negated bracket expression, matches only zero or more characters other than ", which keeps each capture inside one attribute value and so restricts the whole match to a single tag.
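The following sketch compares the two patterns on a hypothetical input where both tags sit on the same line (the one_line string and the variable names are made up for illustration):

```lua
-- Hypothetical input: an unchecked and a checked tag on ONE line.
local one_line = '<input name="a" value="1" type="checkbox">'
              .. '<input name="b" value="2" checked="checked" type="checkbox">'

local loose  = 'name="([^\r\n]-)" value="([^\r\n]-)" checked="checked"'
local strict = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'

-- The loose pattern starts in the first tag and ends in the second one.
local loose_name, loose_value = string.match(one_line, loose)
-- The strict pattern cannot cross a double quote, so it skips the first tag.
local strict_name, strict_value = string.match(one_line, strict)

print(loose_name, loose_value)
print(strict_name, strict_value)
```

With the loose pattern, the name comes from the unchecked tag and the value capture spans both tags; with the strict pattern, only b and 2 from the checked tag are returned.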

See Lua demo:

local rx = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'
local s = '<li name="n1"\nvalue="v1"><li name="n2"\nvalue="v1" checked="checked"><li name="n3"\nvalue="v3"   checked="checked">'
for name, value in string.gmatch(s, rx) do
  print(name, value)
end

Output:

n2  v1
n3  v3
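If the pairs need to be collected rather than printed (the question's snippet declares an unused data table), the loop could be wrapped in a small helper. This is only a sketch: checked_boxes is a hypothetical name, and it assumes the attributes always appear in the order name, value, checked, as in the samples from the question.

```lua
-- Hypothetical helper: returns a list of {name=..., value=...} tables,
-- one per checkbox whose name/value pair is directly followed by checked="checked".
local function checked_boxes(html)
  local pattern = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'
  local data = {}
  for name, value in string.gmatch(html, pattern) do
    data[#data + 1] = { name = name, value = value }
  end
  return data
end

-- Two of the sample tags from the question: one checked, one unchecked.
local sample = [[
<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">
<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">
]]

local boxes = checked_boxes(sample)
for _, box in ipairs(boxes) do
  print(box.name, box.value)
end
```

In an NSE script, the same function could be called with res.body in place of sample.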
Wiktor Stribiżew

(Updated based on comments) The pattern doesn't work when a line without checked="checked" precedes a line with checked="checked" in the input, because the .- expression then captures unwanted parts. There are several ways to avoid this: one, suggested by @EgorSkriptunoff, is to use ([^"]*) for the captures; another is to exclude newlines with ([^\r\n]-). The following example prints what you expect:

local s = [[
<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">
<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">
<input class="rid-3 form-checkbox real-checkbox" id="edit-3-administer-comments" name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">
]]
local pattern = 'name="([^\r\n]-)" value="([^\r\n]-)" checked="checked"' 
for name, value in string.gmatch(s, pattern) do
  print(name, value)
end

The output:

2[access comments]  access comments
3[administer comments]  administer comments
Paul Kulchenko