
I am trying to create a regex that matches a sentence.

Here is a snippet.

local utf8 = require 'lua-utf8'
function matchsent(text)
  local text = text
  for sent in utf8.gmatch(text, "[^\r\n]+\.[\r\n ]") do
    print(sent)
    print('-----')
  end
end

However, it does not work like it would in Python, for example. I know that Lua uses a different set of patterns and that its regex capabilities are limited, but why does the regex above give me a syntax error? And what would a sentence-matching regex look like in Lua?


1 Answer


Note that Lua uses Lua patterns, which are not "regular" expressions: they lack features such as alternation, so they cannot express every regular language. They can hardly be used to split a text into sentences, since you'd need to account for various abbreviations, spacing, case, etc. To split a text into sentences, you need an NLP package rather than one or two regexps, due to the complexity of the task.

Regarding

why does the regex above give me a syntax error?

you need to escape special symbols with `%` in Lua patterns. The syntax error itself comes from the `\.` in the string literal, since Lua 5.2 and later reject unknown escape sequences in strings; the pattern escape for a literal dot is `%.`. See this example code:

function matchsent(text)
    -- %. matches a literal dot; an unescaped . would match any single character
    for sent in string.gmatch(text, '[^\r\n]+%.[\r\n ]') do
        print(sent)
        print("---")
    end
end
matchsent("Some text here.\nShow me")

An online demo
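
As a side note, here is a minimal sketch of that escaping rule (plain Lua, no external modules; the string "ab.cd" is just an illustrative value): an unescaped `.` matches any character, while `%.` matches a literal dot.

local s = "ab.cd"
print(string.match(s, "a."))   --> ab   (unescaped . matches any single character)
print(string.match(s, "b%."))  --> b.   (%. matches a literal dot)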

Wiktor Stribiżew
  • Yeah, that works, but how about the sentence "Dr.Bonn was hiding in a tree."? It will be split incorrectly. – minerals Sep 05 '16 at 09:47
  • Note that Lua patterns are not regular expressions, and are by default unable to match sentences with, say, abbreviations. You might want to use something like `%.%s+(%u)`, replace it with, say, `.§%1` and then use `[^§]+` with `gmatch` to "split" into "sentences" (a rough sketch of this idea follows these comments), but that will still be approximate since it won't be able to tell `. New sentence` from `. Dr. Bonn`. See https://ideone.com/rvzAtX. Note you cannot use anything like Python's `(?:(?:Dr|Mrs)\.|[^\r\n.])+\.` in Lua, as Lua patterns do not support alternation, let alone quantifying groups. – Wiktor Stribiżew Sep 05 '16 at 10:00
  • Yes, sentence tokenization is a separate topic in its own right; I just hoped I could get away with some approximate Pythonic regex. – minerals Sep 05 '16 at 10:04
  • @minerals: Actually, even in Python you'd be better off approaching this task with `nltk`. In Ruby, there is a [regex based library to split into sentences](https://github.com/apohllo/srx-english/blob/master/lib/srx/english/sentence_splitter.rb). However, it cannot be ported to a Lua pattern-based solution due to its heavy use of alternation. – Wiktor Stribiżew Sep 05 '16 at 10:06
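
For reference, here is a rough sketch of the marker-based trick mentioned in the comments, with one assumption on my part: it uses the plain string library and a `\1` control byte as the marker instead of `§`, so the character class stays byte-safe; the `splitsent` name and the test sentence are made up for illustration. As noted above, the result is still approximate and will cut "Dr. Bonn" apart.

function splitsent(text)
    -- Mark a boundary after a dot that is followed by whitespace and an
    -- uppercase letter, then split on the marker byte.
    local marked = string.gsub(text, "%.%s+(%u)", ".\1%1")
    for sent in string.gmatch(marked, "[^\1]+") do
        print(sent)
        print("---")
    end
end
splitsent("Some text here. Show me more. Dr. Bonn was hiding in a tree.")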