I need to detect URLs efficiently in an input stream during typesetting.
The URL detector will be part of the typesetting flow. It should accept one character at a time as input and should output one character at a time along with the URL the character belongs to. It can buffer text for lookahead in order to do this.
For example, if the input stream is "Hello http://foo.com World", the output should be:
"H": ""
"e": ""
"l": ""
"l": ""
"o": ""
" ": ""
"h": "http://foo.com"
"t": "http://foo.com"
"t": "http://foo.com"
"p": "http://foo.com"
":": "http://foo.com"
"/": "http://foo.com"
"/": "http://foo.com"
"f": "http://foo.com"
"o": "http://foo.com"
"o": "http://foo.com"
".": "http://foo.com"
"c": "http://foo.com"
"o": "http://foo.com"
"m": "http://foo.com"
" ": ""
"W": ""
"o": ""
"r": ""
"l": ""
"d": ""
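To make the interface concrete, here is a minimal hand-written sketch in Java of the behavior I'm after. It does not use Ragel. The names (`StreamingUrlTagger`, `Tagged`, `feed`, `finish`, `looksLikeUrl`) are made up for illustration, and the rule "a whitespace-delimited token that starts with http:// or https://" is only an assumed stand-in for a real URL grammar.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical streaming wrapper; not Ragel, just the shape of the interface. */
class StreamingUrlTagger {

    /** One output item: a character plus the URL it belongs to ("" if none). */
    static final class Tagged {
        final char ch;
        final String url;

        Tagged(char ch, String url) {
            this.ch = ch;
            this.url = url;
        }
    }

    // Lookahead buffer: holds the current run of non-whitespace characters
    // until we know whether that run is a URL.
    private final StringBuilder token = new StringBuilder();

    /** Feed one character; returns the tagged characters that are now decided (possibly none). */
    List<Tagged> feed(char c) {
        List<Tagged> out = new ArrayList<>();
        if (Character.isWhitespace(c)) {
            flush(out);                 // the buffered token is complete, so classify it
            out.add(new Tagged(c, "")); // whitespace never belongs to a URL here
        } else {
            token.append(c);            // still undecided, keep buffering
        }
        return out;
    }

    /** Call once at end of input to release whatever is still buffered. */
    List<Tagged> finish() {
        List<Tagged> out = new ArrayList<>();
        flush(out);
        return out;
    }

    private void flush(List<Tagged> out) {
        String text = token.toString();
        String url = looksLikeUrl(text) ? text : "";
        for (int i = 0; i < text.length(); i++) {
            out.add(new Tagged(text.charAt(i), url));
        }
        token.setLength(0);
    }

    // Assumed placeholder for the real recognizer (e.g. a generated state machine).
    private static boolean looksLikeUrl(String s) {
        for (String scheme : new String[] {"http://", "https://"}) {
            if (s.startsWith(scheme) && s.length() > scheme.length()) {
                return true;
            }
        }
        return false;
    }
}
```

The caller would invoke `feed` for each incoming character, typeset whatever comes back, and call `finish` at end of input. Output lags the input by at most one token, which is the lookahead buffering mentioned above. This sketch buffers an arbitrarily long token and treats trailing punctuation as part of the URL, so it is only meant to show the streaming shape, not a usable detector.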
Can Ragel be made to stream the input and output as needed?
Incidentally, there is a Ragel URL parser (in Java) here, which I'm thinking of using as a starting point.