
I need to extract some strings from two types of files.

File type one, lines containing the subpattern:

l('some string')
l('some other string', $mod = "anything")

File type two, lines containing the subpattern:

{l s='some string' mod='anything'}
{l s='some other string' mod='anything'}

From both of these file types I want to parse out "some string", "some other string", ....

Now, which would be better for performance:

a) using preg_match_all - I am really struggling with this one, because the subpatterns can also contain the characters that wrap them (quotes, for example)...

b) using custom file reading/parsing (char by char, storing previous char and state, ...)

??

Thanks in advance.

jave.web
  • Probably best to use a language that's good at parsing. This isn't PHP's strong point. I suggest you write this code in Rebol. – Darth Egregious Sep 04 '15 at 19:41
  • Sadly, that is not an option; I really need to do this in PHP. I have a *feeling* it would be best to parse the file char by char myself, because I presume fgetc uses some basic native C/C++ code, so it should be faster than preg matching – jave.web Sep 04 '15 at 19:43
  • @Fuser97381 Also, it might be better because I won't have to store the whole file as a string, only the possible match, the previous character and the state ... :) – jave.web Sep 04 '15 at 19:44
  • If, as you say, subpatterns may contain anything, then what you ask is impossible. Consider `{l s='some' mod='words' mod='here'}`. You can't parse it unambiguously. However, if you know that quotes inside subpatterns are escaped, then it's doable with regexes; google "regex quotes within quotes" ([example](http://stackoverflow.com/questions/5551046/regular-expression-quotes-within-quotes)). – Serge Seredenko Sep 04 '15 at 19:53
  • Exactly my thought. Yes, quotes will always be escaped; however, the main question still remains... – jave.web Sep 04 '15 at 20:07
  • Do a benchmark, what's the problem? – Serge Seredenko Sep 04 '15 at 20:12
  • I mean, you could quickly prototype an automaton using `strpos` on quote with offset, just to make sure that it's faster than preg matching. – Serge Seredenko Sep 04 '15 at 20:15
  • Well, if that is true, checking with fgetc() is practically the same, so it should be better, thanks :) – jave.web Sep 04 '15 at 20:29
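
The benchmark suggested in the comments could be prototyped roughly as follows. This is only a sketch under the assumption, stated above, that quotes inside the strings are always backslash-escaped. The constants `CALL_PATTERN` and `TAG_PATTERN` and the functions `extractWithRegex`, `extractWithStrpos` and `timeIt` are hypothetical names made up for illustration. Both extractors return the raw string with its escape sequences left in place, and the `strpos` scanner does no word-boundary check, so an opener such as `l('` would also match inside a longer function name.

```php
<?php

// Hypothetical patterns for the two file types. Both assume that quotes
// inside a string are backslash-escaped.
const CALL_PATTERN = <<<'RE'
~\bl\(\s*'((?:[^'\\]|\\.)*)'~s
RE;
const TAG_PATTERN = <<<'RE'
~\{l\s+s='((?:[^'\\]|\\.)*)'~s
RE;

// Option (a): let PCRE collect every first capture group.
function extractWithRegex(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

// Option (b), simplified: jump to each opener with strpos, then jump from
// quote to quote until one is found that is not escaped.
function extractWithStrpos(string $text, string $opener): array
{
    $found = [];
    $offset = 0;
    $openerLen = strlen($opener);

    while (($start = strpos($text, $opener, $offset)) !== false) {
        $begin = $start + $openerLen; // first character of the string body
        $pos = $begin;

        while (($pos = strpos($text, "'", $pos)) !== false) {
            // Count the backslashes directly before this quote.
            $backslashes = 0;
            for ($i = $pos - 1; $i >= $begin && $text[$i] === '\\'; $i--) {
                $backslashes++;
            }
            if ($backslashes % 2 === 0) {
                break; // even count: the quote itself is not escaped
            }
            $pos++; // escaped quote, keep looking
        }

        if ($pos === false) {
            break; // unterminated string: stop scanning
        }

        $found[] = substr($text, $begin, $pos - $begin);
        $offset = $pos + 1;
    }

    return $found;
}

// Crude timing helper: total wall-clock seconds for $runs calls.
function timeIt(callable $fn, int $runs = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$sample = "l('some string')\nl('some other string', \$mod = \"anything\")\n";

printf("regex:  %.6f s\n", timeIt(function () use ($sample) {
    extractWithRegex($sample, CALL_PATTERN);
}));
printf("strpos: %.6f s\n", timeIt(function () use ($sample) {
    extractWithStrpos($sample, "l('");
}));
```

The numbers only mean something when measured on real input files and on the PHP version actually in use; the two-line sample here is just enough to check that both functions run.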

0 Answers