How can I use TPerlRegEx with the regular Delphi string type while avoiding any UTF-8 <-> UTF-16 string conversions? It seems Delphi XE5 doesn't come with a UTF-16-capable PCRE library.

http://qc.embarcadero.com/wc/qcmain.aspx?d=108941

As of version 8.30 PCRE supports Unicode.

user3060326

2 Answers

AFAIK the PCRE library embedded in Delphi is compiled with the UTF-8 APIs, not the UTF-16 APIs.

But once again, UTF-8 is just as Unicode-ready as UTF-16! So the PCRE version embedded in Delphi XE5 is 100% Unicode-ready... :)
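As a small illustration of that point (not Delphi code: Python is used here only as a neutral stand-in, and the sample string and the helper name `round_trips` are made up for the example), both encodings carry the same Chinese text without loss:

```python
# Illustration only: UTF-8 and UTF-16 are both complete Unicode encodings.
# SAMPLE and round_trips are hypothetical names chosen for this sketch.

SAMPLE = "正则表达式 regex 测试"  # mixed Chinese and ASCII text


def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives an encode/decode cycle unchanged."""
    return text.encode(encoding).decode(encoding) == text


if __name__ == "__main__":
    for enc in ("utf-8", "utf-16-le"):
        encoded = SAMPLE.encode(enc)
        # Byte counts may differ between encodings; the decoded text does not.
        print(enc, len(encoded), "bytes, lossless:", round_trips(SAMPLE, enc))
```

The same holds for characters outside the Basic Multilingual Plane: UTF-8 encodes them as four bytes and UTF-16 as a surrogate pair, and both decode back to the same code point.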

Your link also states that the current implementation is dead slow because of wrong flags: PCRE_NO_UTF8_CHECK is missing in EMB's code.

You can try to use the library directly, as we did here, and bypass the slow TPerlRegEx class.

The slowdown does not come from the fact that the UTF-8 version of the library is used: the UTF-8 version is as fast as the UTF-16 version. Nor is the UTF-16 to UTF-8 conversion slow by itself; it only adds a slight slowdown. The issue is the missing PCRE_NO_UTF8_CHECK flag...

Arnaud Bouchez
  • Yes but this isn't version 8.30. Plus I cannot use it on Chinese text, can I? – user3060326 Mar 16 '14 at 20:06
  • @user Do you want to use it with special non-Unicode Chinese text? Do you understand the UTF-8 and UTF-16 are complete Unicode encodings? – David Heffernan Mar 16 '14 at 20:34
  • @DavidHeffernan No. I have a Chinese text and I want to use Regex on it. Simple, isn't it? I guess I need another library. – user3060326 Mar 16 '14 at 20:46
  • @DavidHeffernan Is `TPerlRegEx` in Delphi XE5 version 8.30? If not, then it's not Unicode ready. Period. – user3060326 Mar 16 '14 at 20:49
  • You can of course use UTF-8 to process Chinese text, with no issue. Library version 8.30 is Unicode ready, via UTF-8 APIs. AFAIK the library is compiled with the `SUPPORT_UTF8` conditional, so it *does support Unicode* [as stated by official PCRE web site](http://www.regular-expressions.info/pcre.html). I guess you confused `SUPPORT_UTF8` with `PCRE_NO_UTF8_CHECK`. – Arnaud Bouchez Mar 16 '14 at 20:49
  • @ArnaudBouchez Sorry but there is also `SUPPORT_UTF16`. Thankfully there is full `UTF-16` in JclPCRE.. – user3060326 Mar 16 '14 at 20:57
  • @user It looks like you don't know what Unicode is. It is not the same as UTF-16. That is but one of many Unicode encodings. – David Heffernan Mar 16 '14 at 21:01
The solution is to use JclPCRE and statically link PCRE with it.

user3060326