2

I am trying to find data that has accented characters. I've tried this:

select *
from xml_tmp
where regexp_like (XMLTYpe.getClobVal(xml_tmp.xml_data), unistr('\0090'))

And it works. It finds all records where the XML data field contains É. The problem is that it only matches the upper-case E with an accent. I tried to write a more generic query to find ALL data with accented vowels (a, e, i, o, u, upper and lowercase, with any accents) using equivalence classes. I wanted a regex to match only accented vowels, but I'm not sure how to get it, as equivalence classes such as [[=e=]] match all e's (with or without accents).

Also, this does not actually work:

select *
from xml_tmp
where regexp_like (XMLTYpe.getClobVal(xml_data),'É');

(using Oracle 10g)

FrustratedWithFormsDesigner
  • 26,726
  • 31
  • 139
  • 202

2 Answers2

2

How about

SELECT *
  FROM xml_tmp
 WHERE REGEXP_LIKE
        ( REGEXP_REPLACE
          ( XMLTYpe.getClobVal(xml_tmp.xml_data),
            '[aeiouAEIOU]',
            '-'
          )
          '[[=a=][=e=][=i=][=o=][=u=]]'
        )
;

? That will eliminate any unaccented vowels before performing the REGEXP_LIKE.

(It's ugly, I know. But it should work.)

ruakh
  • 175,680
  • 26
  • 273
  • 307
  • I believe this should work (well, it's missing a comma but I fixed it), although it's been running for several minutes now and then when I tried to kill it PL/SQL Developer hung... Some of the XML objects are rather large... :/ – FrustratedWithFormsDesigner Jan 26 '12 at 16:35
2

After some more experimenting, I have found that this seems to work ok:

select *
from xml_tmp
where regexp_like(XMLTYpe.getClobVal(xml_data),'[^[:graph:][:space:]]') 

I had thought that [:graph:] would include all upper and lower case characters, with or without accents, but it seems that it only matches unaccented characters.


Further experimentation shows that this might not work in all cases. Try these queries:

select *
from dual
where regexp_like (unistr('\0090'),'[^[:graph:][:space:]]');
DUMMY
-------
X
(the match succeeded)

So it looks like the character that's been causing me trouble matches this pattern.

select *
from dual
where regexp_like ('É','[^[:graph:][:space:]]');
DUMMY
-------

(the match failed)

When I try to run this query with the accented E as copied-and-pasted, the match fails! I guess whatever I copied-and-pasted is actually different. Ugh, I think I now hate working with changing character encodings.

FrustratedWithFormsDesigner
  • 26,726
  • 31
  • 139
  • 202