DB2 Regular Expression - Fails When Field Contains #x1A (Escaped `SUBSTITUE` character)

Question

I'm trying to do some simple regular expressions against a DB2 v10.5 database. The database contains a lot of records.

The following works:

SELECT 
  e.EMLNAME,
  XMLCAST(XMLQUERY('fn:matches($LNAME, "^[A-Z]*$")' PASSING (e.EMLNAME) as "LNAME") as integer) as foo
FROM MYDB.EMPLOYEE e;

However when I try to compare the regex results above it then goes from running to failing.

Sample query that fails:

-- Fails
SELECT 
  e.EMLNAME
FROM MYDB.EMPLOYEE e
WHERE XMLCAST(XMLQUERY('fn:matches($LNAME, "^[A-Z]*$")' PASSING (e.EMLNAME) as "LNAME") as integer) = 1;

The error reads:

[SELECT - 0 row(s), 0.000 secs] [Error Code: -20377, SQL State: 0N002] An illegal XML character "#x1A" was found in an SQL/XML expression or function argument that begins with string "O NE".. SQLCODE=-20377, SQLSTATE=0N002, DRIVER=4.18.60

Doing some googling, #x1A look to be the SUBSTITUTE character escaped (http://www.fileformat.info/info/unicode/char/1a/index.htm). Seeing that the XMLQUERY portion works, I tried wrapping it with the XMLCAST to do the comparison which I found in another stack overflow post (Regular Expressions in DB2 SQL).

I had a DBA try the query on his box and it worked. His machine is a windows box, mine is a Mac. We are both using DBVisualizer. He had an older version and upgraded to 9.5.2 and it still worked for him. After I upgraded, I'm still having the same issue, but with a new driver number:

[SELECT - 0 rows, 0.000 secs] [Code: -20377, SQL State: 0N002] An illegal XML character "#x1A" was found in an SQL/XML expression or function argument that begins with string "O NE".. SQLCODE=-20377, SQLSTATE=0N002, DRIVER=4.19.49

I downloaded and installed IntelliJ's Data Grip, as per a recommendation of one of my coworkers. Nice tool, however when I run the query against the last name field then it blows up too, with a similar error, which makes me think that the error message is coming from DB2 or the java driver?

[0N002][-20377] An illegal XML character "#x1A" was found in an SQL/XML expression or function argument that begins with string "O NE".. SQLCODE=-20377, SQLSTATE=0N002, DRIVER=4.7.85

In the connection setup, my connection string looks something like:

jdbc:db2://serverdns:50001/MYDB 
type 4
driver: DB2 (LUW)

Targeting a different field in the db allows for me to run the query without an issue... so looks like a possible data issue, but since windows can run this, it makes me wonder if its a driver, or if I need to do some sort of special escaping before running the query?

I need a developer friendly way to run this query on a mac. Specifically using db visualizer or data grip. Intent is to eventually port over into java code, so if its a driver update, or alter to the query, either will work, as long as it works in one of the query tools mentioned and can be ported into java (with additional escaping, etc.)

Digging in to what was reported, I did a query against last names staring with O and having NE in the name. This lead me to one record that looked like it could be a match for the error. I copied the text of the field and pasted it into an online form, which what looked like a space went away. However when I pasted it into sublime, I saw the SUB character holder.

Doing the query above, but excluding that one record allows for the query to run properly. Based on this, it looks like the query on mac needs to know how to run when characters like SUB are present in the field/string.

What can I do to make this work on my mac?

Update

I'm still looking to see if there is a way to do what was requested. However since we haven't found a straight answer yet, I'm posting some things that have been helpful for working around the issue. Note that I didn't find a way to do ranges with version 10.5, so I had to explicitly do a case for each value that could be bad in the x0E - x1F range.

Identifying records with bad data like:

SELECT
  e.ID, e.EMLNAME
FROM MYDB.EMPLOYEE e
WHERE
  -- Last name checks.
  e.EMLNAME like CONCAT(CONCAT('%', x'0E'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'0F'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'10'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'11'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'12'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'13'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'14'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'15'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'16'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'17'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'18'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'19'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'1A'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'1B'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'1C'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'1D'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'1E'), '%')
  OR e.EMLNAME like CONCAT(CONCAT('%', x'1F'), '%');

Removing the bad characters like:

UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'0E', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'0F', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'10', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'11', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'12', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'13', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'14', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'15', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'16', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'17', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'18', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'19', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'1A', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'1B', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'1C', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'1D', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'1E', '');
UPDATE MYDB.EMPLOYEE e SET e.EMLNAME = REPLACE(e.EMLNAME, x'1F', '');

The "failing" queries cannot possibly all result in the same error that you quote. Are you sure you are showing the _exact_ queries? Are you running them against the same data? — mustaccio, Oct 05 '16 at 23:52
Thanks for the comment @mustaccio, I removed the extra examples to focus on the issue at hand. — James Oravec, Oct 06 '16 at 17:27
It appears that one of the values in `EMLNAME` contains a character with the hex code 1A, which is invalid in XML. Why it fails in one case but not the other I cannot explain. Probably if you DBVisualizer from the picture and run queries in the DB2 CLP both will fail. This assuming that you didn't edit the actual statements you run before posting them here. — mustaccio, Oct 06 '16 at 18:05
@mustaccio, thanks for the comment. I added a couple updates and drilled in to find the record that was causing the mayhem. When that one record is excluded the query runs fine, so I think I've narrowed down the issue quite a bit. If you know how to get past the `SUB` character, then you can collect the 100 point bounty I just added. — James Oravec, Oct 10 '16 at 16:23

score 0 · Answer 1 · answered Oct 10 '16 at 21:07

SWAG, based on data being presented as O␚NEAL which presumably was originally intended to be stored as the character data O'NEAL or as presented in the character string literal notation 'O''NEAL' with an escaped apostrophe:

The encoding of the XML is UTF8, but the Mac wants [has been configured for the database access/connection to request ¿per application code page? setting] that data be returned in ISO-8859-1 or CP819. But, the character in the stored data likely was, originally generated in a Windows application; an application that either converted or otherwise caused the standard apostrophe ' 0x27 ASCII 7-bit character to be [effectively] auto-corrected to the ’ 0xE28099 UTF8, 0x2019 UTF16 "right single quotation mark" [or in MS vernacular, a Smart Quote] character or into the 0x92 "Right Single Quote" ASCII 8-bit CP1252.

Thus irrespective of any encoding such as &#146 or &#x92 or &rsquo [or similar; those are HTML encoding examples, so not necessarily examples of XML encoding of special character] having been done to represent that character, the lack of a character translation from that character/glyph into an equivalent character on the Mac [with that limited-set encoding scheme choice] could correctly be the ␚ 0x1A character for something like CP810 [though represented as � 0xEFBFBD UTF8 or 0xFFFD UTF16]. The Win client of course would be happy with either asking for UTF8 or CP1252 [with the latter possibly the default], because both know of that glyph. But the Mac client would be happier if [re]configured to ask for UTF8 instead of what likely is asked as CP819; i.e. as the likely default.

The following doc references may support my suppositions for function, albeit not what data was originally stored in the row that is causing problems:

DB2 for Linux UNIX and Windows 10.5.0->Database fundamentals->Multicultural support->Database settings for multicultural support->Linux and UNIX distributions and code pages->How DB2 derives locales

DB2 for Linux UNIX and Windows 10.5.0->Database fundamentals->pureXML->XML data encoding->Effects of XML encoding and serialization on data conversion->Retrieval with implicit serialization

I read through what you posted and shared two applications I'm trying to use. If you can share how to reconfigured either of them to use UTF-8 as you suggest and it works, then I'll accept your answer. — James Oravec, Oct 17 '16 at 15:06
Would be unable to do much in that regard. I was just hoping by sharing what I thought was a likely origin for the substitution character and why Win vs Mac might *see* things differently, some of the additional tidbits might lead you to a resolution. FWiW I did not respond for the rep per a bounty, just to help solve -- I did provide a SWAG response, whereas I might not have posted anything w/out a bounty, but that's just because I figured the bounty meant the topic was that much more important to you to resolve. — CRPence, Oct 17 '16 at 15:45
Appreciate the info. I didn't see anything in the settings. Been looking at this from several angles. No dice, yet. — James Oravec, Oct 17 '16 at 15:47

score 0 · Answer 2 · answered Oct 17 '16 at 11:24

0

If your data has unparseable characters, remove them before use:

SELECT 
  e.EMLNAME,
  XMLCAST(XMLQUERY('fn:matches($LNAME, "^[A-Z]*$")'
    PASSING (fn:replace(e.EMLNAME, '[#x0E-#x1F]', '') as "LNAME") as integer) as foo
FROM MYDB.EMPLOYEE e

Or the other query:

SELECT 
  e.EMLNAME
FROM MYDB.EMPLOYEE e
WHERE XMLCAST(XMLQUERY('fn:matches($LNAME, "^[A-Z]*$")'
  PASSING (fn:replace(e.EMLNAME, '[#x0E-#x1F]', '') as "LNAME") as integer) = 1;

answered Oct 17 '16 at 11:24

Bohemian

412,405
93
575
722

getting `An unexpected token ":replace" was found following ".-]+$")' PASSING (fn".` for the first. And `An unexpected token ":replace" was found following "]*$")' PASSING (fn". Expected tokens may include: "".. SQLCODE=-104, SQLSTATE=42601, DRIVER=4.7.85` for the 2nd query. – James Oravec Oct 17 '16 at 14:58
I was able to get `fn:replace` to work if I wrap it in a `XMLQUERY()`. I haven't got it to work in the context above though... I'll keep playing with it and let you know if I get it working. If you come up with the solution first, please update your answer. – James Oravec Oct 17 '16 at 15:29
I spent some time trying variations to get this to work, without success. It seems to me like the `fn:replace` requires the `XMLQUERY`, which would blow up by the use of the characters. There is also a `REPLACE` that is different from `fn:replace`. I'm still playing with that to see if I can get it to replace a range of characters, but have had no luck. I believe this is because in order to do the range, there would have to be better built in regex support, which doesn't seem like it exists at this db2 version or below :( – James Oravec Oct 17 '16 at 16:13
@venom I don't have a DB instance handy, but try nesting the calls: `WHERE XMLCAST(XMLQUERY('fn:matches(fn:replace($LNAME, '[#x0E-#x1F]', '') , "^[A-Z]*$")' PASSING (e.EMLNAME) as "LNAME") as integer) = 1` – Bohemian Oct 17 '16 at 17:07
I was trying stuff like that. What you posted needs double single quotes around the regex and replacement string or double quotes instead. Gettting similar errors about the invalid character. This makes me think that we'd need the other `replace` instead of the `fn:replace`. I don't think the `replace` offers the ability to do a range, so think each character would have to be replaced. :( – James Oravec Oct 17 '16 at 17:19
@Venom if you know it's just one particular character that's causing the problem, use standard SQL `replace()`, something like `WHERE XMLCAST(XMLQUERY('fn:matches($LNAME, "^[A-Z]*$")' PASSING (REPLACE(e.EMLNAME, '#x1F', '') as "LNAME") as integer) = 1` – Bohemian Oct 17 '16 at 19:47

DB2 Regular Expression - Fails When Field Contains #x1A (Escaped `SUBSTITUE` character)

2 Answers2