
I am looking to brainstorm my problem here; not sure if this will trigger loads of votes to shut it down or not!

Simplified: I have a system that reads an XML file and loads it into a database.

The XML schema starts with the following declaration:

<?xml version="1.0" encoding="UTF-8"?>

The culprit field has the following schema excerpt:

<xsd:simpleType name="title">
  .....
  <xsd:restriction base="xsd:string">
    <xsd:minLength value="1"/>
    <xsd:maxLength value="2000"/>
  </xsd:restriction>
</xsd:simpleType>

The schema is UTF-8 compliant, so it should support 2000 UTF-8 characters whether they are encoded as one, two, or more bytes each.

The XML schema already does a character length check, as defined in the excerpt above.

The problem is that sometimes the XSD validates successfully, but the database insert fails, crashing the server with a DB error, when certain multi-byte UTF-8 characters occur in the 'title' XML field.

The database 'title' column is defined as `varchar(2000)`

When the database insert operation fails, ops need to manually reduce the length of the XML field and re-process the XML file to fix it.

I have been researching:

  • byte vs character length check
  • schema validation
  • etc

Could the solution be doing a string byte count check which matches the character count?

I can do a `string.getBytes("UTF-8").length` in Java, but how would that match the `<xsd:maxLength value="2000"/>` in the XSD and the `varchar(2000)` column?
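
For illustration, here is a minimal sketch of the kind of check I have in mind, in Java. The `title` value is just a sample, and the 2000 limit mirrors the XSD. Note that `String.length()` counts UTF-16 units, so I use `codePointCount` for the character count as the XSD would see it:

```java
import java.nio.charset.StandardCharsets;

public class TitleLengthCheck {
    private static final int MAX_LENGTH = 2000; // mirrors <xsd:maxLength value="2000"/>

    public static void main(String[] args) {
        String title = "Caf\u00e9 \u20ac\u20ac"; // sample value with multi-byte characters

        // Character count as the XSD sees it: Unicode code points, not UTF-16 units.
        int charCount = title.codePointCount(0, title.length());

        // Byte count as a byte-oriented varchar column would see it.
        int byteCount = title.getBytes(StandardCharsets.UTF_8).length;

        System.out.printf("chars=%d, utf8Bytes=%d%n", charCount, byteCount);

        // The XSD guarantees charCount <= 2000, but byteCount can be up to
        // 4 * charCount, so a byte-based varchar(2000) could still overflow.
        if (byteCount > MAX_LENGTH) {
            System.out.println("Would not fit a byte-semantics varchar(2000)");
        }
    }
}
```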

What would you suggest as the best way to ensure the XML data for the title field does not exceed a specified length, as defined in the XSD? And that the XML data is successfully inserted into the DB as long as the XSD is conformed to?

Am I right in assuming a `<xsd:maxLength value="2000"/>` in the XSD matches the `varchar(2000)` column definition?

Mega
  • Since you experience issues storing strings with multi-byte characters, have you checked the character set and collation of your db field? – lunatikz May 04 '21 at 11:41
  • Hello Lunatikz, the only info I have is the column definition at the moment. I have requested logs etc. from ops. Can I just clarify: multi-byte characters work, but when the length exceeds a certain limit, the DB error occurs. – Mega May 04 '21 at 13:19

1 Answer


The schema is UTF-8 compliant

Not exactly, but I think I know what you mean. The XML declaration that you quoted is not specifying anything about the XML instance documents that match this schema. It is simply saying that the XSD itself (i.e. the XML document with root tag <xs:schema>) uses UTF-8 as its character encoding.

XML Schema never concerns itself with the raw bytes of the XML document; it is the XML infoset that is being validated. So the maxLength facet on the simple type is saying that you can have up to 2000 characters in this field. As you rightly point out, the actual length in bytes could easily exceed 2000: a title of 2000 '€' characters, for example, passes the maxLength check but occupies 6000 bytes in UTF-8. The XML processor will not know or care.

sometimes the XSD validates successfully, but the database insert fails

I agree with lunatikz - the most likely explanation is that the DB is incorrectly configured.

Could the solution be doing a string byte count check which matches the character count?

No, that would be fixing the wrong problem. The problem is probably in the database, not in your Java code.

What would you suggest as the best way to ensure the XML data for the title field does not exceed a specified length, as defined in the XSD?

I don't think you need to do anything to ensure that. Your XML validator is already checking that for you, and it's probably working just fine.

And that the XML data is successfully inserted into the DB as long as the XSD is conformed to?

Configure the DB or its table/column definition so that it stops trying to interpret the input using a single-byte character encoding.
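
For example, if your database happens to be MySQL (I'm only assuming that here; adjust for your actual DBMS), a minimal sketch of that change via JDBC might look like the following. The table name `articles` and the connection details are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FixTitleColumn {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details - substitute your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb?useUnicode=true&characterEncoding=UTF-8",
                "user", "password");
             Statement stmt = conn.createStatement()) {

            // Switch the column to utf8mb4 so that any UTF-8 character
            // (up to 4 bytes, e.g. emoji) can be stored; MySQL's legacy
            // 'utf8' charset only handles up to 3 bytes per character.
            // Back up the table before running this.
            stmt.executeUpdate(
                "ALTER TABLE articles MODIFY title VARCHAR(2000) " +
                "CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
        }
    }
}
```

The key point is that the fix is in the column's character set, not in your Java code: once the column can hold any UTF-8 character, 2000 XSD-validated characters will fit.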

Am I right in assuming a `<xsd:maxLength value="2000"/>` in the XSD matches the `varchar(2000)` column definition?

Yes, both are specifying a field with up to 2000 characters. But the database may interpret the word 'character' differently from the XML processor: MySQL's legacy `utf8` character set, for example, only stores characters that encode to at most 3 bytes, and some databases (such as Oracle with byte-length semantics) measure varchar lengths in bytes rather than characters.

kimbert
  • Wow, thanks Kimbert and Lunatikz. I think the solution is "Configure the DB or its table/column definition so that it stops trying to interpret the input using a single-byte character encoding". If I understand this correctly, you mean that at the moment the DB is configured so that each character inserted per field is expected to be a one-byte character. So, when 2000 characters are inserted into the varchar(2000) field and even one of them is a multi-byte character, it exceeds the field definition and then the error occurs? Right? Any pointers on how to configure bytes per varchar character in the database column? – Mega May 04 '21 at 22:16
  • https://dba.stackexchange.com/questions/8239/how-to-easily-convert-utf8-tables-to-utf8mb4-in-mysql-5-5 What is the best non-invasive approach to altering the specific field from utf8 to utf8mb4 if the database supports utf8mb4? Otherwise, what options would you propose? There are 2 columns involved in the same table; one is 250 and the other 2000 characters long. – Mega May 04 '21 at 23:20
  • In case it helps, I found this (very old) blog post extremely helpful when I was learning about Unicode and character sets: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – kimbert May 05 '21 at 08:57
  • I think you want to check the used collation, too. It should be a `_unicode_ci` collation instead of `_general_ci`. For an already populated table I think there are only invasive approaches, because usually you define character sets when creating the table. Stick with the mentioned solutions, but always create a backup beforehand. – lunatikz May 05 '21 at 09:05
  • Thanks Kimbert for the blog. Thanks Lunatikz for the help as well. I appreciate you all. – Mega May 05 '21 at 12:23