I am brainstorming a problem here, so I hope this question is not too open-ended to stay open!
Simplified: I have a system that reads an XML file and loads it into a database.
The XML file declares UTF-8 encoding:

```xml
<?xml version="1.0" encoding="UTF-8"?>
```
The culprit field has the following schema excerpt:
```xml
<xsd:simpleType name="title">
  ...
  <xsd:restriction base="xsd:string">
    <xsd:minLength value="1"/>
    <xsd:maxLength value="2000"/>
  </xsd:restriction>
</xsd:simpleType>
```
The document is UTF-8 encoded, and `xsd:maxLength` on an `xsd:string` is measured in characters, so the field should accept 2000 characters whether each one is encoded as one byte or several. The schema already enforces this character-length check, as shown in the excerpt above.
The problem: sometimes the XSD validation succeeds, but the database insert fails with a DB error (taking the server down) when the `title` XML field contains certain multi-byte UTF-8 characters.
The database `title` column is defined as `varchar(2000)`. When the insert fails, ops have to manually shorten the field in the XML and re-process the file to fix it.
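To illustrate what I think is happening, here is a small sketch (my assumption: the column's limit is counted in bytes, while the XSD counts characters):

```java
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    public static void main(String[] args) {
        // 2000 copies of 'é' (2 bytes each in UTF-8); String.repeat needs Java 11+
        String title = "é".repeat(2000);

        // Character count: passes <xsd:maxLength value="2000"/>
        System.out.println(title.length());                                // 2000

        // Byte count: overflows a varchar(2000) whose limit is in bytes
        System.out.println(title.getBytes(StandardCharsets.UTF_8).length); // 4000
    }
}
```

So a value can be valid against the schema yet still be twice the size the column allows.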
I have been researching:
- byte vs character length check
- schema validation
- etc
Could the solution be a byte-count check alongside the character-count check? In Java I can do `string.getBytes("UTF-8").length`, but how would that byte count line up with the `<xsd:maxLength value="2000"/>` in the XSD and the `varchar(2000)` column?
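One option I'm considering is a pre-insert guard like the sketch below (the helper name `fitsByteLimit` is hypothetical, and it assumes the column limit really is in bytes):

```java
import java.nio.charset.StandardCharsets;

public class ByteLimitCheck {
    // Hypothetical helper: true if value fits a byte-measured varchar(maxBytes)
    static boolean fitsByteLimit(String value, int maxBytes) {
        return value.getBytes(StandardCharsets.UTF_8).length <= maxBytes;
    }

    public static void main(String[] args) {
        System.out.println(fitsByteLimit("abc", 2000));            // true:  3 bytes
        System.out.println(fitsByteLimit("é".repeat(2000), 2000)); // false: 4000 bytes
    }
}
```

But this feels like it duplicates the XSD check rather than reconciling the two limits.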
What would you suggest as the best way to ensure the `title` field does not exceed the length defined in the XSD, and that any XML which conforms to the XSD can always be inserted into the DB successfully?
Am I right in assuming that `<xsd:maxLength value="2000"/>` in the XSD matches the `varchar(2000)` column definition?