I have two approaches for validating my XML against an XSD that is stored in the resources of my legacy application. Validation runs 1000+ times a day, and the code runs 24x7.

Approach 1: Create a static SchemaFactory

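import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class XmlValidator {
    private static final SchemaFactory schemaFactory;
    
    static {
        try {
            // The JDK's built-in Xerces factory is named explicitly because of a conflict
            // between the Xerces from an external jar and the one shipped with Java.
            SchemaFactory factory = SchemaFactory.newInstance(
                    XMLConstants.W3C_XML_SCHEMA_NS_URI,
                    "com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory",
                    ClassLoader.getSystemClassLoader());
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            schemaFactory = factory;
        } catch (SAXException e) {
            // setProperty declares checked SAX exceptions, which a static initializer cannot propagate
            throw new ExceptionInInitializerError(e);
        }
    }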
    
    public boolean validateXmlWithXsd(String inputXml, String xsd) {
        try (InputStream stream = new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8));
             StringReader reader = new StringReader(inputXml)) {
            
            Source schemaFile = new StreamSource(stream);
            Schema schema = schemaFactory.newSchema(schemaFile);
            
            Validator validator = schema.newValidator();
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            
            Source source = new StreamSource(reader);
            validator.validate(source);
            
            return true; // Validation successful
        } catch (Exception e) {
            // Handle validation errors here
            e.printStackTrace();
            return false; // Validation failed
        }
    }
}

Approach 2: Create the SchemaFactory inside validateXmlWithXsd (same class, but without the static block)

public boolean validateXmlWithXsd(String inputXml, String xsd) {
    try {
        // A new factory is created on every call
        SchemaFactory schemaFactory = SchemaFactory.newInstance(
                XMLConstants.W3C_XML_SCHEMA_NS_URI,
                "com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory",
                ClassLoader.getSystemClassLoader());
        // Restrict external access before compiling the schema so the settings apply to it
        schemaFactory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        schemaFactory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");

        InputStream stream = new ByteArrayInputStream(xsd.getBytes(StandardCharsets.UTF_8));
        Source schemaFile = new StreamSource(stream);
        Schema schema = schemaFactory.newSchema(schemaFile);

        Validator validator = schema.newValidator();
        validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");

        Source source = new StreamSource(new StringReader(inputXml));
        validator.validate(source);
        return true; // Validation successful
    } catch (Exception e) {
        e.printStackTrace();
        return false; // Validation failed
    }
}

My thoughts: Calling SchemaFactory.newInstance with the system class loader repeatedly in a long-running process (1000+ times a day, 24x7) could lead to performance issues, mainly because of class-loading overhead, so I prefer Approach 1. I also think a static SchemaFactory helps with memory:

Single Instance: When I declare a static SchemaFactory, there is only one instance of it, shared across all instances of the class. The SchemaFactory is created only once, and all subsequent calls to the XML validation method use the same factory.

Resource Sharing: The SchemaFactory is relatively heavy to create, and it can be configured with various properties. By making it static, we avoid recreating it each time we need to validate XML. This saves memory and CPU cycles.

Approach 2: If we create the SchemaFactory within the validateXmlWithXsd() method, it will become eligible for garbage collection once the method exits, and the memory occupied by that instance may be freed. So, this approach might not pose significant memory issues.

Can anyone please suggest whether Approach 1 has any disadvantages compared with Approach 2?

likeGreen

2 Answers

It might depend on which implementation of SchemaFactory you are using; since you haven't said, I assume you're probably using the Xerces implementation (of which I have no internal knowledge). However, I would guess that the following applies to all implementations (it certainly applies to the Saxon implementation).

  1. Creating a SchemaFactory is expensive; it typically involves a search of the classpath. You only want to do it once.

  2. Creating a schema is expensive. If you use the same schema repeatedly for validating different documents (as seems likely) then you want to keep the compiled schema objects in a cache.

  3. In an application that runs 24x7, putting anything in static is undesirable; it's much better to use a single-instance service class.

So I would recommend having a single-instance SchemaValidationService class that owns the SchemaFactory; have this class maintain a cache of Schema objects; and synchronise the relevant calls to ensure thread safety.
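
A minimal sketch of that recommendation might look like the following. It assumes the XSD arrives as a string (as in the question) and uses the XSD text itself as the cache key. The method names `isValid`, `schemaFor`, and `cachedSchemaCount` are hypothetical, and the default `SchemaFactory.newInstance` lookup is used for brevity; the three-argument form from the question could be substituted. It relies on the documented JAXP behaviour that `Schema` objects are thread-safe while `SchemaFactory` and `Validator` are not.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SchemaValidationService {
    // Owned by this service instance rather than held in a static field
    private final SchemaFactory schemaFactory;
    // Compiled schemas keyed by XSD text (an assumption made for this sketch)
    private final Map<String, Schema> schemaCache = new ConcurrentHashMap<>();

    public SchemaValidationService() {
        try {
            schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schemaFactory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            schemaFactory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        } catch (SAXException e) {
            throw new IllegalStateException("Could not configure SchemaFactory", e);
        }
    }

    // Returns the cached Schema, compiling it under a lock on first use
    private Schema schemaFor(String xsd) throws SAXException {
        Schema schema = schemaCache.get(xsd);
        if (schema != null) {
            return schema;
        }
        synchronized (schemaFactory) { // SchemaFactory is not thread-safe
            schema = schemaCache.get(xsd);
            if (schema == null) {
                schema = schemaFactory.newSchema(new StreamSource(new StringReader(xsd)));
                schemaCache.put(xsd, schema);
            }
            return schema;
        }
    }

    public boolean isValid(String inputXml, String xsd) {
        try {
            // Validator is not thread-safe, so each call gets its own instance
            Validator validator = schemaFor(xsd).newValidator();
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.validate(new StreamSource(new StringReader(inputXml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    // Exposed only so the caching behaviour can be observed
    public int cachedSchemaCount() {
        return schemaCache.size();
    }
}
```

With this shape, the lock is only contended the first time each schema is compiled; later validations read the cached `Schema` without locking.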

Michael Kay
  • Doesn't this imply that I am using the one provided by Java: SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI, "com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory", ClassLoader.getSystemClassLoader())? – likeGreen Aug 31 '23 at 12:46
  • Yes, I failed to horizontal-scroll your code far enough to see that. – Michael Kay Aug 31 '23 at 14:44

The documentation for SchemaFactory says that it is not thread safe:

The SchemaFactory class is not thread-safe. In other words, it is the application's responsibility to ensure that at most one thread is using a SchemaFactory object at any given moment.

If it’s possible to call validateXmlWithXsd from multiple threads, it must use a synchronized block or other locking mechanism to ensure no concurrent use of schemaFactory. This could create its own performance bottleneck.
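
As a rough sketch of what that locking could look like for Approach 1, the version below locks on the shared factory object itself only while the schema is compiled. A `synchronized` instance method would lock on `this`, which would not protect a static factory shared by several validator instances. The class name `SynchronizedXmlValidator` and the helper `createFactory` are hypothetical, and the default `SchemaFactory.newInstance` lookup stands in for the three-argument call in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SynchronizedXmlValidator {
    // One factory shared by every instance of this class
    private static final SchemaFactory SCHEMA_FACTORY = createFactory();

    private static SchemaFactory createFactory() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            return factory;
        } catch (SAXException e) {
            throw new IllegalStateException("Could not configure SchemaFactory", e);
        }
    }

    public boolean validateXmlWithXsd(String inputXml, String xsd) {
        try {
            Schema schema;
            // Lock on the shared factory, not on this instance
            synchronized (SCHEMA_FACTORY) {
                schema = SCHEMA_FACTORY.newSchema(new StreamSource(new StringReader(xsd)));
            }
            // The Validator is created per call, so validation itself runs outside the lock
            Validator validator = schema.newValidator();
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.validate(new StreamSource(new StringReader(inputXml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

Schema compilation is still serialized across threads here, which is the potential bottleneck; whether it matters in practice would need to be measured.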

Tim Moore
  • Yes, it will be called from multiple threads. So I should either stick with Approach 2 or use a synchronized block in validateXmlWithXsd(). – likeGreen Aug 30 '23 at 21:49
  • Personally, I would default to approach 2, unless there is an identified performance issue and a reliable benchmark test that can prove whether or not a synchronized approach 1 actually improves performance. – Tim Moore Aug 30 '23 at 22:14
  • If you do use synchronized, it’s important not to just declare the whole method synchronized, since that would still allow concurrent use of `schemaFactory` from different instances of `XmlValidator`. It would have to use `synchronized (schemaFactory)` inside the method. – Tim Moore Aug 30 '23 at 22:17