We have been trying to split a huge 7 GB XML file into smaller files, and so far none of the options we have tried looks promising. Let me explain:
The file comes from an external user, so we cannot change it. To load it into the database, it has to be split.
After checking, Informatica has ~4400 ports, meaning there are at least 4400 nodes per item. The file is cut into 11 different files.
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<file>
<fileHeader>This has some information</fileHeader>
<fileBody>
<Item id="1">
<definition>
<id>1</id>
<name>Something</name>
<description>This is a dummy</description>
</definition>
<raw_materials>
<material>
<name>polycarbonate</name>
<description>Something to describe</description>
<cost>24.33</cost>
<units>LB</units>
</material>
<material txt="this" />
<material txt="had to" />
<material txt="be split" />
<material txt="into 3" />
<material txt="different files"/>
</raw_materials>
<specs>
<rating_usa issuer_id="3">A</rating_usa>
<rating_cnd issuer_id="9">10</rating_cnd>
<rating_bra issuer_id="5">24.12</rating_bra>
</specs>
<budget>
<budget_usa>
<amount>465</amount>
<currency>USD</currency>
<usd_vs>1</usd_vs>
</budget_usa>
<budget_cnd>
<amount>30</amount>
<currency>CND</currency>
<usd_vs>1.24</usd_vs>
</budget_cnd>
<budget_bra>
<amount>20</amount>
<currency>BRP</currency>
<usd_vs>17.31</usd_vs>
</budget_bra>
</budget>
<vendor>
<id>1HR24ZA</id>
<vendorName>Vendor</vendorName>
<deliveryRate>9.5</deliveryRate>
<location>
<country>Italy</country>
<address>Lamborghini Str. 245</address>
<phone>1234</phone>
</location>
</vendor>
<taxes>
<tax>
<country>MEX</country>
<federal_pct>16</federal_pct>
<currency>MXN</currency>
<pct_price>5</pct_price>
</tax>
<tax txt="this also"/>
<tax txt="contains too"/>
<tax txt="much nodes"/>
</taxes>
</Item>
<Item id="2">
</Item>
</fileBody>
</file>
Here there are only 6 major tags per item (definition, raw_materials, specs, budget, vendor, taxes), but the real file has 9.
The original mapping is something like this: Source -> Source Qualifier -> Target (XML)
To try to solve the problem, we changed the session settings, but there was no significant improvement. After that, every file was put in its own task inside a workflow, and all the tasks were run in parallel. The final time was the same as the original.
After that, Java was tried. DOM is not an option because it loads the whole file into memory. So SAX and StAX were tried; StAX showed better performance than SAX, so we went in that direction.
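For reference, the appeal of StAX here is its pull model: the loop advances a cursor and reacts only to the events it cares about, which makes skipping unwanted content cheap. A minimal self-contained sketch (the class name and sample XML are invented for illustration; note that `javax.xml.stream` only shipped with the JDK in Java 6, so on JVM 1.4.2 this needs an external StAX implementation such as the reference implementation or Woodstox on the classpath):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class StaxPullDemo {

    // Pulls events from the cursor API and counts only START_ELEMENT events,
    // ignoring everything else without building any in-memory tree.
    static int countStartElements(String xml) throws XMLStreamException {
        XMLInputFactory f = XMLInputFactory.newInstance();
        XMLStreamReader r = f.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) {
                count++;
            }
        }
        r.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        // <file> plus two <Item> elements: 3 start elements in total.
        System.out.println(countStartElements(
                "<file><Item id=\"1\"/><Item id=\"2\"/></file>"));
    }
}
```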
It is worth mentioning that the final files produced by Informatica look something like this:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<file>
<fileHeader>This has some information</fileHeader>
<fileBody>
<Item id="1">
<raw_materials>
<material>
<name>polycarbonate</name>
<description>Something to describe</description>
</material>
<material txt="this" />
<material txt="is hardcore" />
</raw_materials>
</Item>
</fileBody>
</file>
As you can see, you have to check that only specific tags end up in each file. So you end up checking against around 200 tags every time a new tag comes in, and you do it for every file you want to put that tag into:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.stream.events.XMLEvent;

public class XMLCopier implements javax.xml.stream.StreamFilter {
    // Raw types throughout: JVM 1.4.2 has no generics.
    private static boolean allowStream = false;
    private static boolean isWithinValidTag = false;
    // Maps each major tag to the List of child tags that belong in this
    // output file. Populated elsewhere; ~200 entries in the real code.
    private static Map tagMap = new HashMap();
    private static String currentTag = "";

    public static void main(String[] args) throws Exception {
        String filename = "/path/to/xml/xmlInput.xml";
        String fileOutputName = "/path/to/target/finalXML.xml";
        try {
            XMLInputFactory xmlif = XMLInputFactory.newInstance();
            FileInputStream fis = new FileInputStream(filename);
            XMLStreamReader xmlr =
                xmlif.createFilteredReader(xmlif.createXMLStreamReader(fis), new XMLCopier());
            OutputStream outputFile = new FileOutputStream(fileOutputName);
            XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
            XMLStreamWriter xmlWriter = outputFactory.createXMLStreamWriter(outputFile);
            while (xmlr.hasNext()) {
                write(xmlr, xmlWriter);
                xmlr.next();
            }
            write(xmlr, xmlWriter); // copy the final event
            xmlWriter.flush();
            xmlWriter.close();
            xmlr.close();
            outputFile.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Decides, event by event, whether the filtered reader exposes the event.
    public boolean accept(XMLStreamReader reader) {
        int eventType = reader.getEventType();
        if (eventType == XMLEvent.START_ELEMENT) {
            String currentName = reader.getLocalName();
            if (tagMap.containsKey(currentName)) {
                // Entering one of the major tags wanted in this file.
                isWithinValidTag = true;
                currentTag = currentName;
                allowStream = true;
            } else if (isWithinValidTag) {
                // Child of a major tag: keep it only if listed for that tag.
                allowStream = ((List) tagMap.get(currentTag)).contains(currentName);
            }
        } else if (eventType == XMLEvent.END_ELEMENT) {
            if (isWithinValidTag && currentTag.equals(reader.getLocalName())) {
                // Emit the closing major tag, then stop streaming.
                isWithinValidTag = false;
                allowStream = false;
                return true;
            }
        }
        return allowStream;
    }

    // Copies the current event to the writer. Must also handle text and end
    // elements, otherwise the output contains nothing but start tags.
    private static void write(XMLStreamReader xmlr, XMLStreamWriter writer)
            throws XMLStreamException {
        switch (xmlr.getEventType()) {
            case XMLEvent.START_ELEMENT:
                writer.writeStartElement(xmlr.getLocalName());
                for (int i = 0; i < xmlr.getAttributeCount(); i++) {
                    writer.writeAttribute(xmlr.getAttributeLocalName(i),
                                          xmlr.getAttributeValue(i));
                }
                break;
            case XMLEvent.CHARACTERS:
                writer.writeCharacters(xmlr.getText());
                break;
            case XMLEvent.END_ELEMENT:
                writer.writeEndElement();
                break;
        }
    }
}
When we tried to do it in a single class, we ended up with code that was hard to maintain, and it completed around 5 minutes faster than the Informatica process. Then we split the class to run in parallel, but that doesn't look promising either: it ran only 7 minutes faster than the Informatica process, probably because you are performing a search over 200 tags, on 4400 nodes, 11 times.
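One direction that might help with exactly that cost: read the source once instead of 11 times, keep one writer open per target file, and route each major tag's whole subtree to its writer, deciding membership with an O(1) hash lookup instead of a `List.contains` scan over ~200 entries. A minimal sketch under those assumptions (the `OnePassSplitter` class, the tag names, and the in-memory `StringWriter` buffers are illustrative; real code would use 11 `FileOutputStream`s and a map from each of the ~200 tags to its target file):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

public class OnePassSplitter {

    // Routes each major child subtree to its own output in a single pass.
    // Raw Map types keep the style close to JDK 1.4-era code.
    public static Map split(String xml, String[] majorTags) throws XMLStreamException {
        Map buffers = new HashMap();   // tag name -> StringWriter
        Map writers = new HashMap();   // tag name -> XMLStreamWriter
        XMLOutputFactory of = XMLOutputFactory.newInstance();
        for (int i = 0; i < majorTags.length; i++) {
            StringWriter sw = new StringWriter();
            buffers.put(majorTags[i], sw);
            writers.put(majorTags[i], of.createXMLStreamWriter(sw));
        }
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        XMLStreamWriter current = null; // writer for the subtree being copied
        String openTag = null;          // major tag that opened the subtree
        int depth = 0;                  // element depth inside that subtree
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                // O(1) membership test instead of scanning a 200-entry list.
                if (current == null && writers.containsKey(name)) {
                    current = (XMLStreamWriter) writers.get(name);
                    openTag = name;
                    depth = 0;
                }
                if (current != null) {
                    current.writeStartElement(name);
                    for (int i = 0; i < r.getAttributeCount(); i++) {
                        current.writeAttribute(r.getAttributeLocalName(i),
                                               r.getAttributeValue(i));
                    }
                    depth++;
                }
            } else if (ev == XMLStreamConstants.CHARACTERS) {
                // Skip indentation whitespace to keep the outputs compact.
                if (current != null && !r.isWhiteSpace()) {
                    current.writeCharacters(r.getText());
                }
            } else if (ev == XMLStreamConstants.END_ELEMENT) {
                if (current != null) {
                    current.writeEndElement();
                    depth--;
                    if (depth == 0 && r.getLocalName().equals(openTag)) {
                        current.flush(); // subtree done; detach the writer
                        current = null;
                    }
                }
            }
        }
        r.close();
        return buffers;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Item><specs><rating>A</rating></specs>"
                   + "<taxes><tax>16</tax></taxes></Item>";
        Map out = split(xml, new String[] {"specs", "taxes"});
        System.out.println(out.get("specs"));
        System.out.println(out.get("taxes"));
    }
}
```

With this shape the per-event work is one hash lookup, regardless of how many tags or files there are, and the 7 GB file is only traversed once instead of once per output file.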
As you can see, this is not about how to do something; it is about how to do it fast.
Do you have any ideas on how we could improve the file split?
PS: The server has JVM 1.4.2, so we have to stick to that. PS2: Only one item is shown here; the real file contains many.