I have a large XML document (250 MB) in which one of the tags contains another XML document that I need to process.
The problem is that this inner XML is wrapped in CDATA,
and if I try to do a replace/replaceAll:
String xml= fileContent.replace("<![CDATA[", " ");
String replace = xml.replace("]]>", " ");
I get
java.lang.OutOfMemoryError: Java heap space
A simplified example of the structure:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>
<b>
<c>
<![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?><bigXML>]]>
</c>
</b>
</a>
Even using an XML parser like VTD-XML or SAX does not help, because I still need to remove the <![CDATA[ wrapper,
and what is inside it is the largest portion of the file.
Allocating more heap memory is not an option, since the code runs on a machine where I have no control over the JVM.
Any idea how to extract the XML from the c tag and also strip the <![CDATA[ wrapper?
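For reference, here is a minimal sketch of what the SAX route could look like. It assumes the JDK's built-in parser hands CDATA content to characters() in chunks rather than as one huge String, which I have not verified on the target JVM. The class name EmbeddedXmlExtractor and the method extract are made up for illustration; only the c element name comes from the structure above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: copies the CDATA content found inside <c> to a Writer.
class EmbeddedXmlExtractor {

    public static void extract(Reader in, Writer out) throws Exception {
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean insideC;
            private boolean insideCdata;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("c".equals(qName)) insideC = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("c".equals(qName)) insideC = false;
            }

            @Override
            public void startCDATA() {
                insideCdata = true;
            }

            @Override
            public void endCDATA() {
                insideCdata = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Forward each chunk as it arrives; nothing is accumulated in memory here.
                if (insideC && insideCdata) {
                    try {
                        out.write(ch, start, length);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Registering the handler as LexicalHandler enables the startCDATA/endCDATA callbacks.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(in), handler);
    }
}
```

With this shape the parser itself recognises the CDATA section, so the <![CDATA[ and ]]> markers never have to be removed by string replacement, and the whitespace around the section inside c is skipped.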
UPDATE
I tried making the modification using streams, as discussed below, but I am still getting an OutOfMemoryError.
Any idea how to improve the code to avoid the error?
private void readUpdateAndWrite(
        Reader reader,
        String absolutePath
) {
    // Write the content in file
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(absolutePath))) {
        // Read the content from file
        try (BufferedReader bufferedReader = new BufferedReader(reader)) {
            String line = bufferedReader.readLine();
            while (line != null) {
                String replace = line
                        .replace("<![CDATA[", " ")
                        .replace("]]>", " ");
                bufferedWriter.write(replace);
                line = bufferedReader.readLine();
            }
        } catch (IOException e) {
            logger.error("Error writing in file. Caused by {}", getStackTrace(e));
        }
    } catch (IOException e) {
        logger.error("Error reading in file. Caused by {}", getStackTrace(e));
    }
}
I found my problem. The content of <![CDATA[ is a single line of 256 MB, so I cannot do any replace on that line without getting an OutOfMemoryError.
How can I break a 256 MB String into multiple lines? I tried to create another InputStream
over the massive String, but it is not working.
I guess this is because it is an embedded XML document and cannot be split across lines.
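One direction that avoids lines altogether is to read fixed-size character chunks and drop the markers as they stream past, so the 256 MB line is never held as one String. Below is a rough sketch of that idea, not a tested solution for my real file. The class name CdataMarkerStripper, the method strip, and the 8192-character buffer size are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper: streams text from a Reader to a Writer, replacing CDATA markers.
class CdataMarkerStripper {

    private static final String[] MARKERS = {"<![CDATA[", "]]>"};

    /** Copies in to out chunk by chunk, writing replacement in place of each marker. */
    public static void strip(Reader in, Writer out, String replacement) throws IOException {
        char[] buffer = new char[8192];
        // Holds only characters that may still turn out to be the start of a marker,
        // so it never grows beyond the longest marker.
        StringBuilder pending = new StringBuilder();
        int read;
        while ((read = in.read(buffer)) != -1) {
            for (int i = 0; i < read; i++) {
                pending.append(buffer[i]);
                settle(pending, out, replacement);
            }
        }
        // An unfinished marker at end of input is ordinary text.
        out.append(pending);
    }

    private static void settle(StringBuilder pending, Writer out, String replacement)
            throws IOException {
        while (pending.length() > 0) {
            boolean couldStillMatch = false;
            for (String marker : MARKERS) {
                if (startsWith(marker, pending)) {
                    if (marker.length() == pending.length()) {
                        // Complete marker: emit the replacement instead.
                        out.write(replacement);
                        pending.setLength(0);
                        return;
                    }
                    couldStillMatch = true;
                }
            }
            if (couldStillMatch) {
                return; // wait for more input
            }
            // The first pending character cannot start a marker: pass it through.
            out.write(pending.charAt(0));
            pending.deleteCharAt(0);
        }
    }

    private static boolean startsWith(String marker, CharSequence prefix) {
        if (prefix.length() > marker.length()) {
            return false;
        }
        for (int i = 0; i < prefix.length(); i++) {
            if (marker.charAt(i) != prefix.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

Called as CdataMarkerStripper.strip(reader, bufferedWriter, " "), it would mirror the two replace calls in my code above, while memory use stays bounded by the buffer size regardless of how long the line is. Because the check is done per character, a marker split across two chunks is still recognised.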