JAVA how to find and delete the structure of sentences?

Question

I have an XML file with the following structure:

 <?xml version="1.0" encoding="MS949"?> 
 <pmd-cpd>
    <duplication lines="123" tokens="123">
        <file line="1" path=".."/>
        <file line="1" path=".."/>
        <codefragment><![CDATA[........]]></codefragment>
    </duplication>
    <duplication>
    ...
    </duplication>
 </pmd-cpd>

I want to delete the 'codefragment' node, because my parser fails with the error 'invalid XML character (0x1)'.

My parsing code looks like this:

private void parseXML(File f) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = null;
    Document document = null;
    try {
        builder = factory.newDocumentBuilder();
        document = builder.parse(f);
    } catch (...)

The error happens at document = builder.parse(f); so I cannot use the parser to delete the codefragment node.

That is why I want to delete these lines without the parser.

How can I delete this node without using the parser?

Have you tried this? http://stackoverflow.com/questions/8489151/how-to-parse-xml-for-cdata — guillaume girod-vitouchkina, Nov 27 '15 at 08:38
Thanks, but that does not fit my problem. To use that I would have to parse my file, but in my case I can't even parse it because of the invalid character. — cointreau, Nov 27 '15 at 08:41
Then it's not valid XML. Try using a regex first to delete the [CDATA ...] section. Not totally safe. — guillaume girod-vitouchkina, Nov 27 '15 at 08:45
If the XML file contains bad characters, you should fix whatever program created the file. Valid character data is defined by the [XML Specification](http://www.w3.org/TR/REC-xml/#NT-Char) to be `#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]`, which means that `#x1` (aka `0x1`) is **not** valid. — Andreas, Nov 27 '15 at 08:53
It seems to be more of an encoding problem. The character that is causing the problem is not supported by the encoding="MS949". You could try a different encoding such as encoding="UTF-8" — Mike Murphy, Nov 27 '15 at 08:55
@MikeMurphy `0x1` *is* defined by MS949. It is however not allowed by XML. — Andreas, Nov 27 '15 at 08:59
@MikeMurphy Sorry for the late reply, but I was able to solve the problem. The solution was to delete the invalid characters and then parse the result. — cointreau, Nov 30 '15 at 03:06
@Andreas Yes, my XML file had bad characters, but I could not fix the program that created the file, because it's not my program and I don't even have the source (it was PMD, the static analyzer). Thank you for your help! — cointreau, Nov 30 '15 at 03:08
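
The character ranges quoted from the XML specification in the comments above translate directly into a code-point check. What follows is a minimal sketch, assuming Java 8 or later; the class name `XmlChars` and its two methods are hypothetical helpers made up for illustration, not part of any library.

```java
/** Sketch: tests code points against the XML 1.0 Char production quoted above. */
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point is allowed in an XML 1.0 document. */
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the text with every disallowed code point dropped. */
    static String strip(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() joins valid surrogate pairs; a lone surrogate shows up
        // as a value in 0xD800-0xDFFF, which isValid() rejects.
        text.codePoints().filter(XmlChars::isValid).forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Because the check works on code points rather than `char` values, characters above U+FFFF survive, while `0x1` and lone surrogates are dropped.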

score 1 · Accepted Answer · answered Nov 30 '15 at 03:48

This is a followup answer to OP's self-answer, and the comment I made to that answer. Here's the recap, plus some extra:

Never do String += String in a loop. Use StringBuilder.
Read the XML in blocks, not lines.
Don't use String.replaceAll(). It has to recompile the regex every time, a regex you already have. Use Matcher.replaceAll().
Remember to close() the Reader. Better yet, use try-with-resources.
No need to save the clean XML back out, just use it directly.
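Since XML is usually in UTF-8, read the file as UTF-8. (If the file is really in another charset, such as the MS949 declared in the question's sample, pass that charset instead.)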
Don't print and ignore errors. Let caller handle errors.

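import java.io.*;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

private static void parseXML(File f) throws IOException, ParserConfigurationException, SAXException {
    StringBuilder xml = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(f),
                                                                      StandardCharsets.UTF_8))) {
        // Compiled once; matches runs of characters that XML 1.0 does not allow.
        // Note: characters above U+FFFF also match this class and are removed.
        Pattern badChars = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]+");
        char[] cbuf = new char[1024];
        // Read in blocks, clean each block, and collect the result.
        for (int len; (len = in.read(cbuf)) != -1; )
            xml.append(badChars.matcher(CharBuffer.wrap(cbuf, 0, len)).replaceAll(""));
    }
    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder domBuilder = domFactory.newDocumentBuilder();
    // Parse the cleaned text directly from memory.
    Document document = domBuilder.parse(new InputSource(new StringReader(xml.toString())));
    // insert code using DOM here
}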
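
If holding the whole cleaned text in memory is a concern, the same filtering can be done on the fly while the parser reads. The following is only a sketch of that idea, not the code from this answer: `XmlCleaningReader` and `CleanXmlParser` are hypothetical names, and the reader is assumed to wrap an already buffered `Reader`.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical reader that silently drops characters XML 1.0 does not allow. */
final class XmlCleaningReader extends FilterReader {
    private int pendingLow = -1; // low surrogate held back after its high surrogate

    XmlCleaningReader(Reader in) { super(in); }

    private static boolean isValidBmp(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        if (pendingLow != -1) { int low = pendingLow; pendingLow = -1; return low; }
        int c = in.read();
        while (c != -1) {
            if (Character.isHighSurrogate((char) c)) {
                int next = in.read();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    pendingLow = next;   // complete pair: a valid supplementary character
                    return c;
                }
                c = next;                // lone high surrogate: drop it, re-check next
                continue;
            }
            if (isValidBmp((char) c)) return c;
            c = in.read();               // invalid character (e.g. 0x1): skip it
        }
        return -1;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) break;
            cbuf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    @Override
    public boolean markSupported() { return false; }
}

/** Hypothetical helper: parses XML from a reader, cleaning it while reading. */
final class CleanXmlParser {
    static Document parse(Reader raw)
            throws IOException, ParserConfigurationException, SAXException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new XmlCleaningReader(raw)));
    }
}
```

A caller would pass in something like the `BufferedReader` built in the code above, created with whichever charset the file is really written in. Since the parser is handed characters rather than bytes, the encoding named in the XML declaration is not used for decoding.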

Thank you! This is much faster. — cointreau, Nov 30 '15 at 10:12
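
Once the cleaned document parses, the `codefragment` nodes from the question can be removed through the DOM in the usual way. This is a minimal sketch of that step; `CodeFragmentRemover` and `removeAll` are hypothetical names, and only standard `org.w3c.dom` calls are used.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: removes every element with the given tag name. */
final class CodeFragmentRemover {
    private CodeFragmentRemover() {}

    /** Removes all matching elements and returns how many were removed. */
    static int removeAll(Document document, String tagName) {
        NodeList nodes = document.getElementsByTagName(tagName);
        int removed = 0;
        // The NodeList is live, so walk it backwards while removing.
        for (int i = nodes.getLength() - 1; i >= 0; i--) {
            Node node = nodes.item(i);
            node.getParentNode().removeChild(node);
            removed++;
        }
        return removed;
    }
}
```

It would be called as `CodeFragmentRemover.removeAll(document, "codefragment")` at the point where the accepted answer says to insert code using the DOM.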

score 0 · Answer 2 · answered Nov 30 '15 at 03:14

I solved this problem by removing the bad characters such as 0x01, saving the result as a new XML file, and then parsing the new file.

Because I could not even parse my old XML file, I could not remove the node with the parser.

The code for removing the invalid characters and saving the result as a new file looked like this:

// parse the cleaned XML string into a DOM document
public static Document stringToDom(String xmlSource) 
        throws SAXException, ParserConfigurationException, IOException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(xmlSource)));
}

//get the file and remove bad characters in it
private static void cleanString(File fileName) {
    try {
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String xmlLines, cleanXMLString="";
        Pattern p = null;
        Matcher m = null;

        p = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
        while (((xmlLines = in.readLine()) != null)){
            m = p.matcher(xmlLines);
            if (m.find()){
                cleanXMLString = cleanXMLString + xmlLines.replaceAll("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]", "")+"\n";
            }else
                cleanXMLString = cleanXMLString + xmlLines+"\n";
        }

        Document doc = stringToDom(cleanXMLString);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(doc);
        StreamResult result =  new StreamResult(new File("\\new\\"+fileName.getName()));
        transformer.transform(source, result);

    } catch (IOException | SAXException | ParserConfigurationException | TransformerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

This may not be a good method, since it takes quite a long time even for a small file (under 5 MB).

But if your file is small, you can try this.

Never do `String += String` in a loop. Use `cleanXml = new StringBuilder()`. It's faster. --- Since you don't care about lines, read in blocks, as in `readLen = in.read(cbuf)`. It's faster. --- Don't use that `replaceAll()`. It has to recompile the regex every time, a regex you already have. Just use `cleanXml.append(p.matcher(new String(cbuf, 0, readLen)).replaceAll(""))`. It's faster. --- Remember to `in.close()` the reader. Better yet, use try-with-resources. --- No need to DOM load the XML. Just use a `StreamSource` from the String. — Andreas, Nov 30 '15 at 03:22
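
The last point in the comment above, feeding the cleaned string to the transformer through a `StreamSource` instead of building a DOM first, might look roughly like this. It is a sketch only: `CleanXmlWriter` and `write` are hypothetical names, and the identity transformer comes from the standard `javax.xml.transform` API.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: writes already cleaned XML text to a file without a DOM. */
final class CleanXmlWriter {
    private CleanXmlWriter() {}

    static void write(String cleanXml, File target) throws TransformerException {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new StreamSource(new StringReader(cleanXml)),
                           new StreamResult(target));
    }
}
```

The transformer still parses the text as it copies it, so malformed XML is reported as a `TransformerException` rather than written out silently.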
