
Can anyone advise which direction I should take to segment plain text from a text file while converting it to an XML file, so that the segmentation matches what was in the original XML? I am converting a text file into XML with JAXP + SAX, turning text like this:

 Hello world. I am happy to see you today. 

into this xml:

 <trans-unit id="1">
            <target> Hello world</target>
        </trans-unit>
        <trans-unit id="2">
            <target> I am happy to see you today</target>
        </trans-unit>

But suppose the source XML content has, for example, 3 sentences in id="1":

<trans-unit id="1">
            <source> Hello world. Sunny smile. Wake up early.</source>
        </trans-unit>
        <trans-unit id="2">
            <source> I am happy to see you today</source>
        </trans-unit>

and when I parse the text from this XML I get plain text:

Hello world. Sunny smile. Wake up early.I am happy to see you today.

How can I segment this text while converting it into XML, so that the first unit of the target XML file has 3 sentences again? Like:

<trans-unit id="1">
            <target> Hello world. Sunny smile. Wake up early.</target>
        </trans-unit>
        <trans-unit id="2">
            <target> I am happy to see you today</target>
        </trans-unit>

This is the txt -> XML conversion:

public void doit() {
    try {

        in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), "UTF8"));
        out = new StreamResult(selectedDir);
        initXML();
        String str;
        while ((str = in.readLine()) != null) {

        elements = str.split("\n|((?<!\\d)\\.(?!\\d))");
        for (i = 0; i < elements.length; i++)
            process(str);

         }
        in.close();
        closeXML();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public void initXML() throws ParserConfigurationException,SAXException, UnsupportedEncodingException, FileNotFoundException, TransformerException {
    // JAXP + SAX
    SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    th = tf.newTransformerHandler();
    Transformer serializer = th.getTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    // XML output
    serializer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    th.setResult(out);
    th.startDocument();
    atts = new AttributesImpl();
    atts1 = new AttributesImpl();
    atts1.addAttribute("", "", "xlmns","CDATA", "urn:oasis:names:tc:xliff:document:1.2");    
    th.startElement("", "", "xliff", atts1);
    th.startElement("", "", "file",null);
    th.startElement("", "", "body", null);


}

public void process(String s) throws SAXException {
  try {

        atts.clear();
        k++;
        atts.addAttribute("", "", "id", "", "" + k);
        th.startElement("", "", "trans-unit", atts);
        th.startElement("", "", "target", null);
        th.characters(elements[i].toCharArray(), 0, elements[i].length());
        th.endElement("", "", "target");
        th.endElement("", "", "trans-unit");
     }
 catch (Exception e) {
        System.out.print("Out of bounds!");
    }
}
public void closeXML() throws SAXException {
    th.endElement("", "", "body");
    th.endElement("", "", "file");
    th.endElement("", "", "xliff");
    th.endDocument();
}
user2994149
  • could you post the code that you have tried and where you are having difficulty – Romski Feb 02 '14 at 23:19
  • I only do the conversion from text to XML, and I don't know what to do to solve the problem I have; I can simply post that conversion – user2994149 Feb 02 '14 at 23:26
  • Until you can show what you have tried and where you are stuck, it's difficult to help without just doing the whole thing for you. – Romski Feb 02 '14 at 23:28
  • Of course you don't have to solve this for me, I just want some advice on which direction I should go to do what I want – user2994149 Feb 02 '14 at 23:33
  • And I just want to see what you tried so that I can provide it. Have you written any code? It looks like you're doing a 2-way conversion text > xml > text, is that correct? You'll need to tokenise your text and parse your XML. Both capabilities are in the standard JDK. – Romski Feb 02 '14 at 23:35
  • Yes, I first convert from XML to text and then from text to XML – user2994149 Feb 02 '14 at 23:39

1 Answer


It looks like you mean something like:

String[] segs = elements[i].trim().split("[.!?]\\s+");
for (String seg : segs) {
    atts.clear();
    k++;
    atts.addAttribute("", "", "id", "", "" + k);
    th.startElement("", "", "trans-unit", atts);
    th.startElement("", "", "target", null);
    th.characters(seg.toCharArray(), 0, seg.length());
    th.endElement("", "", "target");
    th.endElement("", "", "trans-unit");
}

This splits each line into segments at a sentence-ending symbol followed by at least some whitespace.
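
As a small, hedged illustration of that split: the pattern `[.!?]\\s+` consumes the punctuation mark as part of the delimiter, while a lookbehind variant keeps it on the sentence. The class and method names below are made up for this sketch and are not part of the code in the question.

```java
import java.util.Arrays;

class SentenceSplitDemo {
    // Splits at whitespace that follows '.', '!' or '?'; the lookbehind keeps the punctuation.
    static String[] splitKeepingPunctuation(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    // Same split points, but the punctuation mark is consumed together with the whitespace.
    static String[] splitDroppingPunctuation(String text) {
        return text.trim().split("[.!?]\\s+");
    }

    public static void main(String[] args) {
        String line = " Hello world. Sunny smile. Wake up early.";
        // prints [Hello world., Sunny smile., Wake up early.]
        System.out.println(Arrays.toString(splitKeepingPunctuation(line)));
        // prints [Hello world, Sunny smile, Wake up early.]
        System.out.println(Arrays.toString(splitDroppingPunctuation(line)));
    }
}
```

Neither pattern splits "early.I am" where no whitespace follows the full stop, so the text file needs a space or newline between units for this to work.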


After your comment, a new approach: somehow you need to convert the source XML directly into the target XML. This can be done in a really simple and crude way:

    boolean inSource = false;
    StringBuilder source = null;
    String str;
    while ((str = in.readLine()) != null) {
        if (!inSource) {
            int pos = str.indexOf("<source>");
            if (pos != -1) {
                pos += "<source>".length();
                str = str.substring(pos); // keep only the text after <source>
                inSource = true;
                source = new StringBuilder();
            }
        }
        if (inSource) {
            int pos = str.indexOf("</source>");
            if (pos == -1) {
                pos = str.length(); // the element continues on the next line
            } else {
                inSource = false;
            }
            source.append(str.substring(0, pos));
            if (!inSource) {
                process(source.toString().trim());
                source = null;
            } else {
                source.append(' '); // a line break inside <source> becomes a space
            }
        }
    }
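
A less crude alternative, sketched here under assumptions: since the question already uses JAXP + SAX, the `<source>` texts could also be collected with the JDK's SAX parser instead of searching for the tags by hand. The class name `SourceReader` and the method `readSources` are hypothetical, and the sketch assumes the default parser, which is not namespace-aware, so the element name arrives in `qName`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SourceReader {
    // Returns the trimmed text of every <source> element, in document order.
    static List<String> readSources(InputSource xml) throws Exception {
        final List<String> sources = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <source>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("source".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("source".equals(qName)) {
                    sources.add(text.toString().trim());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return sources;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<body>"
                + "<trans-unit id=\"1\"><source> Hello world. Sunny smile. Wake up early.</source></trans-unit>"
                + "<trans-unit id=\"2\"><source> I am happy to see you today</source></trans-unit>"
                + "</body>";
        for (String s : readSources(new InputSource(new StringReader(xml)))) {
            System.out.println(s);
        }
    }
}
```

Unlike the string search, this also copes with attributes on `<source>` and with several elements on one line.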

Third attempt: In Java 7.

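import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Reads the text of every <source> element; the parsing itself is not shown here.
List<String> readSourcesFromXML(Path sourceXML) throws IOException {
    throw new UnsupportedOperationException("not shown");
}

String[] segments(String source) {
    // Split at whitespace that follows a sentence-ending symbol.
    return source.split("(?<=[.!?])\\s+"); // Or so
}

// Expects one translated segment per line of the text file.
List<String> readTranslatedSegments(Path txt) throws IOException {
    return Files.readAllLines(txt, StandardCharsets.UTF_8);
}

void writeTargetsToXML(Path targetXML, Path txt, Path sourceXML) throws IOException {
    List<String> sources = readSourcesFromXML(sourceXML);
    List<String> translatedSegments = readTranslatedSegments(txt);

    List<String> targets = new ArrayList<>(sources.size());
    int segmentIndex = 0;
    for (String source : sources) {
        String target = "";
        // Take as many translated segments as this source has sentences.
        int segmentsPerSource = segments(source).length;
        while (segmentsPerSource > 0) {
            --segmentsPerSource;
            if (!target.isEmpty()) {
                target += " ";
            }
            target += translatedSegments.get(segmentIndex);
            ++segmentIndex;
        }
        targets.add(target);
    }

    // Overload that writes the finished target texts; not shown here.
    writeTargetsToXML(targetXML, targets);
}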
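
The regrouping step can be tried in isolation with in-memory lists. This is only a sketch under two assumptions: the translation keeps the same number of sentences as the source, and the translated text holds one sentence per list entry. `RegroupDemo`, `countSegments` and `regroup` are illustrative names, and the Spanish strings are sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RegroupDemo {
    // Number of sentences in one source unit (split at whitespace after . ! or ?).
    static int countSegments(String source) {
        return source.trim().split("(?<=[.!?])\\s+").length;
    }

    // Builds one target per source by taking as many translated segments
    // as the source has sentences.
    static List<String> regroup(List<String> sources, List<String> translatedSegments) {
        List<String> targets = new ArrayList<>(sources.size());
        int next = 0;
        for (String source : sources) {
            StringBuilder target = new StringBuilder();
            int n = countSegments(source);
            // Stop early if the translation has fewer segments than expected.
            for (int j = 0; j < n && next < translatedSegments.size(); j++, next++) {
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(translatedSegments.get(next));
            }
            targets.add(target.toString());
        }
        return targets;
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList(
                "Hello world. Sunny smile. Wake up early.",
                "I am happy to see you today");
        List<String> translated = Arrays.asList(
                "Hola mundo.", "Sonrisa soleada.", "Despierta temprano.",
                "Estoy feliz de verte hoy");
        // prints [Hola mundo. Sonrisa soleada. Despierta temprano., Estoy feliz de verte hoy]
        System.out.println(regroup(sources, translated));
    }
}
```

If the translator merges or splits sentences, the counts drift apart and every following unit is misaligned, which is why writing one unit per line in the text file is the more robust option.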
Joop Eggen
  • Thank you for your answer, but I didn't mean that. I mean breaking the text content down into the same number of sentences as in the source XML file. How can I get the same number of sentences in the target XML file when converting the plain text file? – user2994149 Feb 07 '14 at 17:07
  • Sorry, I am not understanding the problem. As there is still no other answer, I added something that is hopefully more useful. Would it not have been easier, when writing the .txt, to do a println (with a newline) to separate the units? – Joop Eggen Feb 07 '14 at 21:16
  • Thank you very much for your effort. I will try to explain again what I want to do. For example, I have an XML file (with id attributes, as you can see) in English, and I have a GUI app that parses the text from the elements in this XML and saves it as a text file. Then I translate this text file into Spanish, for example, and convert the text file back into an XML file with the GUI app. I need the correlation between the elements of the first XML file and the target XML file. So, for example, there were 4 sentences in a source element, and when I parse it to text I have only text without the source tag – user2994149 Feb 07 '14 at 23:41
  • and I need to transfer the translated text file back into XML, but how can I make the target element in the XML file contain exactly 4 sentences, as I explained before? That means that when I compare the two files, source XML and target XML, I can see that there is the same number of sentences in both the source and the target elements. I don't know how to do this partitioning and assignment of the text. – user2994149 Feb 07 '14 at 23:47
  • When generating the target, read the source again; for every source, elements.length (the number of segments) is the number of Spanish lines to read. – Joop Eggen Feb 07 '14 at 23:48