0

I have extracted Cyrillic content from an HTML page into a text file. The Cyrillic text displays correctly in this file. I then use this file to create an RDF file with Jena. Here is my code:

private void createRDFFile(String webContentFilePath) throws IOException {
    Model model = ModelFactory.createDefaultModel();

    RDFWriter writer = model.getWriter("RDF/XML");
    writer.setProperty("showXmlDeclaration", "true");
    writer.setProperty("showDoctypeDeclaration", "true");
    writer.setProperty("tab", "8");
    Writer out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(rdfFilePath), "UTF8"));
    Resource resDest = null;
    Property hasTimeStart = model.createProperty(ns + "#hasTimeStart");
    Property distrName = model.createProperty(ns + "#distrName");
    Property moneyOneDir = model.createProperty(ns + "#moneyOneDir");
    Property moneyTwoDir = model.createProperty(ns + "#moneyTwoDir");
    Property hasTimeStop = model.createProperty(ns + "#hasTimeStop");

    BufferedReader br = new BufferedReader(new FileReader(
            webContentFilePath));
    String line = "";
    while ((line = br.readLine()) != null) {
        String[] arrayLine = line.split("\\|");
        resDest = model.createResource(ns + arrayLine[5]);
        resDest.addProperty(hasTimeStart, arrayLine[0]);
        resDest.addProperty(distrName, arrayLine[1]);
        resDest.addProperty(moneyOneDir, arrayLine[2]);
        resDest.addProperty(moneyTwoDir, arrayLine[3]);
        resDest.addProperty(hasTimeStop, arrayLine[4]);
    }
    br.close();
    model.write(System.out, "RDF/XML");
    writer.write(model, out, null);

}

When I open the RDF file the Cyrillic is like РўР РђРќРЎРљРћРџ-Р‘Р?ТОЛА. Could somebody help me?

Joshua Taylor
vikifor

2 Answers

2

The UTF-8 write encoding on the output writer looks correct, so that suggests that you're not reading webContentFilePath with the correct encoding. As a diagnostic, you could try just reading that file in and then writing it out to a plain UTF-8 file (no RDF). My guess is that you will have to be explicit about setting the file encoding on br, or ensure that the scraped web page is written out in UTF-8 to begin with.
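The diagnostic described above could be sketched roughly as follows. This is only a minimal sketch: the class name `EncodingCheck`, the method `copyAsUtf8`, and the file names in `main` are made up for illustration, and it assumes the scraped file is meant to be UTF-8. It reads the file with an explicitly named charset and writes it back out as plain UTF-8, with no RDF involved.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class EncodingCheck {
    // Reads `in` using the given charset and rewrites it line by line to `out` as UTF-8.
    static void copyAsUtf8(Path in, Charset inCharset, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; substitute the real input and a scratch output file.
        copyAsUtf8(Paths.get("webContent.txt"), StandardCharsets.UTF_8,
                   Paths.get("check-utf8.txt"));
    }
}
```

If the copy looks right in a UTF-8 viewer, the input charset guess was probably correct; if the read fails with a decoding exception or the copy is garbled, the input is likely in some other encoding.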

Ian Dickinson
  • I was wrong. First I forgot the UTF-8 encoding setting in the OutputStreamWriter, and then I didn't reload the file in the text editor I was using to open it. Now it looks OK in the text editor, but in Eclipse I still see these strange characters. – vikifor Jul 28 '13 at 10:50
  • So your file is OK, except when you open it in Eclipse? You should set the default encoding for Eclipse to UTF-8, see http://www.eclipse.org/forums/index.php/t/29511/ for some suggestions, or other StackOverflow questions on a similar topic. – Ian Dickinson Jul 28 '13 at 11:39
1

It could be that the output is correct, but you're not seeing it correctly.

new FileReader(...) opens the file with the platform-default character set. On Windows this is typically not UTF-8, so if the input looks right, you may be viewing it in something other than UTF-8.
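One way to avoid depending on the platform default is to name the charset when opening the reader. The following is a sketch only: `ExplicitCharsetReader` and `openUtf8` are hypothetical names, and it assumes the input file really is UTF-8.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetReader {
    // Opens the file as UTF-8 regardless of the platform-default charset.
    static BufferedReader openUtf8(String path) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Shows which charset a bare FileReader would have used on this machine.
        System.out.println("Platform default charset: " + Charset.defaultCharset());
        if (args.length > 0) {
            try (BufferedReader br = openUtf8(args[0])) {
                System.out.println(br.readLine());
            }
        }
    }
}
```

In the question's code, a reader opened this way would take the place of `new BufferedReader(new FileReader(webContentFilePath))`.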

Jena writes UTF-8 by default, and it does so in this case as well.

So once you have written the file, you cannot view it the same way you viewed the input. You need to view it with a UTF-8-aware viewer.
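To illustrate what a viewer that is not UTF-8 aware does, here is a small sketch that decodes correct UTF-8 bytes with the wrong charset. The names `MojibakeDemo` and `misread` are invented for this example, and ISO-8859-1 merely stands in for whatever charset the viewer actually used, so the garbled output will not match the question's exactly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with a different charset,
    // which is what a viewer with the wrong encoding setting effectively does.
    static String misread(String text, Charset wrongCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrongCharset);
    }

    public static void main(String[] args) {
        // Unicode escapes for a short Cyrillic word, to keep this source file ASCII-only.
        String original = "\u0422\u0420\u0410\u041d\u0421\u041a\u041e\u041f";
        String garbled = misread(original, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);

        // The bytes on disk were fine all along: decoding them as UTF-8 recovers the text.
        String recovered = new String(
                garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(original.equals(recovered));
    }
}
```

The point of the sketch is that the garbling happens at viewing time; the file contents themselves are unchanged.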

AndyS