I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz

Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm

For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.

Now, I see 2 options to read the treebank correctly:

  1. Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
  2. First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.

Is it possible to do one of the 2 options (or both) in MATLAB?

Phil K
  • Did you try [xmlread](http://de.mathworks.com/help/matlab/ref/xmlread.html)? For me it automatically unescapes those characters. – swenzel Oct 03 '15 at 18:52
  • You can download the treebank I'm trying to analyze here: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz I did try 'xmlread'. Sadly, the xml-file is too huge to be opened with that function, so I'm using 'fopen' instead. – Phil K Oct 04 '15 at 00:11

1 Answer

A non-MATLAB solution is to preprocess the file with an external utility. For instance, with Ruby installed, you could use the htmlentities gem to unescape all the special characters.

sudo gem install htmlentities

Let file.xml be the input file, which should contain only ASCII characters (the Polish letters being escaped as numeric codes). The Ruby code to convert the file could look like this:

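#!/usr/bin/env ruby

require 'htmlentities'

# Read the escaped XML, decode the entities, and write the decoded text.
xml = File.read("file.xml")
converted_xml = HTMLEntities.new.decode(xml)
# Write the decoded string, not the original input.
File.write("decoded_file.xml", converted_xml)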

(To run the script directly, remember to chmod +x it first to make it executable.) Or, more compactly, as a one-liner:

 ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"

You could then postprocess the decoded XML however you wish.
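
If installing a gem is not an option, or the file is too large to read in one go, the same idea can be sketched with the Ruby standard library alone. A general-purpose HTML decoder may also turn escaped markup characters such as `&amp;` or `&#60;` into literal `&` and `<`, which could leave the XML malformed, so this sketch only decodes numeric character references and skips the XML-significant ones. The names `XML_SPECIAL`, `decode_numeric_refs` and `decode_file` are made up for illustration, and the code assumes the input is ASCII or UTF-8 and that every reference is a valid code point.

```ruby
# Code points of & < > " ' which must stay escaped to keep the XML well-formed.
XML_SPECIAL = [38, 60, 62, 34, 39].freeze

# Replace decimal (&#322;) and hexadecimal (&#x142;) character references
# with the corresponding UTF-8 characters.
def decode_numeric_refs(text)
  text.gsub(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/) do |match|
    codepoint = $1 ? $1.to_i(16) : $2.to_i
    XML_SPECIAL.include?(codepoint) ? match : [codepoint].pack("U")
  end
end

# Stream the file line by line so it never has to fit in memory at once.
def decode_file(input_path, output_path)
  File.open(output_path, "w:UTF-8") do |out|
    File.foreach(input_path, encoding: "UTF-8") do |line|
      out.write(decode_numeric_refs(line))
    end
  end
end
```

For example, `decode_numeric_refs("k&#322;ania&#322;")` returns `"kłaniał"`, and `decode_file("file.xml", "decoded_file.xml")` would write a decoded copy that can then be opened as UTF-8.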

oarfish