I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.
Now, I see 2 options to read the treebank correctly:
- Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
- First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
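
For reference, here is a rough sketch of what the two options might look like. It rests on several assumptions: the codes are decimal numeric character references such as &#322;, the words sit in a word attribute of t elements (the usual TigerXML layout, not checked against this treebank), and a tiny stand-in file replaces the real one. xmlread and regexprep are existing MATLAB functions; the variable names are only illustrative.

```matlab
% Option 1: let the XML parser resolve the character references.
% A tiny stand-in file is written first; the <t word="..."/> layout is an
% assumption and is not taken from the treebank itself.
xmlFile = [tempname '.xml'];
fid = fopen(xmlFile, 'w', 'n', 'UTF-8');
fprintf(fid, '<?xml version="1.0" encoding="UTF-8"?>\n');
fprintf(fid, '<s><t word="k&#322;ania&#322;"/></s>\n');
fclose(fid);

doc   = xmlread(xmlFile);                 % Java DOM document
nodes = doc.getElementsByTagName('t');    % all <t> elements
node  = nodes.item(0);                    % first one (0-based index)
word1 = char(node.getAttribute('word'));  % Java String -> MATLAB char
delete(xmlFile);

% Option 2: keep the raw text and decode decimal references afterwards.
% The dynamic expression ${...} runs char(str2double(...)) on each match.
raw   = 'k&#322;ania&#322;';
word2 = regexprep(raw, '&#(\d+);', '${char(str2double($1))}');

disp(isequal(word1, word2));
```

If the file also contained hexadecimal references (of the form &#x142;), the second option would need an extra pattern using hex2dec; the first option should not need any change, since resolving such references is part of XML parsing.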