I'm writing a program in Java that uses the prefuse library. The program generates graphs from information collected from Twitter. I'm trying to make the program save the generated graphs so that I can load them later.

The prefuse class GraphMLWriter works fine and generates a GraphML file encoded in UTF-8 with XML version 1.0.

My problem appears when I want to load the generated GraphML file. To do that I use the method readGraph(InputStream is) of the class GraphMLReader. This method returns a Graph object and uses a SAXParser to parse the GraphML file with a handler object of the class GraphMLHandler. This object constructs the graph as the parser works through the lines of the XML file. I'm getting a SAXParseException, wrapped in a prefuse.data.io.DataIOException, when the XML file has characters like 'á' or 'ñ' or emoticons. All the generated XML files contain strings that represent tweets.

An example is:

<data key="info">Las extra&#241;o muchooooo a ambas! &#55357;&#56469;</data>

The error says:

Exception in thread "main" prefuse.data.io.DataIOException: >org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 67; The character reference "&#

and nothing else; it seems that the error message is cut off.

This is the code that I use to save a Graph object 'graph' into a GraphML file called "Graph saved":

(new GraphMLWriter()).writeGraph(graph, "Graph saved"); 

And this is the code that I use to load a graph 'g2' from the GraphML file called "Graph saved":

Graph g2 = (new GraphMLReader().readGraph("Graph saved")); 

What can I do to solve this problem?

Abraham Simpson
  • Characters like 'á' or 'ñ' (= &#241;) seemed to work when I tested. The emoticon references (&#55357;&#56469;) are causing problems in SAX (not prefuse). – alex.rind Dec 13 '16 at 15:14
  • Yes, you are right, I had not noticed that. Thanks. When I generate the graph on the fly (retrieving the Twitter information), I show the tweets in a JPopupMenu and it renders emoticons correctly. So how can I store the emoticon in the GraphML file so that it is shown later, when I load the graph? – Abraham Simpson Dec 13 '16 at 17:11

1 Answer

&#55357; and &#56469; are the two halves of a UTF-16 surrogate pair, so I'm guessing your original data contains characters outside the Basic Multilingual Plane, such as emoji. It appears that the prefuse GraphMLWriter creates an XMLWriter that makes some assumptions about encodings that aren't necessarily correct: it assumes that every char in a String is a complete code point and writes each one as its own numeric character reference. A reference to a lone surrogate is not a legal XML character, which is why the SAX parser rejects the file. In this case we appear to have a surrogate pair and some smarter handling is required (to be fair to the original author, seeing such values in the wild in 2005/2006 was somewhat unusual, and pretty much everyone assumed that Unicode meant 16 bits per character).
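To make the surrogate claim concrete, here is a minimal sketch that checks the two values from the question with the standard Character API and combines them into a single code point. The class and method names (SurrogateDemo, combine) are made up for illustration and are not part of prefuse.

```java
public class SurrogateDemo {
    /**
     * Combines two UTF-16 code units into one code point.
     * Returns -1 if the two values do not form a valid surrogate pair.
     * (Hypothetical helper, not a prefuse method.)
     */
    public static int combine(int high, int low) {
        char h = (char) high;
        char l = (char) low;
        if (!Character.isSurrogatePair(h, l)) {
            return -1;
        }
        return Character.toCodePoint(h, l);
    }

    public static void main(String[] args) {
        // The two numeric references from the failing GraphML line.
        int cp = combine(55357, 56469);
        // Prints the code point in hex and the single reference that XML accepts.
        System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                + " = &#" + cp + ";");
    }
}
```

For the values in the question this yields U+1F495, so the writer should have emitted the single reference &#128149; instead of two separate ones.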

Regardless, I think the only options you have here are to pre-filter your data, or patch the prefuse library. If you are averse to forking, one approach would be to extend GraphMLWriter and override writeGraph with an almost exact copy, substituting the creation of XMLWriter on line 73 with the creation of your own extended XMLWriter in which you override escapeString to deal with the surrogates properly. Java's Character class provides methods that tell you if a char is a surrogate, and if a pair of chars makes a valid surrogate pair. If you find such a pair, you can then generate the correct numeric character reference for the combined code point.
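As a rough sketch of the escaping logic described above, the following standalone method walks a string by code point rather than by char, so a surrogate pair becomes one numeric character reference. The class name SurrogateAwareEscaper and the static escape method are hypothetical; how this is wired into an overridden escapeString depends on prefuse's actual XMLWriter signature, which is not shown here. Replacing unpaired surrogates with U+FFFD is also just one possible choice.

```java
public class SurrogateAwareEscaper {
    /**
     * Escapes text for XML character data, iterating by code point so that
     * a valid surrogate pair is written as one numeric character reference.
     * (Hypothetical helper, not prefuse's escapeString.)
     */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // joins a valid pair into one value
            i += Character.charCount(cp);   // advances by 1 or 2 chars
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                // Unpaired surrogate: not a legal XML character.
                // Assumption: substitute the replacement character U+FFFD.
                sb.append("&#").append(0xFFFD).append(';');
            } else if (cp == '&') {
                sb.append("&amp;");
            } else if (cp == '<') {
                sb.append("&lt;");
            } else if (cp == '>') {
                sb.append("&gt;");
            } else if (cp == '"') {
                sb.append("&quot;");
            } else if (cp == '\'') {
                sb.append("&apos;");
            } else if (cp > 127) {
                // Non-ASCII: decimal reference for the whole code point.
                sb.append("&#").append(cp).append(';');
            } else {
                // Plain ASCII. Control characters that XML 1.0 forbids
                // are not filtered in this sketch.
                sb.append((char) cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Las extra\u00f1o muchooooo a ambas! \uD83D\uDC95"));
    }
}
```

With the tweet from the question, the emoji is written as &#128149;, which a SAX parser can read back as a single character. The pre-filtering option would use the same loop but drop or replace code points above U+FFFF before handing the string to prefuse.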

James Fry