
I am experimenting with the Stanford CoreNLP library, and I want to serialize the main StanfordCoreNLP pipeline object, even though it throws a java.io.NotSerializableException.

Full story: whenever I run my implementation, it takes ~15 seconds to load the pipeline annotators and classifiers into memory. The resulting process occupies about 600 MB of memory (easily small enough to store on disk in my case). I'd like to save this pipeline after creating it the first time, so I can just read it back into memory afterwards.

However, it throws a NotSerializableException. I tried making a trivial subclass that implements Serializable, but StanfordCoreNLP has annotator and classifier fields that don't implement the interface, and I can't write subclasses for all of them.

Is there any Java library that will serialize an object that doesn't implement Serializable? I suppose it would have to recurse through its properties and do the same for any similar objects.

The serialization code I tried:

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Properties;

static StanfordCoreNLP pipeline;
static String file = "/Users/ME/Desktop/pipeline.sav";

static StanfordCoreNLP pipeline() {
    if (pipeline == null) {
        try {
            // Try to restore a previously cached pipeline.
            FileInputStream saveFile = new FileInputStream(file);
            ObjectInputStream read = new ObjectInputStream(saveFile);
            pipeline = (StanfordCoreNLP) read.readObject();
            System.out.println("Pipeline loaded from file.");
            read.close();
        } catch (FileNotFoundException e) {
            // No cached pipeline yet: build one from scratch and try to save it.
            System.out.println("Cached pipeline not found. Creating new pipeline...");
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            pipeline = new StanfordCoreNLP(props);
            savePipeline(pipeline);
        } catch (IOException e) {
            System.err.println(e.getLocalizedMessage());
        } catch (ClassNotFoundException e) {
            System.err.println(e.getLocalizedMessage());
        }
    }
    return pipeline;
}

static void savePipeline(StanfordCoreNLP pipeline) {
    try {
        FileOutputStream saveFile = new FileOutputStream(file);
        ObjectOutputStream save = new ObjectOutputStream(saveFile);
        save.writeObject(pipeline);  // throws NotSerializableException for StanfordCoreNLP
        System.out.println("Pipeline saved to file.");
        save.close();
    } catch (FileNotFoundException e) {
        System.out.println("Pipeline file not found during save.");
    } catch (IOException e) {
        System.err.println(e.getLocalizedMessage());
    }
}
alexchandel
  • That's the best answer I found on SO: http://stackoverflow.com/a/97630/439427 (it is a reference to Effective Java) – Rubens Mariuzzo Sep 04 '12 at 03:14
  • possible duplicate of [Java Serialization with non serializable parts](http://stackoverflow.com/questions/95181/java-serialization-with-non-serializable-parts) – Stephen C Sep 04 '12 at 03:16

2 Answers


In general, the Stanford NLP classes that represent data objects (Tree, LexicalizedParser, etc.) are serializable, while the classes that represent processors (StanfordCoreNLP, LexicalizedParserQuery, CRFClassifier) are not. To achieve what you're asking for, you'd have to make serializable a lot of classes that currently aren't, and deal with any ramifications of that.
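
For instance, the data objects round-trip fine with plain Java serialization. A minimal sketch (file name arbitrary, error handling elided):

import edu.stanford.nlp.trees.Tree;
import java.io.*;

Tree tree = Tree.valueOf("(ROOT (NP (NN example)))");

// Tree implements java.io.Serializable, so a standard object stream works.
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("tree.ser"))) {
    out.writeObject(tree);
}

try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("tree.ser"))) {
    Tree restored = (Tree) in.readObject();  // also declares ClassNotFoundException
    System.out.println(restored);
}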

However, I think your underlying reasoning is mistaken. The things that StanfordCoreNLP loads during those 15 seconds are mainly standard Java serialized objects: the NER classifiers and the parser grammars are standard serialized Java objects. (A couple of things, including the POS tagger data, aren't of this form but are just binary data, largely for historical reasons.) The fact is that loading a lot of objects with standard Java serialization is not that fast; you can find discussions on the web of the speed of Java serialization and of how alternatives compare. Making a new, even larger serialized object that contains all the current serialized objects couldn't make it much quicker. (You could potentially gain a fraction by having everything in one continuous data stream, but unless you do extra work marking transient fields that don't need to be serialized, you would almost surely lose from the increased size of the serialized data structures.)
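
To make the transient point concrete, a tiny generic sketch (the class here is hypothetical, not a CoreNLP class):

import java.io.Serializable;

class CachedModel implements Serializable {
    double[] weights;            // written to the stream
    transient double[] scratch;  // skipped by serialization; must be rebuilt after loading
}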

Rather, I would suggest that the key to dealing with this problem is to pay the cost of loading the system only once, and then to keep it in memory while processing many sentences.
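
Concretely, that pattern looks something like this (a minimal sketch; PipelineHolder is just an illustrative name):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PipelineHolder {
    private static StanfordCoreNLP pipeline;

    // Build the pipeline once per JVM; every later call reuses the same instance.
    public static synchronized StanfordCoreNLP get() {
        if (pipeline == null) {
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            pipeline = new StanfordCoreNLP(props);  // the ~15 s load, paid once
        }
        return pipeline;
    }

    public static void main(String[] args) {
        StanfordCoreNLP nlp = get();
        // Amortize the startup cost over many documents.
        for (String text : args) {
            Annotation document = new Annotation(text);
            nlp.annotate(document);
            System.out.println(document);
        }
    }
}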

Christopher Manning

If truly the only reason that it is not serializable is that it isn't marked as Serializable, then you may be able to get away with some non-default serialization strategy. For example, you could try Jackson or XStream.
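
A minimal sketch of the XStream route, assuming the XStream jar is on the classpath (XStream serializes via reflection and does not require the target class to implement Serializable):

import com.thoughtworks.xstream.XStream;
import java.io.FileReader;
import java.io.FileWriter;

XStream xstream = new XStream();

// Write the object graph as XML; no Serializable marker required.
try (FileWriter out = new FileWriter("/Users/ME/Desktop/pipeline.xml")) {
    xstream.toXML(pipeline, out);
}

// Read it back later.
StanfordCoreNLP restored =
        (StanfordCoreNLP) xstream.fromXML(new FileReader("/Users/ME/Desktop/pipeline.xml"));

Expect this to be slow and to produce a very large file for an object graph this size, and it can still fail on fields such as threads or open streams.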

That said, if there is a good reason it is not Serializable in the first place, these strategies are likely to break in interesting ways. Test thoroughly!

Steven Schlansker