I have a spout class that has several integer and string attributes, which are serialized/deserialized as expected. The class also has one LinkedList holding byte arrays. This LinkedList is always empty when an object is deserialized.
I've added log statements to all of the spout methods and can see the spout's 'activate' method being called, after which the LinkedList is empty. I do not see any log output from the 'deactivate' method when this happens.
It seems odd that the spout 'activate' method is being called without the 'deactivate' method having been called. When the 'activate' method is called, there has not been any resubmission of the topology.
I also have a log statement in the spout constructor, which is not called prior to the LinkedList being emptied.
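A missing constructor log does not by itself rule out a fresh instance. Under standard Java serialization, deserializing a Serializable class does not run that class's own constructors. The sketch below illustrates this outside Storm. It assumes the spout is restored with plain Java serialization, which the Storm page quoted further down does not confirm; ProbeSpout and ConstructorProbeDemo are hypothetical names, not Storm classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the spout; not a Storm class.
class ProbeSpout implements Serializable {
    private static final long serialVersionUID = 1L;

    // Static, so it is not part of the serialized state.
    static int constructorCalls = 0;

    ProbeSpout() {
        constructorCalls++; // stands in for the constructor log statement
    }
}

class ConstructorProbeDemo {
    // Serializes and then deserializes the object with plain Java serialization.
    static Object roundTrip(Serializable original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ProbeSpout original = new ProbeSpout();
        Object copy = roundTrip(original);
        // The copy is a new object, yet the constructor ran only once.
        System.out.println("Distinct instance: " + (copy != original)); // true
        System.out.println("Constructor calls: " + ProbeSpout.constructorCalls); // 1
    }
}
```

If the same holds inside Storm, a log statement in 'open' would be a more reliable marker of a new spout instance than one in the constructor.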
I've also verified repeatedly that there are no calls anywhere within the spout class to any method that would completely empty the LinkedList. There is one place that uses the poll method, which is immediately followed by a log statement that records the new LinkedList size.
I found this reference, which says Kryo is used for serialization, but that may apply only to tuple data: http://storm.apache.org/documentation/Serialization.html
Storm uses Kryo for serialization. Kryo is a flexible and fast serialization library that produces small serializations.
By default, Storm can serialize primitive types, strings, byte arrays, ArrayList, HashMap, HashSet, and the Clojure collection types. If you want to use another type in your tuples, you'll need to register a custom serializer.
The article makes it sound like Kryo may be used only for serializing and passing tuples. If it is used for the Spout object as well, I can't figure out how to use a LinkedList, and ArrayLists and HashMaps aren't really a good alternative for a FIFO queue. Will I have to roll my own LinkedList?
import java.util.LinkedList;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Values;
// Logger is the application's own logging helper; MqttException comes from the MQTT client library.

public class MySpout extends BaseRichSpout
{
    private SpoutOutputCollector _collector;
    // FIFO queue of raw messages waiting to be emitted
    private LinkedList<byte[]> messages = new LinkedList<byte[]>();

    public MySpout()
    {
        messages = new LinkedList<byte[]>();
    }

    // Called by the message source to enqueue a message
    public void add(byte[] message)
    {
        messages.add(message);
    }

    @Override
    public void open( Map conf, TopologyContext context,
                      SpoutOutputCollector collector )
    {
        _collector = collector;
        try
        {
            Logger.getInstance().addMessage("Opening Spout");
            // ####### Open client connection here to read messages
        }
        catch (MqttException e)
        {
            e.printStackTrace();
        }
    }

    @Override
    public void close()
    {
        Logger.getInstance().addMessage("Close Method Called!!!!!!!!!!!!!!!!!");
    }

    @Override
    public void activate()
    {
        Logger.getInstance().addMessage("Activate Method Called!!!!!!!!!!!!!!!!!");
    }

    @Override
    public void nextTuple()
    {
        if (!messages.isEmpty())
        {
            System.out.println("Tuple emitted from spout");
            // poll() removes and returns the head of the queue
            _collector.emit(new Values(messages.poll()));
            Logger.getInstance().addMessage("Tuple emitted from spout. Remaining in queue: " + messages.size());
            try
            {
                Thread.sleep(1);
            }
            catch (InterruptedException e)
            {
                Logger.getInstance().addMessage("Sleep thread interrupted in nextTuple(). " + Logger.convertStacktraceToString(e));
                e.printStackTrace();
            }
        }
    }

    // declareOutputFields() is not shown in this excerpt.
}
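To check whether a LinkedList of byte arrays survives serialization at all, here is a minimal sketch using plain Java serialization. It also shows that a copy restored from bytes captured while the queue was still empty comes back empty, no matter what was added to the original afterwards. Whether Storm restores the spout from such an early snapshot is an assumption on my part; QueueHolder and SnapshotDemo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.LinkedList;

// Hypothetical stand-in for the spout's queue-related state.
class QueueHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    private final LinkedList<byte[]> messages = new LinkedList<byte[]>();

    void add(byte[] message) { messages.add(message); }
    byte[] poll() { return messages.poll(); }
    int size() { return messages.size(); }
}

class SnapshotDemo {
    // Captures the object's state at the moment of the call.
    static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        QueueHolder holder = new QueueHolder();
        byte[] emptySnapshot = serialize(holder);   // taken before any add()

        holder.add(new byte[] {1, 2});
        holder.add(new byte[] {3});
        byte[] fullSnapshot = serialize(holder);    // taken after two add() calls

        QueueHolder fromEmpty = (QueueHolder) deserialize(emptySnapshot);
        QueueHolder fromFull = (QueueHolder) deserialize(fullSnapshot);

        System.out.println(fromEmpty.size());                  // 0
        System.out.println(fromFull.size());                   // 2
        System.out.println(Arrays.toString(fromFull.poll()));  // [1, 2]
    }
}
```

With plain Java serialization the list contents and FIFO order are preserved, so LinkedList itself does not appear to be the problem in this sketch; the timing of the snapshot is what decides whether the restored queue is empty.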
EDIT:
SO question: Java Serialization of referenced objects is "losing values"?
Java Specialists article: http://www.javaspecialists.eu/archive/Issue088.html
The above SO link and the Java Specialists article call out specific examples similar to what I am seeing, and there the issue is due to the serialization/deserialization cache. But because Storm is doing that work, I'm not sure what can be done about the issue.
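For reference, the cache behaviour those links describe can be reproduced outside Storm. In this minimal sketch the same list is written twice to one ObjectOutputStream with a change in between; without a reset() the second write is only a back-reference to the first, so the change is lost on read. StreamCacheDemo and sizesReadBack are hypothetical names, and I have not established that Storm's own serialization takes this path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.LinkedList;

class StreamCacheDemo {
    // Writes the same list twice to one stream, adding an element in between,
    // and returns the sizes of the two lists read back.
    static int[] sizesReadBack(boolean resetBetweenWrites) {
        try {
            LinkedList<byte[]> messages = new LinkedList<byte[]>();
            messages.add(new byte[] {1});

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(messages);      // first write: one element
                messages.add(new byte[] {2});   // mutate the already-written object
                if (resetBetweenWrites) {
                    out.reset();                // forget previously written objects
                }
                out.writeObject(messages);      // second write: two elements in memory
            }

            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                LinkedList<?> first = (LinkedList<?>) in.readObject();
                LinkedList<?> second = (LinkedList<?>) in.readObject();
                return new int[] {first.size(), second.size()};
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] cached = sizesReadBack(false);
        int[] reset = sizesReadBack(true);
        System.out.println(cached[0] + ", " + cached[1]); // 1, 1 (second write was a back-reference)
        System.out.println(reset[0] + ", " + reset[1]);   // 1, 2
    }
}
```

Note that this cache loses later additions rather than emptying a list that was already populated when first written, so it may not match my symptom exactly.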
At the end of the day, it also seems like the bigger issue is that Storm is suddenly serializing/deserializing the data in the first place.
EDIT:
Just prior to the Spout being activated, a good number of log messages come through in less than a second that read:
Executor MyTopology-1-1447093098:[X Y] not alive
After those messages, there is a log of:
Setting new assignment for topology id MyTopology-1-1447093098: #backtype.storm.daemon.common.Assignment{:master-code-dir ...