
I am looking at production code in the Hadoop framework that does not make sense to me. Why are we using `transient`, and why can't I make the utility method static (my lead told me not to make `isThinger` a static method)? I looked up the `transient` keyword, and it is related to serialization. Is serialization really used here?

//extending from MapReduceBase is a requirement of hadoop
public static class MyMapper extends MapReduceBase {

    // why the use of transient keyword here?
    transient Utility utility;

    public void configure(JobConf job) {

        String test = job.get("key");

        // seems silly that we have to create Utility instance.
        // can't we use a static method instead?
        utility = new Utility();

        boolean res = utility.isThinger(test);

        foo (res);
    }

    void foo (boolean a) { }
}


public class Utility {
   final String stringToSearchFor = "ineverchange";

   // it seems we could make this static.  Why can't we?
   public boolean isThinger(String word) {
      boolean val = false;
      if (word.indexOf(stringToSearchFor) > 0) {
           val = true;
      }
      return val;
   }
}
MedicineMan

2 Answers


The issue with your code comes down to the difference between local mode (usually used for development and test cases) and distributed mode.

In local mode everything runs inside a single JVM, so if you change a static variable (or call a static method that shares some state, in your case `stringToSearchFor`), you can safely assume the change will be visible to the computation of every chunk of input.

In distributed mode, every chunk is processed in its own JVM. So if you change that state (e.g. `stringToSearchFor`), the change won't be visible to the other processes running on other hosts/JVMs/tasks.

This is an inconsistency that leads to the following design principles when writing map/reduce functions:

  1. Be as stateless as possible.
  2. If you need state (mutable classes, for example), never declare the references in the map/reduce classes static (otherwise the code will behave differently in testing/development than in production).
  3. Immutable constants (for example configuration keys as String) should be defined static and final.
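The three principles above can be illustrated without Hadoop at all. The following is only a minimal sketch: `SketchTask` and `SketchTaskDemo` are hypothetical stand-ins for a map task, not Hadoop classes. It shows two task objects in one JVM (as in local mode) sharing a mutable static list while keeping their instance state separate. In distributed mode each task JVM would have its own copy of the static list, which is exactly the inconsistency described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a map task; this is NOT a Hadoop class.
class SketchTask {
    // Principle 3: an immutable constant can safely be static and final.
    static final String CONFIG_KEY = "key";

    // Anti-pattern: mutable static state, shared by every task in the same JVM.
    static final List<String> SHARED_SEEN = new ArrayList<>();

    // Principle 2: mutable state is kept per instance instead.
    private final List<String> seen = new ArrayList<>();

    void process(String record) {
        SHARED_SEEN.add(record); // visible to all tasks, but only within this JVM
        seen.add(record);        // visible to this task only
    }

    int seenByThisTask() {
        return seen.size();
    }
}

class SketchTaskDemo {
    public static void main(String[] args) {
        SketchTask first = new SketchTask();
        SketchTask second = new SketchTask();
        first.process("a");
        second.process("b");

        System.out.println(first.seenByThisTask());        // prints 1
        System.out.println(SketchTask.SHARED_SEEN.size()); // prints 2 in a fresh JVM
    }
}
```

If the two tasks ran in separate JVMs, each would see a static list of size 1, so any logic relying on the shared list would pass locally and behave differently on a cluster.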

`transient` is pretty much useless in Hadoop: Hadoop does not serialize anything in the user code (the Mapper/Reducer class or object). It only matters if you are doing something with Java serialization that we don't know about.
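For reference, here is what `transient` does when Java serialization is actually in play. This is a small standalone sketch with hypothetical class names (`TransientHolder`, `TransientDemo`) and nothing Hadoop-specific: the transient field is skipped when the object is written, so it comes back as `null` after a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class used only to demonstrate the keyword.
class TransientHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    String kept = "kept";                 // written to the stream
    transient String skipped = "skipped"; // ignored by Java serialization
}

class TransientDemo {
    // Serializes the object to a byte array and reads it back.
    static TransientHolder roundTrip(TransientHolder original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TransientHolder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        TransientHolder copy = roundTrip(new TransientHolder());
        System.out.println(copy.kept);    // prints kept
        System.out.println(copy.skipped); // prints null
    }
}
```

If nothing in your job ever writes the mapper through an `ObjectOutputStream` like this, the keyword has no effect.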

In your case, if `Utility` really is just a utility and `stringToSearchFor` is an immutable constant (so it is never changed), you can safely declare `isThinger` as static. And please remove that `transient` if you don't do any Java serialization with your `MapReduceBase` class.
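A static version could look roughly like the sketch below. The class name `StaticUtility` and the constant name `STRING_TO_SEARCH_FOR` are my own choices for illustration; the comparison is copied from the question unchanged. Note that `indexOf(...) > 0` returns false when the match is at the very start of the word, so if that is not intended, `>= 0` (or `contains`) would be needed instead.

```java
// Hypothetical static variant of the Utility class from the question.
final class StaticUtility {
    // Immutable constant: safe to share, in local and distributed mode alike.
    private static final String STRING_TO_SEARCH_FOR = "ineverchange";

    // No instances needed; the class only holds a stateless helper.
    private StaticUtility() { }

    static boolean isThinger(String word) {
        // Same comparison as the original code: a match at index 0 yields false.
        return word.indexOf(STRING_TO_SEARCH_FOR) > 0;
    }
}
```

In `configure` the call would then be `boolean res = StaticUtility.isThinger(test);`, with no field and no `new Utility()`.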

Thomas Jungblut
  • Thanks for the tips on writing MR functions. I've made stringToSearchFor a constant. I don't believe the static method contains any state. – MedicineMan Feb 07 '13 at 16:38

Unless there is something not shown here, I suspect that whether to make `isThinger` a static method largely comes down to style. In particular, if you are instantiating the `Utility` instance on demand inside the mapper rather than injecting it, then holding an instance is rather pointless. As written, it cannot be overridden, nor can it be tested more easily than a static method.

As for `transient`, you are right that it is unnecessary. I wouldn't be surprised if the original developer was using serialization somewhere in the inheritance or implementation chain and marked the non-serializable instance variable as `transient` to avoid a compiler warning.
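To make the contrast concrete, this is roughly what injection would look like. It is a generic sketch with hypothetical names (`ThingerUtility`, `InjectedMapperSketch`) and no Hadoop types; how an instance would actually be handed to a real mapper is left open here. The point is only that an injected instance can be swapped for a stub in a test, which the on-demand `new Utility()` in the question does not allow.

```java
// Hypothetical copy of the utility, kept non-final so it can be overridden.
class ThingerUtility {
    boolean isThinger(String word) {
        // Same comparison as in the question.
        return word.indexOf("ineverchange") > 0;
    }
}

// Hypothetical collaborator that receives its utility instead of creating it.
class InjectedMapperSketch {
    private final ThingerUtility utility;

    InjectedMapperSketch(ThingerUtility utility) {
        this.utility = utility;
    }

    boolean check(String value) {
        return utility.isThinger(value);
    }
}

class InjectionDemo {
    public static void main(String[] args) {
        // Production-style wiring: the real utility.
        InjectedMapperSketch real = new InjectedMapperSketch(new ThingerUtility());
        System.out.println(real.check("xineverchange")); // prints true

        // Test-style wiring: a stub that always answers true.
        InjectedMapperSketch stubbed = new InjectedMapperSketch(new ThingerUtility() {
            @Override
            boolean isThinger(String word) {
                return true;
            }
        });
        System.out.println(stubbed.check("anything")); // prints true
    }
}
```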

pickypg
  • We are not injecting the utility, and we are instantiating it on demand, within. Have you considered that configure() is called within the context of a Hadoop map job? Could that change things? – MedicineMan Feb 07 '13 at 04:47
  • It shouldn't. The only way that a `static` method will possibly change is if a different `ClassLoader` is used when configuring the next time through that provides a different version of the `static` method (by replacing the `class`) and the same goes for the `Utility` class as well. Because it's directly instantiated on demand, there is no mechanism for replacement (so, potentially extending `Utility` later provides no potential benefit to the above code). This provides very little benefit, and appears to be a waste of memory to even hang onto as an instance variable. – pickypg Feb 07 '13 at 06:27
  • It's probably also worth noting that [`MapReduceBase` is `Deprecated`](http://stackoverflow.com/questions/7626077/mapreducebase-and-mapper-deprecated). – pickypg Feb 07 '13 at 06:31
  • It is undeprecated in the newer APIs, because there would be too much user code to rewrite. Both APIs are pretty much compatible and up to date. – Thomas Jungblut Feb 07 '13 at 08:07
  • @Thomas Good point. Clearly shows my inexperience with Hadoop. (The answer in there noted that it was undeprecated, and then re-deprecated, but checking the 1.x docs shows that it is still undeprecated). – pickypg Feb 07 '13 at 17:50