
I want to implement a class with a function that reads from HBase via Spark, like this:

public abstract class QueryNode implements Serializable {
    private static final long serialVersionUID = -2961214832101500548L;

    private int id;
    private int parent;

    protected static Configuration hbaseConf;
    protected static Scan scan;
    protected static JavaSparkContext sc;

    public abstract RDDResult query();

    public int getParent() {
        return parent;
    }

    public void setParent(int parent) {
        this.parent = parent;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public void setScanToConf() {
        try {
            ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
            String scanToString = Base64.encodeBytes(proto.toByteArray());
            hbaseConf.set(TableInputFormat.SCAN, scanToString);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This is a parent class; I have some subclasses that implement the method query() to read from HBase. But if I make Configuration, Scan and JavaSparkContext non-static, I get errors saying these classes are not serializable.

Why must these fields be static? Is there some other way to solve this problem? Thanks.


1 Answer


You can try marking these fields transient to avoid a serialization exception like

Caused by: java.io.NotSerializableException: org.apache.spark.streaming.api.java.JavaStreamingContext

so you tell Java that you simply don't want these fields serialized:

  protected transient Configuration hbaseConf;
  protected transient Scan scan;
  protected transient JavaSparkContext sc;

Are you initializing JavaSparkContext, Configuration and Scan in main or in some static method? With static, the fields are shared across all instances. Whether static should be used depends on your use case.

But the transient way is better than static, because serializing the JavaSparkContext makes no sense: it is created on the driver.
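For illustration, a minimal sketch of how the transient fields could be populated on the driver before query() is called. ChildQueryNode and the app name are hypothetical, assigning the protected fields directly assumes the caller lives in the same package (a constructor would work too), and the usual Spark/HBase imports are assumed:

  SparkConf sparkConf = new SparkConf().setAppName("hbase-query"); // hypothetical app name
  JavaSparkContext sc = new JavaSparkContext(sparkConf);

  QueryNode node = new ChildQueryNode();         // hypothetical subclass implementing query()
  node.sc = sc;                                  // transient: never shipped to executors
  node.hbaseConf = HBaseConfiguration.create();  // created locally on the driver
  node.scan = new Scan();
  node.setScanToConf();

  RDDResult result = node.query();               // newAPIHadoopRDD is called on the driver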


-- edit after discussion in the comments:

Javadoc for newAPIHadoopRDD:

public <K,V,F extends org.apache.hadoop.mapreduce.InputFormat<K,V>>
    JavaPairRDD<K,V> newAPIHadoopRDD(org.apache.hadoop.conf.Configuration conf,
                                     Class<F> fClass,
                                     Class<K> kClass,
                                     Class<V> vClass)

conf - Configuration for setting up the dataset. Note: This will be put into a Broadcast. Therefore if you plan to reuse this conf to create multiple RDDs, you need to make sure you won't modify the conf. A safe approach is always creating a new conf for a new RDD.

Broadcast:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

So basically I think static is OK for that case (you create hbaseConf only once), but if you want to avoid static, you can follow the suggestion in the Javadoc and always create a new conf for each new RDD, as sketched below.
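A minimal sketch of that per-RDD approach inside a child's query() implementation. The table name "my_table" is hypothetical, error handling for the IOException from ProtobufUtil.toScan is omitted, and the usual HBase/Spark imports are assumed:

  // a fresh Configuration for this RDD only, so nothing shared gets mutated
  Configuration conf = HBaseConfiguration.create();
  conf.set(TableInputFormat.INPUT_TABLE, "my_table");  // hypothetical table name

  // pass the Scan the same way setScanToConf() does
  ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
  conf.set(TableInputFormat.SCAN, Base64.encodeBytes(proto.toByteArray()));

  // newAPIHadoopRDD puts this conf into a Broadcast internally
  JavaPairRDD<ImmutableBytesWritable, Result> rdd =
      sc.newAPIHadoopRDD(conf, TableInputFormat.class,
                         ImmutableBytesWritable.class, Result.class);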

  • I tested transient on hbaseConf, but in cluster mode I get a class-not-found error for the HBase configuration; with static I don't have this problem. Do you know the general process by which Spark connects to HBase? I understand that if I don't use static, every node should create the hbaseConf in its local JVM, so why does Spark need to serialize hbaseConf? – yaowin Oct 15 '16 at 08:34
  • 1, Are you using [newAPIHadoopRDD](https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)) in `public abstract RDDResult query();` implementation in the child? 2, you said "class not found for hbase configuration" - which class was not found? – VladoDemcak Oct 15 '16 at 09:50
  • 1. Yes, I use newAPIHadoopRDD in the query() implementation in the child. – yaowin Oct 15 '16 at 12:48
  • 2. The class is 'org.apache.hadoop.conf.Configuration'. – yaowin Oct 15 '16 at 12:51
  • @yaowin check my answer, I added some points. hbaseConf is a broadcast variable for `newAPIHadoopRDD`. – VladoDemcak Oct 15 '16 at 13:46