1

I have a PTable<String, Pair<Entity1, Entity2>> which is generated at an intermediate stage of program on which i am running transformation job. Sample PTable entry:

["0067b4c054d14fe2-ACC8D37", 
 [{
    "unique_id": "0067b4c054d14fe2-ACC8D37",
    "user_id": "ACC8D37",
    "campaign_id": "3URL6GC2",
    "seller_id": "0067b4c"
}, {
    "unique_id": "0067b4c054d14fe2-ACC8D37",
    "user_id": "ACC8D37",
    "seller_id": "0067b4c"
}]]

I neeed to get a PCollection<String> where Entity2 is null

Transformer DoFn

private PCollection<String> transformPairTable(PCollection<Pair<Entity1, Entity2>> pairPCollection) {
        return pairPCollection.parallelDo(new DoFn<Pair<Entity1, Entity2>, String>() {
            @Override
            public void process(Pair<Entity1, Entity2> entityPair, Emitter<String> emitter) {
                Entity1 entity1 = entityPair.first();
                Entity2 entity2 = entityPair.second();
                if (entity2 == null) {
                    String campaignId = entity1.getCampaignId();
                    emitter.emit(campaignId);
                }
            }
        }, Avros.strings());
    }

Entity1 and Entity2 both are Generated classes from Avro schema.

But when i run the job it throws runtime exception

Exception in thread "main" org.apache.crunch.CrunchRuntimeException: java.io.NotSerializableException: com.xxx.bb.data.vv.jobs.GenerateCount
    at org.apache.crunch.impl.mr.MRPipeline.plan(MRPipeline.java:140)
    at org.apache.crunch.impl.mr.MRPipeline.runAsync(MRPipeline.java:159)
    at org.apache.crunch.impl.mr.MRPipeline.run(MRPipeline.java:147)
    at org.apache.crunch.materialize.MaterializableIterable.iterator(MaterializableIterable.java:94)
    at org.apache.crunch.materialize.pobject.FirstElementPObject.process(FirstElementPObject.java:44)
    at org.apache.crunch.materialize.pobject.PObjectImpl.getValue(PObjectImpl.java:71)
    at com.xxxx.bb.xxxx.xxx.jobs.GenerateNewCount.getNewCustomersPCollection(GenerateNewCount.java:239)
    at com.xxx.x.data.x.jobs.GenerateNewCount.runJob(GenerateNewCount.java:78)
    at com.xxx.x.data.xx.jobs.BaseJob.run(BaseJob.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)

I tried Writables.strings() but it gave same exception.

The PTable which is used in the job has entries with Entity2 equals to null.

I tried transforming the PTable in many ways but its not working. i am not able to figure out the main reason behind it.

When i use

Entity1 p1= pairPCollection.first().getValue().first();

It throws exception mentioned below:

ERROR materialize.MaterializableIterable: Could not materialize: Avro(/tmp/crunch-1893789994/p7)
java.io.IOException: No files found to materialize at: /tmp/crunch-1893789994/p7
    at org.apache.crunch.io.CompositePathIterable.create(CompositePathIterable.java:49)
    at org.apache.crunch.io.impl.FileSourceImpl.read(FileSourceImpl.java:134)
    at org.apache.crunch.io.avro.AvroFileSource.read(AvroFileSource.java:89)
mukul
  • 433
  • 7
  • 18
  • can you add this code ? at com.xxxx.bb.xxxx.xxx.jobs.GenerateNewCount.getNewCustomersPCollection(GenerateNewCount.java:239).. there is your problem.. probably trying to pass through a constructor a non serializable object. – hlagos Sep 18 '17 at 15:42
  • I was using DoFn inside a non static function due to which the error was coming. DoFn instances that are defined inline inside of static methods do not contain references to their outer classes, and so can be serialized regardless of whether the outer class is serializable or not. Changing the method from Non static to static solved the problem. – mukul Sep 23 '17 at 03:42

0 Answers0