
I am implementing a Spark process in Java, and want to build, from an RDD of the same parameterized type, a Dataset<Try<MyPojo>> for my own MyPojo class, where Try is the Scala Try. In Scala, the encoder would be derived implicitly, but in Java I need to provide it explicitly.

Now, I can get a working Encoder<MyPojo> using Encoders.bean(MyPojo.class). And I expect that there is some code, used by the Scala implicit, to build an Encoder<Try<T>> from an Encoder<T>. But I cannot find it.

[Note: I just tried in Scala and no implicit was found for type Try... so the question is valid in Scala too]

So, how am I supposed to do this?

Juh_

1 Answer


After some searching I reached the conclusion that:

  1. it is not possible (or maybe it is, but it would be overly complicated)
  2. and that's because this is not how Dataset is meant to be used

At first, I considered Dataset to be a super, more generic, version of RDD. But it is not. It is actually less generic with respect to types, because the type stored in a Dataset should be "flat" or "flatten-able".

Traditional POJOs either have a flat structure (each field has a value type that can be represented by one column) or can be flattened when fields have POJO types themselves. On the other hand, there is no trivial way to "flatten" a type such as Try, which is basically either some type (MyPojo in the question) or an Exception.
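To illustrate what "flat" means here: Encoders.bean discovers columns from bean properties, which can be mimicked with the standard java.beans.Introspector. This is a rough sketch, not Spark's actual implementation, and the MyPojo fields are stand-ins:

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanColumns {
    // A flat POJO: every bean property maps to exactly one column.
    public static class MyPojo {
        private String name;
        private int count;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getCount() { return count; }
        public void setCount(int count) { this.count = count; }
    }

    // List the column names a bean-style encoder could derive from a class.
    public static List<String> columnsOf(Class<?> cls) throws IntrospectionException {
        List<String> cols = new ArrayList<>();
        // Stop at Object.class so inherited properties like getClass() are excluded.
        BeanInfo info = Introspector.getBeanInfo(cls, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            cols.add(pd.getName());
        }
        Collections.sort(cols);
        return cols;
    }

    public static void main(String[] args) throws IntrospectionException {
        System.out.println(columnsOf(MyPojo.class)); // [count, name]
    }
}
```

A Try, by contrast, has no such fixed set of bean properties: its success branch and failure branch would each need a different schema.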

And that conclusion also applies to all non-POJO types, such as interfaces, which can have several implementations. Obviously this leads to a question: what about classes that are not POJOs, e.g. because they contain a field of Try or interface type? Probably Encoders.bean would fail at runtime. So much for type-safety...

Well, in conclusion, to solve my problem, which is to keep track of failed items, I think I will go for adding an "error" column, or something like that.
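A minimal sketch of that "error column" idea (the class name MyPojoOrError and its fields are hypothetical): instead of Dataset<Try<MyPojo>>, use a flat wrapper where either the payload fields or the error column are set, which Encoders.bean should be able to handle:

```java
// Hypothetical flat wrapper replacing Try<MyPojo>: the payload fields
// plus one extra "error" column (null on success).
public class MyPojoOrError {
    private String name;   // example payload field from MyPojo (null on failure)
    private String error;  // exception description (null on success)

    public MyPojoOrError() {}  // no-arg constructor, as bean encoders expect

    public static MyPojoOrError success(String name) {
        MyPojoOrError r = new MyPojoOrError();
        r.name = name;
        return r;
    }

    public static MyPojoOrError failure(Exception e) {
        MyPojoOrError r = new MyPojoOrError();
        r.error = e.getClass().getSimpleName() + ": " + e.getMessage();
        return r;
    }

    public boolean isSuccess() { return error == null; }

    // Bean getters/setters so Encoders.bean(MyPojoOrError.class) sees the columns.
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getError() { return error; }
    public void setError(String error) { this.error = error; }
}
```

One could then map each input item with a try/catch producing success(...) or failure(...), and later filter the Dataset on the error column being null or not.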

Juh_