I am working off Kafka 2.3.0 and Spark 2.3.4. I am trying to operate on a Dataset by running a filter on it, but I get an AnalysisException and am not able to figure out the resolution. Note that the POJO Dataset below (se) is fine by itself and prints well on a console sink; the code here is a continuation of that. Do note that the POJO Dataset is built from streaming data coming in via a Kafka topic.
I have looked at the column names to find mismatches, if any, and tried variations of the filter statement using lambdas as well as SQL-style expressions (a sketch of the latter is included below). I think I'm missing something in my understanding of how to make this work.
Here is the POJO class:
public class Pojoclass2 implements Serializable {
    private java.sql.Date dt;
    private String ct;
    private String r;
    private String b;
    private String s;
    private Integer iid;
    private String iname;
    private Integer icatid;
    private String cat;
    private Integer rvee;
    private Integer icee;
    private Integer opcode;
    private String optype;
    private String opname;

    public Pojoclass2() {}
    ...
    //getters and setters
}
//What works (dataAsSchema2 is a Dataset<Row> formed from the incoming streaming data of a Kafka topic):
Encoder<Pojoclass2> encoder = Encoders.bean(Pojoclass2.class);
Dataset<Pojoclass2> se = new Dataset<Pojoclass2>(sparkSession,
        dataAsSchema2.logicalPlan(), encoder);
//I can print se to a console sink and it is all good. I can do all filtering on se, but I can only receive the return value as a Dataset<Row>.
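For reference, I understand the more idiomatic way to obtain the typed Dataset would be via as(...) rather than the Dataset constructor; a minimal sketch of that conversion (same Pojoclass2 bean, same dataAsSchema2 source) would be:

//sketch: converting the Dataset<Row> to a bean-typed Dataset via as(...)
//(assumes the column names of dataAsSchema2 match the bean's field names)
Dataset<Pojoclass2> se2 = dataAsSchema2.as(Encoders.bean(Pojoclass2.class));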
//What doesn't work (it compiles but throws the AnalysisException at runtime):
Dataset<Pojoclass2> h = se
    .filter((FilterFunction<Pojoclass2>) s -> "ASD".equals(s.getBuyerName()));
//or
Dataset<Pojoclass2> h = se
    .filter((FilterFunction<Pojoclass2>) s -> "ASD".equals(s.getBuyerName()))
    .as(Encoders.bean(Pojoclass2.class));
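One of the SQL-style variations I mentioned at the top looked roughly like this (an illustrative sketch; "b" stands in for one of the obfuscated bean fields):

//sql-style variation (sketch), using org.apache.spark.sql.functions.col
Dataset<Pojoclass2> h2 = se.filter(functions.col("b").equalTo("ASD"));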
And the error trace (note that this is the actual trace; in Pojoclass2 I've changed the attribute names to protect confidentiality, so you may see differences in the names, but the types match):
"
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`contactRole`' given input columns: [iname, ct, icatid, s, r, b, opname, cat, opcode, dt, iid, icee, optype, rvee];;
'TypedFilter ibs.someengine.spark.somecore.SomeMain$$Lambda$17/902556500@23f8036d,
...
...
"
I expect the filter to run properly, with h containing the filtered, strongly typed rows.
Currently, I am working around this by converting to a DataFrame (Dataset<Row>), but that sort of defeats the purpose (I guess).
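The workaround looks roughly like this (a sketch; col("b") again stands in for the real predicate):

//current workaround: filter through the untyped DataFrame API
Dataset<Row> filteredRows = se.toDF().filter(functions.col("b").equalTo("ASD"));
//the result stays a Dataset<Row>, which is what defeats the purpose of the typed bean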
I also noticed that very few operations on strongly typed Datasets seem to be supported in a way that allows operating via the bean class. Is that a valid understanding?
Thanks!