I read a parquet file and get a Dataset<Row> containing 57 columns.
Dataset<Row> ds = spark.read().parquet(locations);
I would like to use a custom type instead of Row, so I have defined a Java bean such as:
import lombok.Getter;
import lombok.Setter;
import java.io.Serializable;
@Getter
@Setter
public class MyBean implements Serializable {
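// Spark's bean encoder needs a public no-arg constructor to instantiate the bean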
public MyBean() { }
private String col1;
private String col2;
private String col3;
}
The field names perfectly match the column names in the parquet file (which is annoying, because they are in snake_case and I would like camelCase in my POJO, but that is not the main issue here).
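(As an aside, if the naming mismatch ever became blocking, I assume I could rename the columns before converting; the snake_case names below are made up for illustration:

Dataset<Row> renamed = ds
        .withColumnRenamed("col_1", "col1")
        .withColumnRenamed("col_2", "col2")
        .withColumnRenamed("col_3", "col3");
)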
To convert the Dataset<Row> to a Dataset<MyBean> I use ds.as(Encoders.bean(MyBean.class)).
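For what it's worth, the encoder itself only describes the three bean fields; printing its schema (via Encoder#schema) gives something like:

Encoders.bean(MyBean.class).schema().printTreeString();
// root
//  |-- col1: string (nullable = true)
//  |-- col2: string (nullable = true)
//  |-- col3: string (nullable = true)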
However, if I show the resulting Dataset<MyBean>, it still has all 57 columns. I was expecting only the 3 columns corresponding to MyBean, hoping Spark would read just the parts of the parquet file that are of interest to me.
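Concretely, a minimal sketch of what I run:

Dataset<MyBean> beans = ds.as(Encoders.bean(MyBean.class));
beans.printSchema(); // still lists all 57 columns, not just col1, col2, col3
beans.show();        // likewise prints all 57 columns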
At first I thought as was a transformation rather than an action, but show is definitely an action, and I also tried .cache().count() beforehand just in case.
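In code, that attempt looked like this:

Dataset<MyBean> beans = ds.as(Encoders.bean(MyBean.class));
beans.cache().count(); // force evaluation first, just in case
beans.show();          // unchanged: still all 57 columns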
What am I missing about Encoders here?