
I read a parquet file and I get a Dataset containing 57 columns.

Dataset<Row> ds = spark.read().parquet(locations);

I would like to use a custom type instead of Row. I have defined a Java bean such as

import lombok.Getter;
import lombok.Setter;

import java.io.Serializable;

@Getter
@Setter
public class MyBean implements Serializable {
    MyBean() { }

    private String col1;
    private String col2;
    private String col3;
}

The field names perfectly match the column names in the parquet file (which is annoying because they are in snake_case and I would like camelCase in my POJO, but that's not the main issue here).

To convert Dataset<Row> to Dataset<MyBean> I use ds.as(Encoders.bean(MyBean.class)).

However, if I try to show the resulting Dataset<MyBean>, it still has all 57 columns. I was expecting 3 columns corresponding to MyBean, hoping it would only read the pieces of the parquet that are of interest to me.

At first I thought as was a transformation and not an action, but show is definitely an action, and I also tried .cache().count() beforehand just in case.
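For reference, the full sequence I am running looks roughly like this (same ds and locations as above):

// Read the parquet file(s); the resulting Dataset<Row> has 57 columns.
Dataset<Row> ds = spark.read().parquet(locations);

// Convert to the bean type; I expected only col1, col2 and col3 to survive.
Dataset<MyBean> beans = ds.as(Encoders.bean(MyBean.class));

// Still prints all 57 columns, even after beans.cache().count().
beans.show();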

What am I missing about Encoders here?

Calimero

2 Answers


You didn't misunderstand or miss anything about Encoders, except perhaps that they, like Datasets, are also lazily evaluated. If you manually create the setters and put breakpoints on them, they will not be called with a simple as().show().

In fact you can just use read.as(beanEncoder).map(identity, beanEncoder).show and see that it's only using those columns.
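In Java that round trip looks roughly like the sketch below (beanEncoder and ds come from the question; the lambda is a plain identity function whose only purpose is to force the bean encoder to run):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

Encoder<MyBean> beanEncoder = Encoders.bean(MyBean.class);

// The identity map changes nothing in the data, but it forces Spark to
// decode each row into MyBean and re-encode it, so the plan only needs
// col1, col2 and col3.
Dataset<MyBean> beans = ds
        .as(beanEncoder)
        .map((MapFunction<MyBean, MyBean>) b -> b, beanEncoder);

beans.show(); // only the three bean columns appear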

Similar to Oli's point, if you call .explain on the result of just as (with no select to reduce fields), you'll see all the fields mentioned in the LocalRelation or Scan results. Add the map, explain again, and you'll see only the fields in the bean.
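A quick way to compare the two plans (same variables as the previous sketch; the exact plan text varies with the Spark version):

// Plan for as() alone: the Scan / ReadSchema still lists all 57 columns.
ds.as(beanEncoder).explain();

// Plan with the identity map: only the bean fields remain in the scan.
ds.as(beanEncoder)
  .map((MapFunction<MyBean, MyBean>) b -> b, beanEncoder)
  .explain();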

Using a dataset, even after using .as, will still let you select fields that aren't in your bean; it only really becomes just the bean after something forces the read structure to change.

Chris
  • Sorry, I didn't understand what you meant with .map(identity, beanEncoder). What would "identity" be here? Judging by the function signature, it's a function taking the bean type as input and output? If Encoders are lazy, why didn't a count() solve the issue? Is it a different "moment of laziness" than the Dataset itself? – Calimero Aug 31 '23 at 08:30
  • Identity functions are just a -> a; Spark has no way to generically know you didn't change anything, so it invokes the decode and encode logic. count() doesn't need to do anything with an encoder, so it fits the lazy definition. It's an extra level of lazy on top of the dataset, as you saw with count. – Chris Aug 31 '23 at 09:03
  • I tinkered a bit with the idea and figured it out, thanks a lot – Calimero Aug 31 '23 at 09:22

If you use select after spark.read.parquet, only the selected columns will be read, not the others.

Therefore, although it is a bit more verbose,

ds.select("col1", "col2", "col3").as(Encoders.bean(MyBean.class))

should achieve what you want.

Feel free to try ds.select("col1", "col2", "col3").explain() and check that the ReadSchema only contains the selected columns. Note that the same goes for filters. If you filter just after reading a parquet file, the filter will be "pushed" and only the selected rows will be read.
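For example (the equality filter on col1 is just for illustration; look for ReadSchema and PushedFilters in the physical plan):

import static org.apache.spark.sql.functions.col;

// ReadSchema only contains col1, col2 and col3.
ds.select("col1", "col2", "col3").explain();

// The filter appears under PushedFilters, so parquet can skip row groups
// that cannot possibly match before the rows ever reach Spark.
ds.select("col1", "col2", "col3")
  .filter(col("col1").equalTo("someValue"))
  .explain();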

Oli
  • I was thinking that the bean is already a definition of the schema, and writing a select would duplicate information that then has to be kept in sync, so I was expecting Encoders to solve that. Would it be sound to extract the schema from the bean, e.g. `Arrays.stream(MyBean.class.getDeclaredFields()).map(x -> String.format("%s %s", x.getName(), x.getType())).collect(Collectors.joining(","));`, and use it as the schema argument of the load method of a `DataFrameReader`? – Calimero Aug 30 '23 at 07:35
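As a hedged sketch of that idea (not from the thread): instead of building a DDL string by hand via reflection, the bean encoder's own schema can be handed to the reader, assuming the parquet column names match the bean field names.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.types.StructType;

Encoder<MyBean> beanEncoder = Encoders.bean(MyBean.class);

// The encoder already carries a StructType derived from the bean's fields,
// so there is no hand-written column list to keep in sync.
StructType beanSchema = beanEncoder.schema();

// Supplying that schema to the reader restricts the read to the bean's
// columns (sketch only; assumes column names and bean field names match).
Dataset<MyBean> beans = spark.read()
        .schema(beanSchema)
        .parquet(locations)
        .as(beanEncoder);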