
I've always wondered whether having a Dataset of a parameterised/generic class is possible in Java. To be clearer, what I am looking to achieve is something like this:

Dataset<MyClass<Integer>> myClassInteger;
Dataset<MyClass<String>> myClassString;

Please let me know if this is possible. If you could also show me how to achieve this, I would be very appreciative. Thanks!

2 Answers


Sorry this question is old, but I wanted to put some notes down, since I was able to work with generic/parameterized classes for Datasets in Java. I did it by creating a generic class that takes a type parameter and putting the methods inside that parameterized class, i.e. class MyClassProcessor<T1>, where T1 could be Integer or String.

Unfortunately, you will not enjoy the full benefits of generic types in this case, and you will have to apply some workarounds:

  • I had to use Encoders.kryo(); otherwise the generic types became Object during some operations and could no longer be cast to the generic type correctly (see the sketch after this list).
    • This introduces some other annoyances, e.g. you can't join directly. I had to use tricks like wrapping values in Tuples to make some join operations possible.
  • I haven't tried reading generic types directly; my parameterized classes were introduced later using map. For example, I read a Dataset of TypeA and only later worked with Dataset<MyClass>.
  • I was able to use more complex custom types in the generics, not just Integer, String, etc.
  • There were some annoying details, like having to pass along Class literals (i.e. TypeA.class) and using raw types for certain map functions, etc.
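
For illustration, here is a minimal sketch of this approach, assuming a hypothetical generic container MyClass<T>; the class, its fields, and the sample data are made up, and only the Encoders.kryo() workaround comes from the notes above:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;
import java.util.Arrays;

public class GenericDatasetSketch {
    // Hypothetical generic payload class (not from the original answer)
    public static class MyClass<T> implements Serializable {
        public T value;
        public MyClass() { }
        public MyClass(T value) { this.value = value; }
    }

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("GenericDatasetSketch")
                .master("local[2]")
                .getOrCreate();

        // Kryo serializes the whole object into a single binary column, so you lose
        // columnar features (and easy joins), but the generic type parameter survives.
        Encoder<MyClass<Integer>> encoder =
                Encoders.kryo((Class<MyClass<Integer>>) (Class<?>) MyClass.class);

        Dataset<MyClass<Integer>> ds = spark.createDataset(
                Arrays.asList(new MyClass<>(1), new MyClass<>(2), new MyClass<>(3)),
                encoder);

        // map keeps the parameterized type as long as the matching encoder is passed along
        Dataset<MyClass<Integer>> doubled = ds.map(
                (MapFunction<MyClass<Integer>, MyClass<Integer>>) m -> new MyClass<>(m.value * 2),
                encoder);

        doubled.show(); // prints a single binary "value" column because of the kryo encoder
        spark.stop();
    }
}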
Hyo Byun

Yes, you can have a Dataset of your own class. It would look like Dataset<MyOwnClass>.

In the code below I read the contents of a file and put them into a Dataset of the class that we have created. Please check the snippet below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.io.Serializable;

public class FileDataset {
    public static class Employee implements Serializable {
        public int key;
        public int value;
    }

    public static void main(String[] args) {
        // configure spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Reading JSON File into DataSet")
                .master("local[2]")
                .getOrCreate();

        final Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);

        final String jsonPath = "/Users/ajaychoudhary/Documents/student.txt";

        // read the JSON-lines file into a Dataset<Employee>; Spark parses it as JSON regardless of the .txt extension
        Dataset<Employee> ds = spark.read()
                .json(jsonPath)
                .as(employeeEncoder);
        ds.show();
    }
}

The content of my student.txt file is

{ "key": 1, "value": 2 }
{ "key": 3, "value": 4 }
{ "key": 5, "value": 6 }

It produces the following output on the console:

+---+-----+
|key|value|
+---+-----+
|  1|    2|
|  3|    4|
|  5|    6|
+---+-----+

I hope this gives you an initial idea of how you can have the dataset of your own custom class.

Ajay Kr Choudhary
  • I know I can have a Dataset of my own class. The question was whether you can have a Dataset of a parameterised/generic class. For example, let's say that your Employee class has a field "idNumber" which can be either of type String or Long. In this case you can parameterise the class like so: class Employee<T> {}. Afterwards you can declare objects depending on your needs. E.g. Employee<Long> employee1 = new Employee<>(1,2,3L); Employee<String> employee2 = new Employee<>(1,2,"A1B2C3");. Please tell me if I made myself clear. – Bîrsan Octav Sep 02 '20 at 08:00
  • This requirement is not very clear from the question, or at least I did not read it that way. Please edit your question with an example; someone can help with that. I do not know how to do that. – Ajay Kr Choudhary Sep 02 '20 at 08:42
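
For reference, here is a minimal sketch of the parameterised Employee class described in the comment above; only the idNumber field comes from the comment, the remaining field names and the constructor are assumptions:

import java.io.Serializable;

// Sketch of a parameterised Employee; only idNumber comes from the comment above,
// the other fields are assumed for illustration.
public class Employee<T> implements Serializable {
    public int departmentId;
    public int salary;
    public T idNumber; // may be Long, String, or any other type

    public Employee() { }

    public Employee(int departmentId, int salary, T idNumber) {
        this.departmentId = departmentId;
        this.salary = salary;
        this.idNumber = idNumber;
    }
}

// Usage, mirroring the comment:
// Employee<Long> employee1 = new Employee<>(1, 2, 3L);
// Employee<String> employee2 = new Employee<>(1, 2, "A1B2C3");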