
I want to read from a Parquet file as from a table using Apache Calcite. There is a bunch of adapters listed in the docs, but no explicit one for Parquet. On the other hand, there is an adapter for Spark, which can deal with Parquet perfectly. But for some reason I can't find any example of how to use this Spark adapter. Even after reading its code I can't say I understand how I am supposed to define Spark schemas; there is no factory for them... I've tried the following code without really understanding how it should work, and it obviously doesn't work:

        // Standard Calcite JDBC bootstrap, with the Spark flag enabled
        Class.forName("org.apache.calcite.jdbc.Driver");
        Properties info = new Properties();
        info.setProperty("lex", "JAVA");
        info.setProperty("spark", "true");

        Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);
        SchemaPlus rootSchema = calciteConnection.getRootSchema();

        // Spark side: read the Parquet file and register it as a temp view
        // (this lands in Spark's own catalog, not in Calcite's rootSchema)
        SparkSession spark = SparkSession.builder()
                .appName("test")
                .master("local[1]")
                .getOrCreate();
        StopWatch w = StopWatch.createStarted();
        Dataset<Row> ds = spark.read().parquet("/tmp/test.parquet");
        ds.select("issue_desc", "valid_from_dttm").show(15);
        ds.printSchema();
        ds.createTempView("sparkTable");
        System.out.println(w.getTime(TimeUnit.MILLISECONDS));

        // Calcite side: try to scan the Spark view through RelBuilder
        FrameworkConfig calciteConfig = Frameworks.newConfigBuilder()
                .parserConfig(SqlParser.Config.DEFAULT)
                .defaultSchema(rootSchema)
                .programs()
                .traitDefs(ConventionTraitDef.INSTANCE, RelDistributionTraitDef.INSTANCE)
                .build();

        RelBuilder builder = RelBuilder.create(calciteConfig);
        RelRunner relRunner = calciteConnection.unwrap(RelRunner.class);

        RelNode test1 = builder
                .scan("sparkTable")
                .build();
        executeNode(relRunner, test1); // small helper of mine that runs the node via RelRunner

It simply fails with the exception:

    Exception in thread "main" org.apache.calcite.runtime.CalciteException: Table 'sparkTable' not found
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:506)
    at org.apache.calcite.runtime.Resources$ExInst.ex(Resources.java:600)
    at org.apache.calcite.tools.RelBuilder.scan(RelBuilder.java:1238)
    at org.apache.calcite.tools.RelBuilder.scan(RelBuilder.java:1265)
    at ru.tinkoff.dwh.hercule.demo.SparkTest.main(SparkTest.java:64)

Could somebody please explain how to use it, share an example, or explain why I can't use the Spark adapter like this?
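For reference, the pattern I expected is the one other adapters use: a model JSON that names a `SchemaFactory` class, like this CSV example adapted from the Calcite tutorial (the `directory` operand is specific to the CSV adapter). I can't find any analogous factory class in the Spark adapter:

```json
{
  "version": "1.0",
  "defaultSchema": "SALES",
  "schemas": [
    {
      "name": "SALES",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
      "operand": {
        "directory": "sales"
      }
    }
  ]
}
```

Such a model is then passed on the connection URL, e.g. `jdbc:calcite:model=/path/to/model.json`.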

kalmar
  • Apache Drill is a distributed MPP query layer for self describing data like Parquet files. @karmal, have you tried this approach ? – João Paraná May 12 '22 at 18:22
  • Drill uses Calcite to parse the queries. See [https://www.xenonstack.com/blog/apache-drill-architecture](https://www.xenonstack.com/blog/apache-drill-architecture) – João Paraná May 12 '22 at 18:33

0 Answers