
I am trying to read an Avro file into a DataFrame, but I keep getting:

org.apache.spark.sql.avro.IncompatibleSchemaException: Unsupported type NULL

Since I am going to deploy it on Dataproc, I am using Spark 2.4.0, but the same thing happened with other versions I tried.

These are my dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

My main class:

public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf()
                .setAppName("Example");

        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .getOrCreate();

        Dataset<Row> rowDataset = spark.read().format("avro").load("avro_file");

}

The command I run:

spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 --master local[*] --class MainClass my-spak-app.jar

After a lot of testing I concluded that it happens because my Avro schema contains a field defined with "type": "null". I do not create the files I am working with, so I cannot change the schema. I am able to read the files as an RDD with the newAPIHadoopFile method.

Is there a way to read Avro files with "type": "null" using a DataFrame, or will I have to work with RDDs?

ohaionm
  • Looking at Spark's schema conversion function, it indeed appears that Avro's null type is not supported. One workaround to try would be to skip schema inference by providing the schema manually. – Muton Oct 10 '19 at 10:53
  • Thanks for the answer it worked this way: spark.read().option("avroSchema", schema).format("avro").load("avro_file"); Now I am trying to convert the resulted Dataset to Dataset of my own object. I tried as(Encoders.bean(MyClass.class)) but got: UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema – ohaionm Oct 10 '19 at 12:50

1 Answer


You can specify a schema when you read the file. Create a schema for your file:

val ACCOUNT_schema = StructType(List(
    StructField("XXX", DateType, true),
    StructField("YYY", StringType, true)))

val rowDataset = spark.read.format("avro").schema(ACCOUNT_schema).load("avro_file")

I am not very familiar with Java syntax, but I think you can adapt it.
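In Java, a minimal sketch of the other route the comments mention: passing the file's own Avro schema as a JSON string through the "avroSchema" option, which skips Spark's schema inference entirely. The record and field names below are hypothetical stand-ins; in practice you would use the actual schema embedded in your files.

```java
// Hypothetical Avro schema for the file. The "legacy" field uses
// "type": "null", which is what trips IncompatibleSchemaException
// when Spark tries to infer the schema itself.
String avroSchemaJson =
        "{\"type\": \"record\", \"name\": \"MyRecord\", \"fields\": ["
      + "{\"name\": \"id\", \"type\": \"string\"},"
      + "{\"name\": \"legacy\", \"type\": \"null\"}"
      + "]}";

// Supplying the schema explicitly bypasses inference (this is the call
// the asker confirmed works; it requires a live SparkSession `spark`):
//
// Dataset<Row> rows = spark.read()
//         .format("avro")
//         .option("avroSchema", avroSchemaJson)
//         .load("avro_file");
```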

Saswat
  • Thanks for the answer it worked this way: spark.read().option("avroSchema", schema).format("avro").load("avro_file"); Now I am trying to convert the resulted Dataset to Dataset of my own object. I tried as(Encoders.bean(MyClass.class)) but got: UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema – ohaionm Oct 10 '19 at 12:46
  • Maybe you are affected by https://issues.apache.org/jira/browse/AVRO-695 – can you try upgrading your Avro version? – Aniket Mokashi Oct 10 '19 at 22:20