
I am trying to read an Avro file into a DataFrame, but I keep getting:

org.apache.spark.sql.avro.IncompatibleSchemaException: Unsupported type NULL

Since I am going to deploy it on Dataproc, I am using Spark 2.4.0, but the same thing happened with other versions I tried.

These are my dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

My main class:

public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf()
                .setAppName("Example");

        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL basic example")
                .getOrCreate();

        Dataset<Row> rowDataset = spark.read().format("avro").load("avro_file");

}

The command I run:

spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 --master local[*] --class MainClass my-spak-app.jar

After a lot of testing I concluded that it happens because my Avro schema contains a field defined with "type": "null". I do not create the files I am working with, so I cannot change the schema. I am able to read the files as an RDD with the newAPIHadoopFile method.

Is there a way to read Avro files with "type": "null" using a DataFrame, or will I have to work with RDDs?

ohaionm
  • Looking at Spark's schema conversion function, it indeed appears that Avro's null type is not supported. One workaround to try would be to skip schema inference by providing the schema manually. – Muton Oct 10 '19 at 10:53
  • Thanks for the answer it worked this way: spark.read().option("avroSchema", schema).format("avro").load("avro_file"); Now I am trying to convert the resulted Dataset to Dataset of my own object. I tried as(Encoders.bean(MyClass.class)) but got: UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema – ohaionm Oct 10 '19 at 12:50

1 Answer


You can specify a schema when you read the file. Create a schema for your file:

val ACCOUNT_schema = StructType(List(
    StructField("XXX", DateType, true),
    StructField("YYY", StringType, true)))

val rowDataset = spark.read.format("avro").schema(ACCOUNT_schema).load("avro_file")

I am not very familiar with Java syntax, but I think you can adapt it.
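In Java, a minimal sketch of the other route the comments mention: passing the file's own Avro schema as a JSON string through the "avroSchema" option, which skips Spark's schema inference entirely. The record and field names below are hypothetical stand-ins; in practice you would use the actual schema embedded in your files.

```java
// Hypothetical Avro schema for the file. The "legacy" field uses
// "type": "null", which is what trips IncompatibleSchemaException
// when Spark tries to infer the schema itself.
String avroSchemaJson =
        "{\"type\": \"record\", \"name\": \"MyRecord\", \"fields\": ["
      + "{\"name\": \"id\", \"type\": \"string\"},"
      + "{\"name\": \"legacy\", \"type\": \"null\"}"
      + "]}";

// Supplying the schema explicitly bypasses inference (this is the call
// the asker confirmed works; it requires a live SparkSession `spark`):
//
// Dataset<Row> rows = spark.read()
//         .format("avro")
//         .option("avroSchema", avroSchemaJson)
//         .load("avro_file");
```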

Saswat
  • Thanks for the answer it worked this way: spark.read().option("avroSchema", schema).format("avro").load("avro_file"); Now I am trying to convert the resulted Dataset to Dataset of my own object. I tried as(Encoders.bean(MyClass.class)) but got: UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class class org.apache.avro.Schema – ohaionm Oct 10 '19 at 12:46
  • Maybe you are affected by https://issues.apache.org/jira/browse/AVRO-695 – can you try upgrading your Avro version? – Aniket Mokashi Oct 10 '19 at 22:20