
I need to convert the following to a Spark DataFrame in Java, preserving the structure according to the Avro schema, and then write it to S3 based on that Avro structure.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

// inAvroSchema is assumed to have fields: id (string), cnt (int), type (the enum below)
GenericRecord r = new GenericData.Record(inAvroSchema);
r.put("id", "1");
r.put("cnt", 111);

Schema enumTest =
        SchemaBuilder.enumeration("name1")
                .namespace("com.name")
                .symbols("s1", "s2");

GenericData.EnumSymbol symbol = new GenericData.EnumSymbol(enumTest, "s1");
r.put("type", symbol);

// Serialize the record as Avro JSON into an in-memory buffer
ByteArrayOutputStream bao = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> w = new GenericDatumWriter<>(inAvroSchema);
Encoder e = EncoderFactory.get().jsonEncoder(inAvroSchema, bao);
w.write(r, e);
e.flush();

I can read the object back from the JSON structure:

  GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(inAvroSchema);
  Object o = reader.read(null, DecoderFactory.get().jsonDecoder(inAvroSchema, new ByteArrayInputStream(bao.toByteArray())));

But is there any way to create a DataFrame directly from new ByteArrayInputStream(bao.toByteArray())?

Thanks

1 Answer


No, you have to use a data source to read Avro data, and it is crucial for Spark to read Avro as files from a filesystem, because many optimizations and features depend on that (such as compression and partitioning). You have to add the spark-avro module (a separate Databricks package before Spark 2.4, bundled with Spark as an external module from 2.4 on). Note that the enum type you are using will become a String in Spark's Dataset.
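A minimal sketch of that route, assuming Spark 2.4+ with the spark-avro module on the classpath (the s3a:// paths and the app name are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("avro-to-s3").getOrCreate();

// Read Avro files through the spark-avro data source
Dataset<Row> df = spark.read().format("avro").load("s3a://bucket/in/");

// ... spark.sql aggregations on df ...

// Write back, pinning the output to the original Avro schema; without the
// "avroSchema" option spark-avro derives a schema from the Dataset, so the
// enum column would come out as a plain string
df.write()
        .format("avro")
        .option("avroSchema", inAvroSchema.toString())
        .save("s3a://bucket/out/");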

Also see this: Spark: Read an inputStream instead of File

Alternatively, you can deploy a bunch of tasks with SparkContext#parallelize and read/write the files explicitly with DatumReader/DatumWriter, as sketched below.
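A hypothetical sketch of that alternative, reusing the SparkSession from the sketch above: Spark only schedules the work, and each task writes its own Avro container file with DatumWriter (the work units, local output paths, and the final S3 upload step are all placeholders):

import java.io.File;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

// Schema is not serializable in older Avro versions, so ship its JSON form
String schemaJson = inAvroSchema.toString();

jsc.parallelize(Arrays.asList(0, 1, 2)).foreach(part -> {   // placeholder work units
    Schema schema = new Schema.Parser().parse(schemaJson);
    GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
        fileWriter.create(schema, new File("/tmp/part-" + part + ".avro"));
        // build the GenericRecords for this task and fileWriter.append(record) them
    }
    // shipping /tmp/part-*.avro to S3 would need an S3 client here (e.g. AWS SDK)
});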

andreoss
  • Thanks @andreoss. Before this part of the code I plan to read Avro data and do some aggregation with spark.sql. The main idea is to apply the Avro schema and then write it down to S3. The "avroSchema" option does not work when I write to S3, and if I write the aggregated DataFrame directly there is an issue with the enum value, because it gets saved to S3 as a String. The only way I found to apply the Avro schema is the script above. I can save it to an Avro file locally... but it does not work with S3... – Sergii Chukhno Jul 02 '20 at 05:23
  • You can work with the files directly and still use Spark to schedule the work, but then you won't have a Dataset – andreoss Jul 02 '20 at 05:34