
My setup is as follows: I'm retrieving XML files from an FTP server, unmarshalling them into POJOs, mapping those onto Avro-generated classes, and then forwarding them into Alpakka's Producer sink like so:

Ftp.ls("/", ftpSettings)
  .filter(FtpFile::isFile)
  .mapAsyncUnordered(10,
    ftpFile -> {
      CompletionStage<ByteString> fetchFile =
        Ftp.fromPath(ftpFile.path(), ftpSettings)
          // concatenate all emitted chunks; reducing with (a, b) -> a would keep only the first chunk of the file
          .runWith(Sink.reduce(ByteString::concat), materializer);
      return fetchFile;
    })
  .map(b -> b.decodeString(Charsets.ISO_8859_1))
  .map(StringReader::new)
  .map(AlpakkaProducerDemo::unmarshalFile)
  .map(AlpakkaProducerDemo::convertToAvroSerializable)
  .map(a -> new ProducerRecord<Object, Object>(kafkaTopic, a.id().toString(), a))
  // there are no consumer offsets to commit in this pipeline, so plainSink
  // (which takes ProducerRecords directly) fits better than committableSink
  .runWith(Producer.plainSink(producerSettings, kafkaProducer), materializer);

The problem is that the serialization apparently doesn't work properly. E.g. I'd like the key to be Avro-serialized as well, even though it's only a string (a requirement, don't ask). The config for that looks like this:

Map<String, Object> kafkaAvroSerDeConfig = new HashMap<>();
kafkaAvroSerDeConfig.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
final KafkaAvroSerializer keyAvroSerializer = new KafkaAvroSerializer();
keyAvroSerializer.configure(kafkaAvroSerDeConfig, true);    // isKey = true
final Serializer<Object> keySerializer = keyAvroSerializer;
final KafkaAvroSerializer valueAvroSerializer = new KafkaAvroSerializer();
valueAvroSerializer.configure(kafkaAvroSerDeConfig, false); // isKey = false
final Serializer<Object> valueSerializer = valueAvroSerializer;
final Config config = system.settings().config().getConfig("akka.kafka.producer");
final ProducerSettings<Object, Object> producerSettings =
    ProducerSettings.create(config, keySerializer, valueSerializer)
        .withBootstrapServers(kafkaServer);

In Kafka, this results in a key with the correct content, but with some (apparent) extra bytes at the beginning of the string: \u0000\u0000\u0000\u0000\u0001N. As you can imagine, that wreaks havoc with the value as well. I suspect that the Avro serialization doesn't play nice with the envelope API used by Alpakka, so it may be necessary to serialize to a byte[] beforehand and use the plain ByteArraySerializer (sketched below). However, there'd be no real point in using the Schema Registry then.
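For what it's worth, a minimal sketch of that workaround, where MyAvroRecord and toRawAvro are hypothetical stand-ins for the actual Avro-generated class and its encoder, and config and kafkaServer are reused from above. It writes raw Avro binary and hands Kafka plain byte[], bypassing the registry framing entirely:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Plain byte[] on both key and value; Kafka never sees the Avro types.
final ByteArraySerializer byteArraySerializer = new ByteArraySerializer();
final ProducerSettings<byte[], byte[]> rawProducerSettings =
    ProducerSettings.create(config, byteArraySerializer, byteArraySerializer)
        .withBootstrapServers(kafkaServer);

// MyAvroRecord stands in for the actual Avro-generated class.
static byte[] toRawAvro(MyAvroRecord record) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
  // raw Avro binary: no magic byte, no schema ID
  new SpecificDatumWriter<>(MyAvroRecord.class).write(record, encoder);
  encoder.flush();
  return out.toByteArray();
}

The trade-off is exactly the one noted above: without the five-byte header there is no schema ID for consumers to resolve against the Schema Registry.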

styps

1 Answer


The first five bytes are the Confluent wire format: the magic byte identifying the serialisation format version (byte 0, currently always 0), followed by the ID of the Avro schema in the Schema Registry (bytes 1-4): https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format.
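For illustration (not part of the linked docs), a minimal sketch of how to peel that header off a serialised key or value by hand:

import java.nio.ByteBuffer;

// serialisedBytes is a key or value as produced by KafkaAvroSerializer
ByteBuffer buf = ByteBuffer.wrap(serialisedBytes);
byte magicByte = buf.get();   // byte 0: wire-format version, currently always 0
int schemaId = buf.getInt();  // bytes 1-4: schema ID in the Schema Registry
// everything after buf.position() is the plain Avro binary payload;
// e.g. the key above decodes as magic byte 0, schema ID 1, payload "N..."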

Another option would be simply to use Kafka Connect, with an FTP source connector and an XML transform.

Robin Moffatt
  • Thanks for the quick clarification! We're considering Connect, but want to evaluate Alpakka as well. Any obvious reason the bytes would appear in key/value? – styps Jul 05 '19 at 16:15
  • They'll appear because that's the serialisation wire format. If you don't want them…don't use the serialiser :) It's another reason why Kafka Connect is a good option—it just works with stuff like serialisation of input/output. – Robin Moffatt Jul 05 '19 at 17:58
  • Don't know why or how, because I didn't change a thing, but today the magic bytes don't appear and the serialization just works ... – styps Jul 11 '19 at 08:18