
I have a UTF8-mb4 character in MongoDB and a Java extractor which pulls data from Mongo and puts it into Kafka. When the data lands in Kafka, the special character has been replaced with its \u... escape.

Sample text:- "\uDBFF\uDC15COMPANY"

I have another Java program which reads from one Kafka topic and, after some operation, puts the data into another Kafka topic. When the data is read from the source topic, the \u... escape is decoded into the actual special character, and when the data is pushed to the target topic, it shows up as some junk character. How do I put the data back into the target topic as \u...?

The same message in the target topic looks like:

"COMPANY"

Note:-

The message contains a lot of data (JSON data), and a special character could appear in any JSON value.

While reading from the source topic, for the consumer to consume:

key.deserializer = "org.apache.kafka.common.serialization.ByteArrayDeserializer"
value.deserializer = "org.apache.kafka.common.serialization.ByteArrayDeserializer"

For the producer to produce into the target topic:

key.serializer = "org.apache.kafka.common.serialization.ByteArraySerializer"
value.serializer = "org.apache.kafka.common.serialization.ByteArraySerializer"
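For illustration, the re-escaping I'm after could be sketched like this. The `escapeNonAscii` helper is hypothetical, not an existing Kafka or JSON API; it decodes the consumed bytes as UTF-8 and turns every non-ASCII UTF-16 code unit back into a \uXXXX escape:

```java
import java.nio.charset.StandardCharsets;

public class UnicodeEscaper {

    // Hypothetical helper: escape every non-ASCII UTF-16 code unit as \uXXXX,
    // mirroring how the extractor originally wrote "\uDBFF\uDC15" into the topic.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);                              // plain ASCII passes through
            } else {
                sb.append(String.format("\\u%04X", (int) c)); // surrogates included
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes as they would arrive from the source topic (ByteArrayDeserializer).
        byte[] value = "\uDBFF\uDC15COMPANY".getBytes(StandardCharsets.UTF_8);
        String decoded = new String(value, StandardCharsets.UTF_8);
        String escaped = escapeNonAscii(decoded);
        System.out.println(escaped); // \uDBFF\uDC15COMPANY
        // Safe ASCII-only bytes for the target topic.
        byte[] toProduce = escaped.getBytes(StandardCharsets.UTF_8);
    }
}
```

The escaped string contains only ASCII, so serializing it back to UTF-8 bytes for the target topic cannot produce junk characters in any viewer.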
user1578872
  • _have a java extractor which extracts data from mongo and put into kafka_ - Why not use [Debezium](https://debezium.io/documentation/reference/stable/connectors/mongodb.html) for this? – OneCricketeer Aug 01 '22 at 18:42

2 Answers


Since you're using ByteArraySerializer, which preserves the data exactly as written (rather than, say, StringSerializer, which defaults to UTF-8 encoding), the topic holds whatever bytes your consumer handed the producer. The code point in your sample has no printable ASCII symbol, so you end up with junk/replacement characters instead when it is displayed.

The same message in the target topic is like,

"COMPANY"

It's unclear what you're using to view this data, but perhaps the issue lies in the encoding of that program, not Kafka itself, or in your choice of serializer.
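The bytes-vs-display point can be checked with plain JDK code, no Kafka involved (the sample literal is the one from the question):

```java
import java.nio.charset.StandardCharsets;

public class BytesDemo {
    public static void main(String[] args) {
        // The surrogate pair \uDBFF\uDC15 encodes the single code point
        // U+10FC15, which UTF-8 serializes as four bytes.
        String s = "\uDBFF\uDC15COMPANY";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                  // 11: 4 bytes for U+10FC15 + 7 for COMPANY
        System.out.println(s.codePointAt(0) == 0x10FC15); // true
        // Kafka stores exactly these bytes; how they render depends entirely
        // on the charset of whatever displays them (terminal, browser, etc.).
    }
}
```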

OneCricketeer
  • Updated my comments with the serializer being used. – user1578872 Aug 01 '22 at 01:54
  • And what tool are you using to "view" the data? `kafka-console-consumer` will default to StringDeserializer, and then your terminal is configured to use the UTF-8 charset, for example, which is why you'd see characters like `�` – OneCricketeer Aug 01 '22 at 18:01
  • I am using browser to see this text. – user1578872 Aug 01 '22 at 18:28
  • Okay. What charset is used in the HTML? – OneCricketeer Aug 01 '22 at 18:37
  • My source topic has encoded char which i could see in the browser, but the target topic has the jumbled char. – user1578872 Aug 01 '22 at 19:43
  • Again, Kafka stores bytes. There are no "characters" in the topic. Whatever you're using to view the data needs its encoding changed. Or whatever you're using to produce the data needs to parse the data more appropriately (Debezium linked in the comments would do that). More specifically, look up the Unicode values for `DBFF` and `DC15` and you'll see they are UTF-16 surrogates; together they encode U+10FC15, a private-use code point with no printable symbol – OneCricketeer Aug 02 '22 at 13:22

You can try other serializers. I think this would be useful: mongodb
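If you do switch serializers, a minimal sketch of the producer config might look like this (the broker address is an assumption; StringSerializer encodes values as UTF-8, so a value that already contains \uXXXX escapes stays plain ASCII on the wire):

```java
import java.util.Properties;

public class ProducerConfigSketch {

    // Build hypothetical producer properties using StringSerializer
    // instead of ByteArraySerializer.
    static Properties buildProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildProps().getProperty("value.serializer"));
    }
}
```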

Garikina Rajesh