I've written a custom Streamsets origin. Some of the records contain characters like é or ë. When running my automated tests I can validate that the data is emitted as a list of SDC Records as intended.

However, when I use my custom origin in a pipeline on a dockerized Streamsets Data Collector, all of those special characters are displayed in the UI (preview) and pushed to my target as '?'.

Is Streamsets interpreting the output of my origin and applying some character encoding?

nielsn

1 Answer

The problem was not in the custom origin or in Streamsets at all; it was an issue with the Docker container itself. The official Streamsets container that I inherit from is based on Alpine Linux. No locale support is installed by default, so the trick is to add it yourself.

This post helped me install locale support in my container and configure it. Afterwards, everything worked as expected.
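
For anyone wondering why a missing locale turns é into '?': this is a minimal sketch of the mechanism, not code from my origin. It assumes the JVM in the container fell back to an ASCII default charset, which is what typically happens on JVMs older than Java 18 when no UTF-8 locale is configured; I did not record which charset was actually in effect. The class name `CharsetCheck` and the helper `roundTrip` are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {

    // Encode the text to bytes and decode it again with the same charset.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for characters it cannot represent; for US-ASCII that byte is '?'.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file itself independent of encoding.
        String sample = "\u00e9 \u00eb"; // é ë

        // Shows which charset the JVM picked up from its environment.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Non-ASCII characters are lost when ASCII is used.
        System.out.println("US-ASCII round trip: " + roundTrip(sample, StandardCharsets.US_ASCII));

        // UTF-8 preserves them.
        System.out.println("UTF-8 round trip:    " + roundTrip(sample, StandardCharsets.UTF_8));
    }
}
```

Running this inside the container before and after adding locale support is a quick way to check what the JVM sees. Any code path that converts between strings and bytes without naming a charset explicitly will use the default one.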

nielsn