
I was watching one of Robin Moffatt's videos (https://rmoff.net/2020/06/17/loading-csv-data-into-kafka/) and believe Apache Kafka might help me automate a workflow I have.

I have a requirement where I need to ingest a CSV from a customer, send a subset of the original information to 2 vendors in various formats (text or CSV), receive data back from those vendors, and then merge all of the data.

I'm somewhat of a newbie to Kafka but was thinking I'd have a process as follows:

Ingest data from the customer into Kafka and save it to either a SQL Server or Postgres database. I will then publish 2 "we have data" streams. Each stream would essentially have a single row that represents the batch we received from the customer. These streams, which are topics, will be consumed by a KafkaJS consumer. Using information in the message, these consumers will select data out of the database based on the output required for that vendor.
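For what it's worth, the "we have data" announcement step could look something like the sketch below, assuming a `batchId` is assigned when the CSV lands in the database; the topic names (`vendor-a-batch-ready` etc.) and broker address are placeholders, not anything from the question:

```javascript
// Pure helper: one small message describing the ingested batch, keyed by
// batch id so all events for a batch land on the same partition.
function buildBatchMessage(batchId, vendor, rowCount) {
  return {
    key: String(batchId),
    value: JSON.stringify({
      batchId,
      vendor,
      rowCount,
      receivedAt: new Date().toISOString(),
    }),
  };
}

// Publishing side (needs a running broker; kafkajs is required lazily so the
// helper above stays usable without the package installed).
async function announceBatch(batchId, rowCount) {
  const { Kafka } = require('kafkajs');
  const kafka = new Kafka({ clientId: 'ingest-service', brokers: ['localhost:9092'] });
  const producer = kafka.producer();
  await producer.connect();
  for (const vendor of ['vendor-a', 'vendor-b']) {
    await producer.send({
      topic: `${vendor}-batch-ready`, // hypothetical topic naming
      messages: [buildBatchMessage(batchId, vendor, rowCount)],
    });
  }
  await producer.disconnect();
}
```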

At this point in the process we are expecting 2 responses. As each response comes in (via SFTP) we will ingest the response file (JSON or CSV) into the db like we did with the original customer information. Once we have received all of the data we will publish another message, which will be consumed by the consumer that merges all of the data.
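The "have we heard from everyone?" check described above could be sketched as follows. This assumes each ingested vendor response tells you which vendor it came from; the vendor list, the `merge-ready` topic name, and the in-memory map (which in practice should live in the database so it survives restarts) are all illustrative assumptions:

```javascript
const EXPECTED_VENDORS = ['vendor-a', 'vendor-b']; // placeholder names

// Pure helper: true once every expected vendor has responded for a batch.
function allResponsesReceived(expectedVendors, receivedVendors) {
  return expectedVendors.every((vendor) => receivedVendors.has(vendor));
}

// Track responses per batch; publish the merge trigger once complete.
const responsesByBatch = new Map();

async function onVendorResponseIngested(producer, batchId, vendor) {
  const received = responsesByBatch.get(batchId) || new Set();
  received.add(vendor);
  responsesByBatch.set(batchId, received);

  if (allResponsesReceived(EXPECTED_VENDORS, received)) {
    await producer.send({
      topic: 'merge-ready', // hypothetical topic consumed by the merge step
      messages: [{ key: String(batchId), value: JSON.stringify({ batchId }) }],
    });
  }
}
```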

Do any of the Kafka ninjas like Robin have any suggestions? Much appreciated.

GD

Greg
  • One of my questions is how/where do I publish the message that I've received a new batch once all records have been ingested. – Greg May 03 '21 at 19:42
  • Assuming you actually need Kafka, why not load the CSV into a database, then use Debezium to get that into Kafka? – OneCricketeer May 04 '21 at 12:16
  • @OneCricketeer - with kafka I have an "easy" button to get the csv into the db. I really just need to ingest the data and then kick off tasks once it's ingested. – Greg May 05 '21 at 06:56
  • Sure. Still doesn't answer my question - why not use Debezium to poll the database to a Kafka topic – OneCricketeer May 05 '21 at 13:04
  • The process starts with a csv. We ingest the csv into a db using kafka. Would rather kick off an event when we receive the csv as opposed to polling a db. – Greg May 06 '21 at 06:25
  • Debezium isn't polling, it reads the binlogs of the database, which are real-time on insert/create, delete, or update events. Therefore it does exactly what you want - "kick off event when csv is inserted"... Maybe you can try it and tell us that it doesn't do what you want rather than try to find alternatives that don't fit into existing Kafka tooling? Any other solution you'll get will be polling some file system, for example https://github.com/jcustenborder/kafka-connect-spooldir – OneCricketeer May 06 '21 at 12:22
  • Or reading from SFTP https://camel.apache.org/camel-kafka-connector/latest/connectors/camel-sftp-kafka-source-connector.html – OneCricketeer May 06 '21 at 12:30

1 Answer


The most scalable way is probably to create a read stream of the CSV file and, on each chunk of the read stream (think of it as iterating the rows of the file), produce a message via KafkaJS.

https://www.digitalocean.com/community/tutorials/how-to-read-and-write-csv-files-in-node-js-using-node-csv This article shows the streaming approach. The `.on('data')` section is where you work with the stream: manipulating it, saving it to the db, and producing it to Kafka are all valid.

For the correct setup for Kafka, I'd choose a good library. For Node.js, that's kafkajs >= 2.0.0 (versions below had concurrency problems).

By initializing a consumer and a producer, you will effectively have created the Kafka infrastructure inside the service, which serves a messaging-based microservices architecture well.

The concepts I laid out here have great tutorials; you just have to use the right libraries and understand how the streams work with your code logic. I've merely provided a recipe.

Hope it helps!

Victor P.