
I want to store data from Kafka in an S3 bucket using Kafka Connect. I already have a Kafka topic running and an S3 bucket created. The topic contains Protobuf data. I tried https://github.com/qubole/streamx and got the following error:

 [2018-10-04 13:35:46,512] INFO Revoking previously assigned partitions [] for group connect-s3-sink (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:280)
 [2018-10-04 13:35:46,512] INFO (Re-)joining group connect-s3-sink (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:326)
 [2018-10-04 13:35:46,645] INFO Successfully joined group connect-s3-sink with generation 1 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:434)
 [2018-10-04 13:35:46,692] INFO Setting newly assigned partitions [ssp.impressions-11, ssp.impressions-10, ssp.impressions-7, ssp.impressions-6, ssp.impressions-9, ssp.impressions-8, ssp.impressions-3, ssp.impressions-2, ssp.impressions-5, ssp.impressions-4, ssp.impressions-1, ssp.impressions-0] for Group connect-s3-sink(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:219)
 [2018-10-04 13:35:47,193] ERROR Task s3-sink-0 threw an uncaught an unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
 java.lang.NullPointerException
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-04 13:35:47,194] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-04 13:35:51,235] INFO Reflections took 6844 ms to scan 259 urls, producing 13517 keys and 95788 values (org.reflections.Reflections:229)

These are the steps I followed:

  1. I cloned the repository.
  2. mvn -DskipTests package
  3. nano config/connect-standalone.properties

    bootstrap.servers=ip-myip.ec2.internal:9092
    key.converter=com.qubole.streamx.ByteArrayConverter
    value.converter=com.qubole.streamx.ByteArrayConverter
    
  4. nano config/quickstart-s3.properties

    name=s3-sink 
    connector.class=com.qubole.streamx.s3.S3SinkConnector
    format.class=com.qubole.streamx.SourceFormat
    tasks.max=1
    topics=ssp.impressions
    flush.size=3
    s3.url=s3://myaccess_key:mysecret_key@mybucket/demo
    
  5. connect-standalone /etc/kafka/connect-standalone.properties quickstart-s3.properties

I would like to know whether what I did is correct, or if there is a better way to get data from Kafka into S3.

Eric Bellet
  • What tutorial did you find? Any new Kafka consumer can be configured to read from beginning offset for existing data – OneCricketeer Oct 03 '18 at 13:44
  • 1
    You should indicate in your question text if you fundamentally reword it to ask a different question. My answer stands to your original question of "how to connect to Apache Kafka to S3". – Robin Moffatt Oct 04 '18 at 16:06

2 Answers


You can use Kafka Connect to do this integration, with the Kafka Connect S3 connector.

Kafka Connect is part of Apache Kafka, and the S3 connector is an open-source connector available either standalone or as part of Confluent Platform.
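
As a rough illustration, a sink configuration for that connector might look like the following (a sketch, not a drop-in config: the bucket name and region are placeholders, and raw Protobuf payloads can be passed through as bytes via the ByteArrayFormat; AWS credentials are normally picked up from the standard provider chain rather than embedded in a URL as streamx does):

    name=s3-sink
    connector.class=io.confluent.connect.s3.S3SinkConnector
    tasks.max=1
    topics=ssp.impressions
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.bytearray.ByteArrayFormat
    # placeholders - set these to your own bucket and region
    s3.bucket.name=mybucket
    s3.region=us-east-1
    flush.size=3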

For general information and examples of Kafka Connect, this series of articles might help.

Disclaimer: I work for Confluent, and wrote the above blog articles.


April 2020: I have recorded a video showing how to use the S3 sink: https://rmoff.dev/kafka-s3-video

Robin Moffatt
  • I want to use Kafka Connect, but my data is in Protobuf; could that be a problem? What should I set for key.converter, value.converter, key.converter.schemas.enable, value.converter.schemas.enable, internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, and internal.value.converter.schemas.enable? – Eric Bellet Oct 04 '18 at 13:50
  • 1
    Yes there is an open-source protobuf converter for Kafka Connect that you can use: https://www.confluent.io/connector/kafka-connect-protobuf-converter/ – Robin Moffatt Oct 04 '18 at 16:07
  • I have a few more questions. Can I connect to my Kafka cluster from another Kafka instance and run my S3 connector there in standalone mode? What does the error "ERROR Task s3-sink-0 threw an uncaught an unrecoverable exception" mean? Could you summarize the steps to connect to Kafka and store data in S3 from another Kafka instance? – Eric Bellet Oct 05 '18 at 08:32
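
For example, with the Blue Apron converter linked in the comments above, the connector config would point the value converter at the converter class and name the generated Protobuf class (a sketch based on that project's README; com.example.ImpressionProto is a hypothetical generated class, so substitute your own):

    value.converter=com.blueapron.connect.protobuf.ProtobufConverter
    # com.example.ImpressionProto is a hypothetical generated Protobuf class
    value.converter.protoClassName=com.example.ImpressionProto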

Another way would be to write a consumer that rotates its output into local log files, and then use cron to ship those files to S3.
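
A minimal sketch of that approach, assuming the kafka-python client; the topic and broker address come from the question, the spool directory is illustrative, and real Protobuf records would need per-record length framing to be recoverable from the files:

    import os
    import time
    from kafka import KafkaConsumer  # pip install kafka-python

    os.makedirs("/var/spool/kafka", exist_ok=True)

    consumer = KafkaConsumer(
        "ssp.impressions",                              # topic from the question
        bootstrap_servers="ip-myip.ec2.internal:9092",  # broker from the question
    )

    current_hour, out = None, None
    for msg in consumer:
        hour = time.strftime("%Y%m%d%H")
        if hour != current_hour:   # hourly rotation
            if out:
                out.close()        # closed files are ready for upload
            out = open("/var/spool/kafka/impressions-%s.bin" % hour, "ab")
            current_hour = hour
        out.write(msg.value)       # raw bytes; real Protobuf needs per-record framing

A cron entry such as `0 * * * * aws s3 sync /var/spool/kafka s3://mybucket/demo/` could then upload the rotated files, though note that the Kafka Connect route above handles offset tracking and retries for you.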

Navin Kumar