I want to store all the data from a Kafka topic into Amazon S3. I have a Kafka cluster in which one topic receives 200,000 messages per second, and each message value has 50 fields (strings, timestamps, integers, and floats).
My plan is to use Kafka Connect to store the data in an S3 bucket, and then use AWS Glue to transform the data and write it into another bucket. I have the following questions:
1) How should I do this? Will that architecture work well? I tried Amazon EMR (Spark Streaming), but I had too many concerns (see "How to decrease the processing time and failed tasks using Apache Spark for events streaming from Apache Kafka?").
2) I tried to use Kafka Connect from Confluent, but I have a few questions:
- Can I connect to my Kafka cluster from another instance and run my S3 sink connector there in standalone mode? (See the configuration sketch after this list.)
- What does this error mean: "ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception"? The full log output is:
```
ERROR Task s3-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:122)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:290)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:421)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:26,086] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
[2018-10-05 15:32:27,980] WARN could not create Dir using directory from url file:/targ. skipping. (org.reflections.Reflections:104)
java.lang.NullPointerException
    at org.reflections.vfs.Vfs$DefaultUrlTypes$3.matches(Vfs.java:239)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:98)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:27,981] WARN could not create Vfs.Dir from url. ignoring the exception and continuing (org.reflections.Reflections:208)
org.reflections.ReflectionsException: could not create Vfs.Dir from url, no matching UrlType was found [file:/targ]
either use fromURL(final URL url, final List<UrlType> urlTypes) or use the static setDefaultURLTypes(final List<UrlType> urlTypes) or addDefaultURLTypes(UrlType urlType) with your specialized UrlType.
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:109)
    at org.reflections.vfs.Vfs.fromURL(Vfs.java:91)
    at org.reflections.Reflections.scan(Reflections.java:237)
    at org.reflections.Reflections.scan(Reflections.java:204)
    at org.reflections.Reflections.<init>(Reflections.java:129)
    at org.apache.kafka.connect.runtime.AbstractHerder.connectorPlugins(AbstractHerder.java:268)
    at org.apache.kafka.connect.runtime.AbstractHerder$1.run(AbstractHerder.java:377)
    at java.lang.Thread.run(Thread.java:745)
[2018-10-05 15:32:35,441] INFO Reflections took 12393 ms to scan 429 urls, producing 13521 keys and 95814 values (org.reflections.Reflections:229)
```
- If you could summarize the steps to connect to Kafka and store the data on S3 from another Kafka instance, how would you do it?
- What do all of these fields mean: key.converter, value.converter, key.converter.schemas.enable, value.converter.schemas.enable, internal.key.converter, internal.value.converter, internal.key.converter.schemas.enable, internal.value.converter.schemas.enable?
- What are the possible values for key.converter and value.converter?
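For reference, this is roughly the standalone setup I am experimenting with, based on my reading of the Confluent S3 sink documentation. The broker host names, bucket name, region, plugin path, and converter choices are placeholders and may well be wrong, which is part of what I am asking about:

```properties
# worker.properties -- standalone Connect worker pointing at the remote Kafka cluster
bootstrap.servers=broker1.mydomain:9092,broker2.mydomain:9092
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/usr/share/java

# Converters applied to the record keys/values read from the topic
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

# Converters Connect uses for its own internal offset/config data
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
```

```properties
# s3-sink.properties -- the S3 sink connector itself
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=4
topics=my-topic
s3.bucket.name=my-raw-bucket
s3.region=us-east-1
flush.size=10000
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.bytearray.ByteArrayFormat
```

I would start it with something like `connect-standalone worker.properties s3-sink.properties`.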
3) Once my raw data is in a bucket, I would like to use AWS Glue to take this data, deserialize the Protobuf payloads, change the format of some fields, and finally store the result in another bucket in Parquet. How can I use my own Java Protobuf library in AWS Glue? (A rough sketch of the job I have in mind is below.)
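To make this concrete, here is the rough shape of the job I have in mind, written as a Glue PySpark script that assumes Python bindings generated from the same .proto. The bucket names, `MyRecord`, and the whole decoding step are placeholders; what I do not know is whether I can plug my existing Java library in instead of regenerating Python bindings:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard Glue PySpark job bootstrap
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

def decode_records(file_bytes):
    # Placeholder: split one raw S3 object into individual Protobuf messages
    # (the framing depends on the connector's format.class), parse each one
    # (e.g. MyRecord.FromString(...)), fix up the field formats, and return
    # a list of dicts with the 50 fields plus year/month/day/hour columns
    # derived from the record timestamp.
    raise NotImplementedError

# Each raw S3 object is read as a (path, bytes) pair
raw = sc.binaryFiles("s3://my-raw-bucket/topics/my-topic/")
rows = raw.flatMap(lambda path_bytes: decode_records(path_bytes[1]))

# Write the transformed records as Parquet, partitioned by time for Athena
spark.createDataFrame(rows) \
    .write.mode("append") \
    .partitionBy("year", "month", "day", "hour") \
    .parquet("s3://my-curated-bucket/my-topic/")

job.commit()
```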
4) If I want to query the data with Amazon Athena, how can I load the partitions (year, month, day, hour) automatically? With AWS Glue crawlers and schedulers?
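My current idea, based on the Confluent docs for the S3 sink (not yet tested), is to have the connector lay the objects out under Hive-style prefixes, roughly like this:

```properties
# Added to s3-sink.properties: write objects under year=/month=/day=/hour= prefixes
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record
```

Is a scheduled Glue crawler the intended way to keep those partitions registered in Athena, or is there a simpler approach?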