I'm having difficulty connecting to and retrieving data from a Kafka instance. Using Python's kafka-python module with the same connection parameters, I can connect, see the topic, and retrieve data, so the network is viable, there is no authentication problem, the topic exists, and it contains data.

Environment: R 4.0.5 with sparklyr 1.7.2, connecting to Kafka 2.8.

library(sparklyr)
spark_installed_versions()
#   spark hadoop                                      dir
# 1 2.4.7    2.7 /home/r2/spark/spark-2.4.7-bin-hadoop2.7
# 2 3.1.1    3.2 /home/r2/spark/spark-3.1.1-bin-hadoop3.2

sc <- spark_connect(master = "local", version = "2.4",
                    config = list(
                        sparklyr.shell.packages = "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0"
                    ))
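(An aside on the connect step, sketched as an assumption rather than a confirmed fix: the Kafka SQL connector artifact is usually pinned to the exact Spark release and Scala build in use, which for the installed 2.4.7/Scala 2.11 build would be `2.4.7` rather than `2.4.0`. Whether the mismatch matters here is untested.)

```r
## Untested sketch: pin the connector to the installed Spark version
## (2.4.7) and its Scala build (2.11), instead of 2.4.0.
library(sparklyr)

sc <- spark_connect(
  master  = "local",
  version = "2.4",
  config  = list(
    sparklyr.shell.packages =
      "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7"
  )
)
```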

system.time({
  Z <- stream_read_kafka(
    sc,
    options = list(
      kafka.bootstrap.servers="11.22.33.44:5555",
      subscribe = "mytopic"))
})
#    user  system elapsed
#   0.080   0.000  10.349

system.time(collect(Z))
#    user  system elapsed
#   1.336   0.136   8.537

Z
# # Source: spark<?> [inf x 7]
# # … with 7 variables: key <lgl>, value <lgl>, topic <chr>, partition <int>, offset <dbl>, timestamp <dbl>, timestampType <int>

My first concern is that I'm not seeing data from the topic: I appear to be getting a frame of (meta)data about topics in general, and nothing is found. This topic holds about 800 modest-to-small JSON strings. My second concern is that it takes almost 20 seconds to discover this (though I suspect that's a symptom of the larger connection problem).
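A variant I intend to try, sketched here untested: Spark's Kafka streaming source defaults to `startingOffsets = "latest"`, which would skip records already in the topic, and a structured stream must be written to a sink before its rows can be queried; `collect()` on the raw stream handle may not pull records. The `startingOffsets` option and `stream_write_memory()` are documented Spark/sparklyr features; the table name `kafka_tbl` and the sleep duration are mine.

```r
## Untested sketch: read from the earliest offsets, write the stream to an
## in-memory sink, then query that table. "kafka_tbl" is an arbitrary name.
library(sparklyr)
library(dplyr)

Z <- stream_read_kafka(
  sc,
  options = list(
    kafka.bootstrap.servers = "11.22.33.44:5555",
    subscribe = "mytopic",
    startingOffsets = "earliest"
  )
)

memsink <- stream_write_memory(Z, name = "kafka_tbl")
Sys.sleep(5)  # allow a micro-batch to complete
tbl(sc, "kafka_tbl") %>%
  mutate(value = as.character(value)) %>%  # Kafka delivers value as binary
  collect()
```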

For confirmation, this works:

library(reticulate)
cons <- import("kafka")$KafkaConsumer(bootstrap_servers = "11.22.33.44:5555",
                                      auto_offset_reset = "earliest",
                                      max_partition_fetch_bytes = 10240000L)
cons$subscribe("mytopic")
msg <- cons$poll(timeout_ms=30000L, max_records=99999L)
length(msg)
# [1] 1
length(msg[[1]])
# [1] 801
as.character( msg[[1]][[1]]$value )
# [1] "{\"TrackId\":\"c839dcb5-...\",...}"

(And those commands complete almost instantly, nothing like the 8-10 second lag above.)

The Kafka instance I'm connecting to runs ksqlDB, though I don't think that changes the need for the "org.apache.spark:spark-sql-kafka-.." Java package.

(Ultimately I'll be using stateless/stateful procedures on streaming data, including joins and window ops, so I'd like to not have to re-implement that from scratch on the simple kafka connection.)
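To illustrate the eventual goal, a sketch only: `stream_watermark()` is sparklyr's documented helper for late-data handling; the grouping column and threshold below are placeholders, not working code against this topic.

```r
## Illustrative only: once the plain read works, the plan is windowed,
## stateful aggregation on the stream.
library(sparklyr)
library(dplyr)

agg <- Z %>%
  stream_watermark(column = "timestamp", threshold = "30 seconds") %>%
  group_by(timestamp) %>%
  summarise(n = n())
```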

r2evans