
I've used Spark Structured Streaming to consume Kafka messages and save the data to Redis. By extending ForeachWriter[org.apache.spark.sql.Row], I used a Redis sink to save the data. The code runs correctly, but only a little more than 100 records are saved to Redis per second. Is there a better way to speed this up? Code like the example below connects and disconnects to the Redis server on every micro-batch; is there a way to connect once and keep the connection open, to minimize the connection cost, which I suspect is the main source of the slowdown? I tried broadcasting Jedis, but neither Jedis nor JedisPool is serializable, so that didn't work.

My sink code is below:

class StreamDataSink extends ForeachWriter[org.apache.spark.sql.Row] {

  var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    if (null == jedis) {
      jedis = FPCRedisUtils.getPool.getResource
    }
    true
  }

  override def process(record: Row): Unit = {
    if (0 == record(3)) {
      jedis.select(Constants.REDIS_DATABASE_INDEX)
      if (jedis.exists("counter")) {
        jedis.incr("counter")
      } else {
        jedis.set("counter", 1.toString)
      }
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (null != jedis) {
      jedis.close()
      jedis.disconnect()
    }
  }
}

Any suggestions will be appreciated.

lf.D

1 Answer


Don't call jedis.disconnect(). That actually closes the socket, forcing a new connection next time around. Use only jedis.close(); it returns the connection to the pool.

When you call INCR on a non-existing key, it is automatically created with a default value of zero and then incremented, resulting in a new key with value 1.

This simplifies your if-else to simply jedis.incr("counter").
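For example, in redis-cli (assuming a fresh database where counter does not yet exist):

```
redis> EXISTS counter
(integer) 0
redis> INCR counter
(integer) 1
redis> GET counter
"1"
```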

With this you have:

jedis.select(Constants.REDIS_DATABASE_INDEX)
jedis.incr("counter")

Review if you really need the SELECT. This is per connection and all connections default to DB 0. If all workloads sharing the same jedis pool are using DB 0, there is no need to call select.

If you do need both select and incr, then pipeline them:

val pipelined = jedis.pipelined()
pipelined.select(Constants.REDIS_DATABASE_INDEX)
pipelined.incr("counter")
pipelined.sync()

This will send the two commands in one network message, further improving your performance.
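Separately, on the connect-once concern from the question: since Jedis isn't serializable, a common workaround is to keep the pool in a per-JVM singleton object on the executors rather than broadcasting it. A lazy val in a Scala object is initialized once per executor JVM, so the pool survives across micro-batches. The sketch below is an assumption, not the asker's FPCRedisUtils: the RedisConnection object name, host, port, and pool settings are all illustrative.

```scala
import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical per-JVM holder: the lazy val is initialized once per
// executor, so the pool is created once and reused by every
// ForeachWriter instance on that executor.
object RedisConnection {
  lazy val pool: JedisPool = {
    val config = new JedisPoolConfig()
    config.setMaxTotal(32)                      // illustrative setting
    new JedisPool(config, "localhost", 6379)    // assumed host/port
  }
}

class StreamDataSink extends ForeachWriter[Row] {

  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = RedisConnection.pool.getResource    // borrow from the shared pool
    true
  }

  override def process(record: Row): Unit = {
    if (0 == record(3)) {
      jedis.incr("counter")                     // INCR creates the key if absent
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (null != jedis) {
      jedis.close()                             // return to pool; no disconnect()
    }
  }
}
```

The @transient annotation keeps the borrowed connection out of the serialized closure; each task re-borrows in open() and returns the connection in close(), so the underlying sockets stay pooled between micro-batches.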

LeoMurillo
  • Thanks very much. I refined my code as above, and it does work, since it pipes multiple instructions over one connection. The speed is now about 160 records per second (previously 110). Is this normal speed for Structured Streaming with Redis? By the way, I'm using a local environment with 4 cores and 16 GB of memory. – lf.D Jan 13 '20 at 10:00
  • I don't know about Structured Streaming, but on a local machine a client should give you tens of thousands of requests per second, even when slow. You could try a dummy sink without Redis, inheriting the same ForeachWriter but counting in a local int, to see what your baseline is. – LeoMurillo Jan 13 '20 at 10:12
  • The sink I used includes code that writes data to Cassandra at the same time. I tried writing only to Cassandra, without Redis, and it performs well, on the order of thousands of records per second. So I suspect the connection to Redis on every trigger is slowing down the whole procedure. – lf.D Jan 13 '20 at 10:22