
I tried using prepared statements as described in the official Cassandra and Scylla documentation, but performance is still around 30 seconds for 100,000 messages. Any ideas how I can improve this?

query = "INSERT INTO message (id, message) VALUES (?, ?)"
prepared = session.prepare(query)

for key in range(100000):
    try:
        session.execute_async(prepared, (0, "my example message"))
    except Exception as e:
        print("An error occurred: " + str(e))
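One thing worth noting about the snippet above: `execute_async` only queues the request, so if nothing waits on the returned futures the loop can flood the cluster, and the elapsed time says little about when the writes actually finished. A minimal sketch (assuming the DataStax Python driver, with the `session` and `prepared` statement created elsewhere) that caps the number of in-flight requests might look like this:

```python
from collections import deque

def insert_async_windowed(session, prepared, rows, window=200):
    """Keep at most `window` async requests in flight at a time.
    Returns the number of failed inserts."""
    in_flight = deque()
    failures = 0
    for params in rows:
        if len(in_flight) >= window:
            # Block on the oldest request before issuing a new one.
            try:
                in_flight.popleft().result()
            except Exception:
                failures += 1
        in_flight.append(session.execute_async(prepared, params))
    # Drain whatever is still pending.
    while in_flight:
        try:
            in_flight.popleft().result()
        except Exception:
            failures += 1
    return failures
```

With this windowed approach the client applies its own backpressure instead of piling 100,000 requests into the driver's queue at once.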

UPDATE

I found information that batches are highly recommended for improving performance, so I used prepared statements and batches in accordance with the official documentation. My code currently looks like this:

print("time 0: " + str(datetime.now()))
query = "INSERT INTO message (id, message) VALUES (uuid(), ?)"
prepared = session.prepare(query)

for key in range(100):
    print(key)
    try:
        batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
        for _ in range(100):
            batch.add(prepared, ("example message",))
        session.execute(batch)
    except Exception as e:
        print("An error occurred: " + str(e))

print("time 1: " + str(datetime.now()))

Do you have any idea why performance is so slow, and why running this code produces the output shown below?

time 0: 2018-06-19 11:10:13.990691
0
1
...
41
An error occurred: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for messages.message - received only 1 responses from 2 CL=QUORUM." info={'write_type': 'BATCH', 'required_responses': 2, 'consistency': 'QUORUM', 'received_responses': 1}
42
...
52
An error occurred: errors={'....0.3': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=.....0.3
53
An error occurred: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for messages.message - received only 1 responses from 2 CL=QUORUM." info={'write_type': 'BATCH', 'required_responses': 2, 'consistency': 'QUORUM', 'received_responses': 1}
54
...
59
An error occurred: Error from server: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for messages.message - received only 1 responses from 2 CL=QUORUM." info={'write_type': 'BATCH', 'required_responses': 2, 'consistency': 'QUORUM', 'received_responses': 1}
60
61
62
...
69
70
71
An error occurred: errors={'.....0.2': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=.....0.2
72
An error occurred: errors={'....0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=....0.1
73
74
...
98
99
time 1: 2018-06-19 11:11:03.494957
  • "I found information that it is highly recommended to use batches to improve performance." No it is not. BATCH is something to be used for ensuring atomicity of writes. But you actually take at least a 30% performance hit right off the top. Performance is probably terrible because you're making a coordinator node have to manage a large batch of INSERTs at QUORUM. – Aaron Jun 19 '18 at 12:49
  • What is the topology of your cluster? It is a single node? If so, are you bound by IO? – Carlos Rolo Jun 19 '18 at 13:54
  • @Aaron Do you have any suggestions how can I improve performance? – thedbogh Jun 20 '18 at 07:22
  • @CarlosRolo At the moment I have access to single node and I would like to achieve close to 100,000 records per second. Any ideas how can I achieve this? – thedbogh Jun 20 '18 at 07:31

1 Answer


On my local machine I get sub-second execution times for this kind of workload by heavily parallelizing the inserts.

➜  loadz ./loadz
execution time: 951.701622ms

I'm afraid I don't know how to do it in Python, but in Go it can look something like this:

package main

import (
  "fmt"
  "sync"
  "time"

  "github.com/gocql/gocql"
)

func main() {
  cluster := gocql.NewCluster("127.0.0.1")
  cluster.Keyspace = "mykeyspace"

  session, err := cluster.CreateSession()
  if err != nil {
      panic(err)
  }
  defer session.Close()

  workers := 1000
  ch := make(chan *gocql.Query, 100001)
  wg := &sync.WaitGroup{}
  wg.Add(workers)

  for i := 0; i < workers; i++ {
      go func() {
          defer wg.Done()
          for q := range ch {
              if err := q.Exec(); err != nil {
                  fmt.Println(err)
              }
          }
      }()
  }

  start := time.Now()
  for i := 0; i < 100000; i++ {
      ch <- session.Query("INSERT INTO message (id,message) VALUES (uuid(),?)", "the message")
  }
  close(ch)
  wg.Wait()
  dur := time.Since(start)
  fmt.Printf("execution time: %s\n", dur)
}

Please adjust the connection parameters as needed if you feel like testing it.
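Something similar is possible in Python. The sketch below assumes a `session` and a `prepared` statement already exist; a thread pool plays the role of the Go worker goroutines. (The Python driver also ships a dedicated helper, `cassandra.concurrent.execute_concurrent_with_args`, that does essentially the same thing.)

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_insert(session, prepared, rows, workers=50):
    """Run one synchronous insert per row on a pool of worker threads,
    mirroring the Go worker/channel pattern above.
    Returns the number of failed inserts."""
    def run(params):
        try:
            session.execute(prepared, params)
            return True
        except Exception as e:
            print(e)
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(1 for ok in pool.map(run, rows) if not ok)

# usage (with an existing session and prepared statement):
# failed = parallel_insert(session, prepared, [("the message",)] * 100000)
```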