
I have an application that does a lot of batchGets (mostly with ~2000 keys). Here is the code I use:

AerospikeClient aerospike = new AerospikeClient("10.0.240.2", port);

public byte[][] getBatch(byte[][] keys) {
    Key[] aeroKeys = new Key[keys.length];
    for (int i = 0; i < keys.length; i++) {
        aeroKeys[i] = new Key(NAMESPACE, setName, keys[i]);
    }
    Record[] records = aerospike.get(batchPolicy, aeroKeys);
    byte[][] response = new byte[keys.length][];

    for (int i = 0; i < keys.length; i++) {
        if (records[i] != null) {
            response[i] = (byte[]) records[i].getValue(DEFAULT_BIN_NAME);
        }
    }
    return response;
}

This code works perfectly and is fast when I issue a single request at a time. But when I run multiple parallel threads doing batchGets, it is extremely slow (the degradation is linear in the number of threads, e.g. 4 threads = 4x slower, 8 threads = 8x slower). I don't see much CPU or I/O usage in monitoring, so I suspect something is waiting, but I don't know what it is.
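
For reference, the parallel load is driven roughly like the sketch below (simplified; the thread count, the running flag and the nextKeyBatch() helper are illustrative, not my exact benchmark code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// N worker threads, all calling getBatch() on the same shared AerospikeClient
ExecutorService pool = Executors.newFixedThreadPool(8);
for (int t = 0; t < 8; t++) {
    pool.submit(() -> {
        while (running) {                    // hypothetical stop flag
            byte[][] keys = nextKeyBatch();  // hypothetical helper returning ~2000 keys
            getBatch(keys);
        }
    });
}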

I have tried many different configurations, and this is the config I have now (16 cores server):

    service-threads 16
    transaction-queues 16
    transaction-threads-per-queue 16
    batch-index-threads 16
    batch-max-buffers-per-queue 1000
    proto-fd-max 15000
    batch-max-requests 2000000

Any idea what's going on?

Edit 1: Namespace configuration:

namespace test {
    replication-factor 2
    memory-size 5G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    ldt-enabled true

    storage-engine device {
           file /data/aerospike.dat
           filesize 300G
           disable-odirect true
           write-block-size 1M
           max-write-cache 1G
        }
}

And latency stats:

$ asadm -e 'show latency' 
~~~~~~~~~~~~~~~~~~~~~~~~proxy Latency~~~~~~~~~~~~~~~~~~~~~~~~
Node                     Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
   .                     Span         .      .      .       .   
p      23:00:09-GMT->23:00:19       0.0    0.0    0.0     0.0   
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~~~~~query Latency~~~~~~~~~~~~~~~~~~~~~~~~
Node                     Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
   .                     Span         .      .      .       .   
p      23:00:09-GMT->23:00:19       0.0    0.0    0.0     0.0   
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~~~~~~reads Latency~~~~~~~~~~~~~~~~~~~~~~~~~
Node                     Time   Ops/Sec    >1Ms    >8Ms   >64Ms   
   .                     Span         .       .       .       .   
p      23:00:09-GMT->23:00:19   15392.1   92.67   62.89    6.03   
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~~~~~~udf Latency~~~~~~~~~~~~~~~~~~~~~~~~~
Node                     Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
   .                     Span         .      .      .       .   
p      23:00:09-GMT->23:00:19       0.0    0.0    0.0     0.0   
Number of rows: 1

~~~~~~~~~~~~~~~~~~~~writes_master Latency~~~~~~~~~~~~~~~~~~~~
Node                     Time   Ops/Sec   >1Ms   >8Ms   >64Ms   
   .                     Span         .      .      .       .   
p      23:00:09-GMT->23:00:19       0.0    0.0    0.0     0.0   
Number of rows: 1

Edit 2:

I think this issue is somehow related to the Java client. I did the following experiment: I created two instances of my application on different machines, both accessing a single Aerospike server, and put a load balancer in front to spread requests between the two application instances. With this configuration, I get double the throughput.

The Aerospike server now receives and responds properly to double the traffic it had before, when all requests came from a single application server. But if I look at my Java application server, it is not consuming CPU, so I am not CPU bound. The network is very busy during the requests: it shows 5 Gbps at the server.

So, with 5 application servers, each one with a single CPU, I can send 1 Gbps of network traffic to the server and it works. But if I have a single instance of the application on a machine with 8 cores, the requests to the server seem to queue up.

My code uses a single AerospikeClient instance for all requests, as recommended in the documentation. I never close this client connection; I keep it open for as long as the system is running.
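
For completeness, the client and batch policy are created with essentially default settings, roughly like the sketch below (illustrative, not my exact code; as far as I understand the client docs, BatchPolicy.maxConcurrentThreads defaults to 1, meaning the per-node sub-requests of a batch are issued sequentially by the calling thread, so with only one or two nodes changing it may not help much):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.BatchPolicy;
import com.aerospike.client.policy.ClientPolicy;

// Single shared client for the whole application, default ClientPolicy
ClientPolicy clientPolicy = new ClientPolicy();
AerospikeClient aerospike = new AerospikeClient(clientPolicy, "10.0.240.2", 3000); // 3000 = default service port

// Batch policy shared by all getBatch() calls
BatchPolicy batchPolicy = new BatchPolicy();
// 0 = issue all per-node batch requests in parallel (illustrative value, not a confirmed fix)
batchPolicy.maxConcurrentThreads = 0;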

Edit 3:

$ asloglatency -h reads
reads
Jan 04 2016 19:12:55
               % > (ms)
slice-to (sec)      1      8     64  ops/sec
-------------- ------ ------ ------ --------
19:13:25    10  87.27  14.87   0.00    242.7
19:13:35    10  89.44  21.90   0.00   3432.0
19:13:45    10  89.68  26.87   0.00   4981.6
19:13:55    10  89.61  25.62   0.00   5469.9
19:14:05    10  89.89  27.56   0.00   6190.8
19:14:15    10  90.59  33.84   0.30   6138.2
19:14:25    10  89.79  29.44   0.00   5393.2


ubuntu@aerospike1:~$ asloglatency -h batch_index_reads
batch_index_reads
Jan 04 2016 19:30:36
               % > (ms)
slice-to (sec)      1      8     64  ops/sec
-------------- ------ ------ ------ --------
19:30:46    10 100.00 100.00   3.33      3.0
19:30:56    10 100.00 100.00  23.40      9.4
19:31:06    10 100.00 100.00  27.59     11.6
19:31:16    10 100.00 100.00  31.30     13.1
19:31:26    10 100.00 100.00  30.00     13.0
19:31:36    10 100.00 100.00  27.14     14.0

Edit 4:

$ asadm -e "show distribution"
~~~~~~~~~~~~~~~~~~~test - TTL Distribution in Seconds~~~~~~~~~~~~~~~~~~
      Percentage of records having ttl less than or equal to value
                          measured in Seconds
      Node   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
aerospike1     0     0     0     0     0     0     0     0     0      0
aerospike2     0     0     0     0     0     0     0     0     0      0
Number of rows: 2

~~~~~~~~~~~~test - Object Size Distribution in Record Blocks~~~~~~~~~~~
        Percentage of records having objsz less than or equal to
                    value measured in Record Blocks
      Node   10%   20%   30%   40%   50%   60%   70%   80%   90%   100%
aerospike1     3     3     3     3     3    65    94    97   100    100
aerospike2     3     3     3     3     3    65    94    97   100    100

Edit 5:

$ asadm -e "show stat like batch"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NODE                         :   aerospike1                                                                                                                           aerospike2
batch_errors                 :   0                                                                                                                                    0
batch_index_complete         :   403423                                                                                                                               3751948
batch_index_created_buffers  :   8425500                                                                                                                              169997886
batch_index_destroyed_buffers:   8423984                                                                                                                              169994324
batch_index_errors           :   3                                                                                                                                    8305
batch_index_huge_buffers     :   7075094                                                                                                                              64191339
batch_index_initiate         :   403428                                                                                                                               3760270
batch_index_queue            :   0:0,0:0,0:0,1:99,1:205,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0   1:212,0:0,1:25,1:87,0:0,1:13,1:33,1:66,1:199,1:183,1:221,1:256,1:198,1:39,1:0,0:0,0:0,0:0,1:26,0:0,0:0,0:0,0:0,0:0,1:53,0:0,0:0,0:0,0:0,1:172,1:206,0:0
batch_index_timeout          :   0                                                                                                                                    0
batch_index_unused_buffers   :   1210                                                                                                                                 1513
batch_initiate               :   0                                                                                                                                    0
batch_queue                  :   0                                                                                                                                    0
batch_timeout                :   0                                                                                                                                    0
batch_tree_count             :   0                                                                                                                                    0

$ iostat -x 1 3
Linux 4.2.0-18-generic (aerospike1)     01/05/2016  _x86_64_    (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.31    0.00    7.56    5.50    0.00   84.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.79    4.69    0.97   152.23    21.28    61.31     0.01    2.37    2.43    2.07   1.46   0.83
sdb               0.03  1828.07 1955.80   62.88 87448.89  8924.42    95.48     3.81    1.88    0.55   43.26   0.08  15.45

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.57    0.00   18.91    0.00    0.00   76.52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.91    0.00   16.27    0.00    0.00   79.82

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    1.00    4.00     4.00    16.00     8.00     0.01    2.40   12.00    0.00   2.40   1.20
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Daniel Cukier
  • What version of the server are you running? – kporter Dec 29 '15 at 17:03
  • Could you share your namespace configuration (primarily if persisted or in memory and if persisted if data-in-memory is enabled)? And approximately how many records per batch? May also help to quantify "extremely slow", how much slower, what was the records per second with/without batch. Batch reads also affect the read histograms which could be useful here, we can grab that with `asadm -e 'show latency'`. – kporter Dec 29 '15 at 21:33
  • @kporter answered edited in the main question – Daniel Cukier Dec 29 '15 at 23:03
  • Was the asadm command run with the single or batch get job? Can you run with the other and indicate which is which? – kporter Dec 29 '15 at 23:39
  • The command was run while the system was working, with thousands of requests (batchGets and gets) – Daniel Cukier Dec 30 '15 at 01:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/99264/discussion-between-daniel-cukier-and-kporter). – Daniel Cukier Dec 30 '15 at 01:50
  • @kporter did you see the updates in this question? Any idea? – Daniel Cukier Jan 04 '16 at 20:37
  • 1) When you say "This code works perfectly and fast when I have a **single request**", I understood this to mean you were using the regular gets rather than the batch method, could you clarify? 2) Also we wanted to know what the distribution of object sizes was on this instance, could you run `asadm -e "show distribution"`? 3) All of your results you have provided are for a single node and I have assumed this to be the case, could you confirm this is for a single node cluster? 4) You are using an SSD and are persisting to a file, you should have better performance using SSDs as raw devices. – kporter Jan 04 '16 at 21:27
  • 1) "This code works perfectly and fast when I have a single request" means I issue single batchGet requests sequentially (instead of doing parallel batchGets in different threads). 2) answered in the question 3) I tried both with a single node and with two nodes, and the results are almost identical. 4) I also tried a raw device but could not see much improvement in performance with that. – Daniel Cukier Jan 04 '16 at 21:47
  • 5) Have you made any changes to the client's configuration? Would you be able to put your client code up somewhere accessible such as GitHub? 6) Try increasing [batch-max-unused-buffers](http://www.aerospike.com/docs/reference/configuration/#batch-max-unused-buffers) to prevent thrashing memory. 7) `asadm -e "show stat like batch"` 8) `iostat -x 1 3` – kporter Jan 05 '16 at 00:15
  • 5) AerospikeClient aerospike = new AerospikeClient("10.0.240.2", port) 6) no change in performance after increasing it to many different values 7) 8) in the question – Daniel Cukier Jan 05 '16 at 14:41
  • 9) Your batch stats show 2 nodes, are you using 2 nodes now (asadm looks for any node ever connected to the cluster, so in test environments such as this it may be talking to a node that you have removed)? 10) Regarding **batch-max-unused-buffers**, you can change it at runtime with `asadm -e "set-config:context=service;batch-max-unused-buffers=VALUE"`. Try tuning in modest increments until **batch_index_destroyed_buffers** stops increasing at peak load. – kporter Jan 06 '16 at 01:14
  • 11) If there are two nodes, then it would appear that most of the batch requests land on "aerospike2". Your aerospike1 has processed 400K batch requests while aerospike2 has processed 64M requests. This could indicate that you have one or a few hot keys that all reside on aerospike2; apart from that this seems really odd. 12) When you cease your batch load, do all values in **batch_index_queue** return to 0:0? – kporter Jan 06 '16 at 01:18
  • 13) You stated that each batch request is ~2000 records, what is the range? 14) Are the client-side networks able to keep up with the load? Check your %ifutil on clients and servers: `sar -n DEV 1 3`. 15) **show distribution** indicates that 50-60% of your data is larger than 8KB. With larger records you begin to lose a lot of the benefit of batch, since an 8KB record fills packets on its own. The real benefit comes when many records can coalesce into fewer packets. – kporter Jan 06 '16 at 01:47
  • 9) yes, we observed the same behaviour with a single node and with 2 nodes. 10) yes, we used this command and increased the value incrementally, with no better performance 11) we are using replication factor 2, so ideally all records are present on both servers. The client code references only one node, but I suppose it will discover the other nodes on its first request, no? Should I explicitly put all Aerospike server hosts in the client code? Could this indicate a bug in the client? 12) yes 13) what do you mean by range? – Daniel Cukier Jan 06 '16 at 14:15
  • 14) the client can handle more than 2 Gbps of network traffic (tested with iperf). When I run the test with a single request/s in my application, the client shows 170752 rxkB/s. When I run with 2 requests/s, it shows 200161 rxkB/s. Theoretically, it should be twice the value for 1 req/s. But both values are far from the 2 Gbps network capacity. 15) is there any solution for that? – Daniel Cukier Jan 06 '16 at 16:10
  • Curiously, aerospike2 has 2 CPUs and aerospike1 has 8 CPUs, so it is very odd that it is responding to more requests. When I look at the AMC panel, TPS for both servers looks the same (~3000). – Daniel Cukier Jan 06 '16 at 16:16
  • 16) Regarding #10, for batch requests only the master objects will be read. The unbalanced load is probably a result of the unbalanced hardware: aerospike1 is processing the batch requests faster than aerospike2, and the client must wait for the slowest node to respond to the batch. You should probably limit testing to a single server for now to reduce noise. – kporter Jan 06 '16 at 20:34

1 Answer


Your disk seems to be the bottleneck. Looking at the asadm latency output, over 60% of the reads are taking more than 8ms. You can cross-check this with the iostat command. Based on these latencies, I am guessing that you are using a rotational drive.

According to your config, your data is not in memory, so every read has to hit the disk, and all of those reads are random reads. This is not good for a rotational drive.

When data is only on disk, Aerospike recommends SSDs. If you use the data-in-memory option for the namespace, you can persist the data on a rotational drive while serving reads from RAM. Please read more about Aerospike's storage options.
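
For example, something along these lines in the namespace configuration keeps a full copy of the data in RAM while still persisting to the file (a sketch only; the memory-size value is a placeholder and must be large enough to hold your dataset plus primary index):

namespace test {
    replication-factor 2
    memory-size 48G              # placeholder: must fit the whole dataset plus primary index
    default-ttl 0

    storage-engine device {
        file /data/aerospike.dat
        filesize 300G
        data-in-memory true      # serve reads from RAM, persist writes to the file
        write-block-size 1M
    }
}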

sunil
  • actually I am using a local SSD drive and the disk I/O load is very low. I am sure this is not a disk bottleneck on the server, because when I use multiple instances of my application, the IOPS on the Aerospike server increase. – Daniel Cukier Dec 30 '15 at 13:56
  • Definitely the read latencies are way higher than normal. Typically 90% or more will be under 1ms. I still suspect your I/O subsystem. You can enable microbenchmarks and storage-benchmarks and use the asloglatency tool to get deeper insight into which is the slowest link. – sunil Dec 31 '15 at 02:39
  • I edited the question and added the output of asloglatency with microbenchmarks and storage-benchmarks enabled – Daniel Cukier Jan 04 '16 at 19:16
  • the result is obvious: batch_index_reads is the bottleneck, since most operations I do are batch reads. But why is it taking so long to respond? Is there any way to improve batch read latency? – Daniel Cukier Jan 04 '16 at 19:30
  • I do not see the output of the storage benchmarks from Aerospike in your edited question. Am I missing something? If not, can you add them? – sunil Jan 06 '16 at 19:02