I have an application that does a lot of batchGets (mostly with ~2000 keys). Here is the code I use:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

// NAMESPACE, setName, DEFAULT_BIN_NAME and batchPolicy are instance fields.
AerospikeClient aerospike = new AerospikeClient("10.0.240.2", port);

public byte[][] getBatch(byte[][] keys) {
    // Build one Aerospike key per input key.
    Key[] aeroKeys = new Key[keys.length];
    for (int i = 0; i < keys.length; i++) {
        aeroKeys[i] = new Key(NAMESPACE, setName, keys[i]);
    }

    // Single batch read for all ~2000 keys.
    Record[] records = aerospike.get(batchPolicy, aeroKeys);

    // Extract the bin value for each record; entries stay null for keys not found.
    byte[][] response = new byte[keys.length][];
    for (int i = 0; i < keys.length; i++) {
        if (records[i] != null) {
            response[i] = (byte[]) records[i].getValue(DEFAULT_BIN_NAME);
        }
    }
    return response;
}
This code works perfectly and is fast for a single request. But when I run multiple parallel threads doing batchGets, it becomes extremely slow, and the degradation is linear in the number of threads (e.g. 4 threads = 4x slower, 8 threads = 8x slower). I don't see much CPU or I/O usage in monitoring, so I suspect something is blocked waiting, but I don't know on what.
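For reference, this is roughly how the parallel load is generated (a minimal sketch; the thread count and the nextKeyBatch helper are illustrative, not my exact harness):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

ExecutorService pool = Executors.newFixedThreadPool(8); // e.g. 8 parallel callers
for (int t = 0; t < 8; t++) {
    pool.submit(() -> {
        byte[][] keys = nextKeyBatch(2000); // hypothetical helper producing ~2000 keys
        long start = System.nanoTime();
        getBatch(keys); // the method shown above
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("batchGet took " + ms + " ms");
    });
}
pool.shutdown();

Each thread on its own is fast; it is only the combination of threads that slows down.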
I have tried many different configurations, and this is the config I have now (16-core server):
service-threads 16
transaction-queues 16
transaction-threads-per-queue 16
batch-index-threads 16
batch-max-buffers-per-queue 1000
proto-fd-max 15000
batch-max-requests 2000000
Any idea what's going on?
Edit 1: Namespace configuration:
namespace test {
    replication-factor 2
    memory-size 5G
    default-ttl 0 # 30 days, use 0 to never expire/evict.
    ldt-enabled true

    storage-engine device {
        file /data/aerospike.dat
        filesize 300G
        disable-odirect true
        write-block-size 1M
        max-write-cache 1G
    }
}
And latency stats:
$ asadm -e 'show latency'
~~~~~~~~~~~~~~~~~~~~~~~~proxy Latency~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
p 23:00:09-GMT->23:00:19 0.0 0.0 0.0 0.0
Number of rows: 1
~~~~~~~~~~~~~~~~~~~~~~~~query Latency~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
p 23:00:09-GMT->23:00:19 0.0 0.0 0.0 0.0
Number of rows: 1
~~~~~~~~~~~~~~~~~~~~~~~~~reads Latency~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
p 23:00:09-GMT->23:00:19 15392.1 92.67 62.89 6.03
Number of rows: 1
~~~~~~~~~~~~~~~~~~~~~~~~~udf Latency~~~~~~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
p 23:00:09-GMT->23:00:19 0.0 0.0 0.0 0.0
Number of rows: 1
~~~~~~~~~~~~~~~~~~~~writes_master Latency~~~~~~~~~~~~~~~~~~~~
Node Time Ops/Sec >1Ms >8Ms >64Ms
. Span . . . .
p 23:00:09-GMT->23:00:19 0.0 0.0 0.0 0.0
Number of rows: 1
Edit 2:
I think this issue is somehow related to the Java client. I ran the following experiment: I created two instances of my application on different machines, both accessing a single Aerospike server, and put a load balancer in front to spread requests between the two instances. With this configuration, I get double the throughput.
The Aerospike server now receives and responds properly to double the traffic it handled before, when all requests came from a single application server. But the Java application machine is not consuming much CPU, so I am not CPU bound. The network is very busy during the requests: it shows 5 Gbps on the server side.
So with 5 application servers, each with a single CPU, I can send 1 Gbps of network traffic each to the server and it works. But with a single instance of the application on an 8-core machine, the requests seem to queue up on the client side.
My code uses a single AerospikeClient instance for all requests, as recommended in the documentation. I never close this client; I keep it open for the entire lifetime of the application.
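In case it helps, this is how the shared client and the batchPolicy used above are created (a sketch; the policy values shown are illustrative. In the Java client, BatchPolicy.maxConcurrentThreads controls how many node requests a single batch call issues in parallel: 1 means sequential, 0 means all nodes in parallel):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.BatchPolicy;
import com.aerospike.client.policy.ClientPolicy;

ClientPolicy clientPolicy = new ClientPolicy();
AerospikeClient aerospike = new AerospikeClient(clientPolicy, "10.0.240.2", port);

BatchPolicy batchPolicy = new BatchPolicy();
batchPolicy.maxConcurrentThreads = 0; // 0 = contact all cluster nodes in parallel per batch

The client instance itself is thread-safe, so all of my worker threads share it.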
Edit 3:
$ asloglatency -h reads
reads
Jan 04 2016 19:12:55
% > (ms)
slice-to (sec) 1 8 64 ops/sec
-------------- ------ ------ ------ --------
19:13:25 10 87.27 14.87 0.00 242.7
19:13:35 10 89.44 21.90 0.00 3432.0
19:13:45 10 89.68 26.87 0.00 4981.6
19:13:55 10 89.61 25.62 0.00 5469.9
19:14:05 10 89.89 27.56 0.00 6190.8
19:14:15 10 90.59 33.84 0.30 6138.2
19:14:25 10 89.79 29.44 0.00 5393.2
ubuntu@aerospike1:~$ asloglatency -h batch_index_reads
batch_index_reads
Jan 04 2016 19:30:36
% > (ms)
slice-to (sec) 1 8 64 ops/sec
-------------- ------ ------ ------ --------
19:30:46 10 100.00 100.00 3.33 3.0
19:30:56 10 100.00 100.00 23.40 9.4
19:31:06 10 100.00 100.00 27.59 11.6
19:31:16 10 100.00 100.00 31.30 13.1
19:31:26 10 100.00 100.00 30.00 13.0
19:31:36 10 100.00 100.00 27.14 14.0
Edit 4:
$ asadm -e "show distribution"
~~~~~~~~~~~~~~~~~~~test - TTL Distribution in Seconds~~~~~~~~~~~~~~~~~~
Percentage of records having ttl less than or equal to value
measured in Seconds
Node 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
aerospike1 0 0 0 0 0 0 0 0 0 0
aerospike2 0 0 0 0 0 0 0 0 0 0
Number of rows: 2
~~~~~~~~~~~~test - Object Size Distribution in Record Blocks~~~~~~~~~~~
Percentage of records having objsz less than or equal to
value measured in Record Blocks
Node 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
aerospike1 3 3 3 3 3 65 94 97 100 100
aerospike2 3 3 3 3 3 65 94 97 100 100
Edit 5:
$ asadm -e "show stat like batch"
~~~~~~~~~~~~~~~~~~~~Service Statistics~~~~~~~~~~~~~~~~~~~~
NODE : aerospike1 aerospike2
batch_errors : 0 0
batch_index_complete : 403423 3751948
batch_index_created_buffers : 8425500 169997886
batch_index_destroyed_buffers: 8423984 169994324
batch_index_errors : 3 8305
batch_index_huge_buffers : 7075094 64191339
batch_index_initiate : 403428 3760270
batch_index_queue : 0:0,0:0,0:0,1:99,1:205,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0 1:212,0:0,1:25,1:87,0:0,1:13,1:33,1:66,1:199,1:183,1:221,1:256,1:198,1:39,1:0,0:0,0:0,0:0,1:26,0:0,0:0,0:0,0:0,0:0,1:53,0:0,0:0,0:0,0:0,1:172,1:206,0:0
batch_index_timeout : 0 0
batch_index_unused_buffers : 1210 1513
batch_initiate : 0 0
batch_queue : 0 0
batch_timeout : 0 0
batch_tree_count : 0 0
$ iostat -x 1 3
Linux 4.2.0-18-generic (aerospike1) 01/05/2016 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.31 0.00 7.56 5.50 0.00 84.64
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.79 4.69 0.97 152.23 21.28 61.31 0.01 2.37 2.43 2.07 1.46 0.83
sdb 0.03 1828.07 1955.80 62.88 87448.89 8924.42 95.48 3.81 1.88 0.55 43.26 0.08 15.45
avg-cpu: %user %nice %system %iowait %steal %idle
4.57 0.00 18.91 0.00 0.00 76.52
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
3.91 0.00 16.27 0.00 0.00 79.82
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 1.00 4.00 4.00 16.00 8.00 0.01 2.40 12.00 0.00 2.40 1.20
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00