
I have the following situation: we read packets (700-800 bytes) from a TAP interface, process them, and send them back to the same interface. We start 3 processes for each packet.

When 1-8 packets arrive within 1-10 milliseconds, processing takes 1-5 ms per packet. But when we get more (20-50), something seems to happen and the schedulers can just sleep for 1-3 seconds.

At first I thought it was the NIF that writes packets to the interface, since it takes 1-2 ms. But when we send packets from a remsh (remote shell) instead, the situation repeats.
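Roughly, the per-packet fan-out looks like the sketch below (the module name and the three worker functions are placeholders, not our real code; it only shows the shape of the flow):

    -module(packet_fanout).
    -export([on_packet/1]).

    %% Sketch only: three short-lived processes are spawned for every packet.
    on_packet(Packet) ->
        spawn(fun() -> decode(Packet) end),
        spawn(fun() -> process(Packet) end),
        spawn(fun() -> send_back(Packet) end),
        ok.

    decode(_Packet)    -> ok.  %% placeholder
    process(_Packet)   -> ok.  %% placeholder
    send_back(_Packet) -> ok.  %% placeholder for the NIF that writes to the TAP interface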

I ran this snippet and started sending packets:

    lists:foreach(fun(_) ->
                      lager:log(error, [], "queue ~p ~n", [erlang:statistics(run_queue_lengths)]),
                      timer:sleep(5)
                  end, lists:seq(1, 200 * 5)).

The result is awful:

2018-03-29 21:43:52.474 [error]  queue [1,0]
2018-03-29 21:43:52.480 [error]  queue [1,0]
2018-03-29 21:43:52.486 [error]  queue [0,0]
2018-03-29 21:43:52.492 [error]  queue [0,0]
2018-03-29 21:43:52.498 [error]  queue [0,0]
2018-03-29 21:43:52.504 [error]  queue [0,0]
2018-03-29 21:43:52.510 [error]  queue [1,0]
2018-03-29 21:43:52.516 [error]  queue [0,0]
2018-03-29 21:43:56.422 [error]  queue [0,0]
2018-03-29 21:43:56.630 [error]  queue [1,2]
2018-03-29 21:43:56.900 [error]  queue [0,0]
2018-03-29 21:43:56.905 [error]  queue [0,0]
2018-03-29 21:43:57.548 [error]  queue [1,2]
2018-03-29 21:43:57.554 [error]  queue [0,4]
2018-03-29 21:43:57.560 [error]  queue [1,0]
2018-03-29 21:43:57.567 [error]  queue [2,1]
2018-03-29 21:43:57.572 [error]  queue [0,0]
2018-03-29 21:43:57.578 [error]  queue [0,0]
2018-03-29 21:43:57.584 [error]  queue [0,0]
2018-03-29 21:43:57.590 [error]  queue [0,1]

The schedulers clearly have tasks to run, but they don't run them.

Any suggestions on where to look?
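To cross-check whether the schedulers are parked or genuinely busy during such a gap, a similar shell snippet could sample scheduler wall time (a sketch; the scheduler_wall_time flag has to be enabled first, and the 1-second window is arbitrary):

    erlang:system_flag(scheduler_wall_time, true),
    First = lists:sort(erlang:statistics(scheduler_wall_time)),
    timer:sleep(1000),
    Second = lists:sort(erlang:statistics(scheduler_wall_time)),
    %% per-scheduler utilisation over the window: delta(active) / delta(total)
    Util = [{Id, (A1 - A0) / max(T1 - T0, 1)}
            || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(First, Second)],
    lager:log(error, [], "scheduler util ~p~n", [Util]).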

I have tried this many times and I see the same pattern: the sleep occurs when one of the schedulers has a larger run queue.

2018-03-30 16:47:57.952 [trace]  alive [0,0]
2018-03-30 16:47:57.963 [trace]  alive [1,0]
2018-03-30 16:47:57.974 [trace]  alive [0,0]
2018-03-30 16:47:57.985 [trace]  alive [1,0]
2018-03-30 16:47:57.996 [trace]  alive [0,0]
2018-03-30 16:47:58.410 [trace]  alive [40,3]
2018-03-30 16:47:58.431 [trace]  alive [20,1]
2018-03-30 16:47:58.450 [trace]  alive [14,0]
2018-03-30 16:47:58.461 [trace]  alive [0,0]
2018-03-30 16:47:58.472 [trace]  alive [0,0]
2018-03-30 16:47:58.483 [trace]  alive [1,0]
2018-03-30 16:47:58.494 [trace]  alive [0,0]

and

2018-03-30 17:04:59.446 [trace]  alive [0,0]
2018-03-30 17:04:59.839 [trace]  alive [0,0]
2018-03-30 17:05:00.367 [trace]  alive [3,0]
2018-03-30 17:05:00.676 [trace]  alive [70,1]
2018-03-30 17:05:01.534 [trace]  alive [0,0]
2018-03-30 17:05:01.702 [trace]  alive [2,0]
2018-03-30 17:05:01.713 [trace]  alive [0,0]
2018-03-30 17:05:01.724 [trace]  alive [0,0]
  • Have you tried running your NIF on dirty schedulers instead of regular schedulers? The numbers you've provided suggest that you're likely violating the maximum amount of time a NIF should run on a regular scheduler. To verify this you might turn on `long_schedule` monitoring via [erlang:system_monitor/2](http://erlang.org/doc/man/erlang.html#system_monitor-2) for events exceeding 1-2ms. – Steve Vinoski Mar 30 '18 at 03:33
  • @SteveVinoski It already runs on dirty schedulers. 2 cores, 2 regular schedulers and 2 dirty. – Alexander Shavelev Mar 30 '18 at 06:14
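
Following up on the long_schedule suggestion above, a minimal sketch of enabling it could look like this (the 2 ms threshold is an assumption to be tuned against the NIF timings mentioned in the question):

    %% Sketch: log every process/port that stays scheduled for >= 2 ms.
    Monitor = spawn(fun Loop() ->
                        receive
                            {monitor, PidOrPort, long_schedule, Info} ->
                                lager:log(error, [], "long_schedule ~p ~p~n", [PidOrPort, Info]),
                                Loop()
                        end
                    end),
    erlang:system_monitor(Monitor, [{long_schedule, 2}]).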
