
I'm using ImageMagick to convert a JPG to a TIF file, and I'm using the -limit options for ImageMagick as shown below:

/usr/bin/convert -limit memory 256 -limit map 512 subjectfile.jpg -colorspace Gray -depth 8 -resample 200x200 output.tif

When I run the above command, the load on the server suddenly goes very high and the CPUs spend most of their time in I/O wait, as shown below:

Tasks: 245 total,   3 running, 241 sleeping,   0 stopped,   1 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  1.0%sy,  0.0%ni,  0.0%id, 93.1%wa,  0.0%hi,  5.0%si,  0.0%st
Cpu3  :  6.9%us,  1.0%sy,  0.0%ni, 90.1%id,  0.0%wa,  1.0%hi,  1.0%si,  0.0%st
Mem:   4148160k total,  3980380k used,   167780k free,    18012k buffers
Swap:  4096552k total,       96k used,  4096456k free,  3339884k cached

The iostat output during this period shows the following:

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  7361.00 62.00 137.00  3712.00 37180.00   410.97   128.13  120.48   5.04 100.20
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00  7361.00 62.00 137.00  3712.00 37180.00   410.97   128.13  120.48   5.04 100.20
sdb               0.00  7368.00  0.00 144.00     0.00 33136.00   460.22   133.84  203.48   6.96 100.20
sdb1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb3              0.00  7368.00  0.00 144.00     0.00 33136.00   460.22   133.84  203.48   6.96 100.20
md1               0.00     0.00 61.00 17711.00  3648.00 70844.00     8.38     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  1193.00  0.00 470.00     0.00 14200.00    60.43    91.07  216.34   2.02  95.00
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00  1193.00  0.00 470.00     0.00 14200.00    60.43    91.07  216.34   2.02  95.00
sdb               0.00  1138.00  0.00 410.00     0.00  8700.00    42.44   141.31  119.61   2.44 100.20
sdb1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb3              0.00  1138.00  0.00 410.00     0.00  8700.00    42.44   141.31  119.61   2.44 100.20
md1               0.00     0.00  0.00 5226.00     0.00 20904.00     8.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00  1472.28  0.00 483.17     0.00  7821.78    32.38     5.52   11.43   0.52  25.05
sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda3              0.00  1472.28  0.00 483.17     0.00  7821.78    32.38     5.52   11.43   0.52  25.05
sdb               0.00  1511.88  0.00 410.89     0.00 10047.52    48.91   143.60  171.46   2.42  99.31
sdb1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb3              0.00  1511.88  0.00 410.89     0.00 10047.52    48.91   143.60  171.46   2.42  99.31
md1               0.00     0.00  0.00 778.22     0.00  3112.87     8.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

I'm not very familiar with Linux I/O performance, but by reading up on the internet I managed to gather some stats from blktrace while this was happening, which show:

==================== All Devices ====================
          ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000499   0.000486353   1.158217913      172004
Q2G               0.000000258   0.000059510   0.198865402      343500
S2G               0.000128922   0.010945336   0.198863747        1840
G2I               0.000000214   0.000000517   0.000168407      343504
Q2M               0.000000190   0.000000519   0.000122999      344516
I2D               0.000000879   0.016310824   0.305521347      342948
M2D               0.000000951   0.007473560   0.205691209      344492
D2C               0.000083899   0.002041770   0.160452919      171859
Q2C               0.000092851   0.013953825   0.317186332      171859

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |   0.8524%   0.0074%   0.0075% 233.2591%  14.6323%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.8524%   0.0074%   0.0075% 233.2591%  14.6323%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8,  0) |   343516   343516     1.0 |        8       16     1024  5650976

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8,  0) |          172005      27058614.9               0 | 0(123703)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |          172005      27058614.9               0 | 0(123703)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE           
---------- | --------------- --------------- --------------- | ---------------
 (  8,  0) |          343516       9204796.3               0 | 0(310240)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE           
   Average |          343516       9204796.3               0 | 0(310240)

Running btt with the -A option shows this:

==================== Per Process ====================

          Q2Qdm           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

          Q2Adm           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

          Q2Cdm           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------


            Q2Q           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
convert           0.085368267   9.765798951  24.050329666           3
md1_raid1         0.000000730   0.000493657   1.158217913      169459
mysqld            0.000001386   0.018154085  14.221072636        2146
sh                0.005889458   0.322064972   1.423632298           5

            Q2A           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

            Q2G           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
convert           0.000000539   0.000003194   0.000005260          16
md1_raid1         0.000000258   0.000060580   0.198865402      333440
mysqld            0.000000270   0.000028381   0.058359194        8476
sh                0.000000506   0.000000827   0.000001610          24

            S2G           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
md1_raid1         0.000128922   0.010842039   0.198863747        1836
mysqld            0.058358625   0.058358625   0.058358625           4
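
For reference, the stats above were gathered with commands roughly of this form (picked up while reading around; the device name is just what applies to my box):

blktrace -d /dev/sda -o trace        # stopped with Ctrl-C once the convert run had finished
blkparse -i trace -d trace.bin
btt -i trace.bin -A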

I'm using the following I/O scheduler:

# cat /sys/block/sd*/queue/scheduler 
noop anticipatory deadline [cfq] 
noop anticipatory deadline [cfq]

So my question is: why is the average (AVG) Q2Q value so high for the convert (ImageMagick) utility when I use it with the -limit options?

convert           0.085368267   9.765798951  24.050329666           3

I don't see this high-load problem when I use convert (ImageMagick) without the -limit options, so can you please help me understand why the load shoots up when I try to limit the resources used by ImageMagick's convert utility with the -limit options, and how I can fix this issue?

4 Answers


I have run your exact command line (although with a different picture, I presume ;-) ) with and without the limit options, and I found the problem: the limit values are interpreted as bytes. What does that mean?

You set 256 bytes for the maximum memory and 512 bytes for the file mapping in memory. So instead of reading nice big buffer chunks, convert reads one filesystem block at a time (or even less if you have a 4K-sector hard disk), which generates a lot of unnecessary I/O.

What you want to do is set 256MiB and 512MiB (be careful to respect the case: MiB, not mib or MIB!):

/usr/bin/convert -limit memory 256MiB -limit map 512MiB subjectfile.jpg -colorspace Gray -depth 8 -resample 200x200 output.tif

With this command I get about the same real time as when not using -limit, and more or less the same I/O activity as without the limit options.
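
As a quick sanity check, ImageMagick can also print the resource limits currently in effect (defaults plus anything set via policy.xml or the MAGICK_* environment variables); something like this works here, though the exact listing varies between versions:

identify -list resource

Seeing the default Memory and Map values makes it obvious how far a bare 256/512 is from the intended 256MiB/512MiB.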

Huygens
  • Thanks! I'm running an image-converting server, and CPU sys time spiked after I added the limits. Your answer fixed my issue! – Slashterix Apr 10 '15 at 16:45

High server load is not always bad news, because the load average is a reflection of how much work the CPU has queued up to do. If those processes are waiting on disk I/O, for example, and the disk happens to be quite slow, then high load averages will result, but without any obvious impact on performance.

Not only that, but looking at your CPU usage, it doesn't appear the CPUs are doing much else apart from running ImageMagick. It's perfectly normal for a process to take the majority of CPU time if it's the only significant process running.

If you really want to make sure it does not hog the CPU, then use the nice command to increase its niceness. But again, if it's the only process of note running, then odds are even a renice won't make much difference.

Upshot: if the convert process is not adversely affecting the performance of other programs, don't worry about it!
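
If you do decide to deprioritise it anyway, a sketch along these lines should do; the priority values are only examples, and ionice has an effect here because you are on the CFQ scheduler:

nice -n 19 ionice -c2 -n7 /usr/bin/convert subjectfile.jpg -colorspace Gray -depth 8 -resample 200x200 output.tif

nice lowers the CPU priority and ionice lowers the I/O priority, so anything else running gets serviced first when the box is busy.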

GeoSword
  • Thanks for your response. It's causing the server to crash occasionally, as sometimes the load goes over 100%, so in the end the server needs a hard reboot to get back to normal. I don't know how I can fix this high-load problem. – Imran Imtiaz Jan 26 '13 at 23:14

When Q2Q goes high, it means there are time gaps between successive requests coming into the device queue from the application. The application issues an I/O, but the next I/O is not for the next sector of the disk: the disk head has to seek somewhere else, and only when it reaches the proper sector is the next I/O issued. This kind of I/O is called random I/O. Random I/O increases the Q2Q time, i.e. the time between requests sent to the block layer.

You haven't shown a comparative blktrace without the -limit option, but I would assume that when used without -limit the convert command isn't doing random I/O, or at least the randomness is reduced to some extent.

See this line from iostat:

sdb3              0.00  1138.00  0.00 410.00     0.00  8700.00    42.44   141.31  119.61   2.44 100.20

You can see that await is 119.61 while svctm is 2.44 and %util is 100.20. Note that avgqu-sz is large, quite large at 141.31. So the I/Os are waiting in the block layer before being serviced, and the high avgqu-sz also means there are pending I/Os still waiting to reach the block layer. Again, this is an indication that we are not seeing a sequential stream of I/Os but random chunks, and that is why svctm is perfectly fine while await and avgqu-sz are high.

So it boils down to how the application is issuing I/O, and I don't think you can control much of that. Try the convert command without the limits and see whether you get a stream of sequential I/O.

Also, can you please tell us the throughput of the disks, in the disk vendor's terms (IOPS), so that I can also tell whether you are saturating your disks?

And by the way, good job with the blktrace output. You might also want to go ahead and make a PDF using the seekwatcher utility.
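
For the comparison, tracing one run with -limit and one without, then graphing each trace, usually makes the seek pattern obvious at a glance. Roughly like this (device and file names are just examples, and the output is PNG here; other formats depend on your matplotlib backend):

blktrace -d /dev/sda -o with_limit     # run convert with -limit in another shell, then Ctrl-C
seekwatcher -t with_limit -o with_limit.png

blktrace -d /dev/sda -o no_limit       # repeat without -limit
seekwatcher -t no_limit -o no_limit.png

A scattered cloud of dots in the first graph and a tight band in the second would confirm the random-versus-sequential theory.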

Soham Chakraborty
  • Thanks for your response. Here's the information about both disks in the RAID: http://pastebin.com/fhqrR7N1 - Looking forward to hearing from you on the disks. – Imran Imtiaz Jan 27 '13 at 23:21
  • And these are the results from the hdparm test: http://pastebin.com/kpNcjJkB – Imran Imtiaz Jan 27 '13 at 23:55
  • Also, one question: as both drives are in RAID, why is sdb showing more load than sda? Both devices should get the same I/O requests. – Imran Imtiaz Jan 27 '13 at 23:57
  • Only on one partition, because the I/O is not being striped. The application needs to write data in striped fashion and subsequently read the data in striped fashion; in your case, one of those two things is borked. The two disks you mentioned are different models. However, I would rather say that it is some sort of application bottleneck before I consider this a disk bottleneck. – Soham Chakraborty Jan 28 '13 at 11:09

From the arguments you pass to ImageMagick, you look to be after a 200x200 DPI TIFF file... yet your iostat output shows a 70MB/s write-out to your RAID array and an absurdly high number of I/O transactions per second.

Here's my suspicion of what is happening, though I don't know enough about ImageMagick to confirm it. By limiting memory to 256 bytes and the memory map to 512 bytes, ImageMagick is forced to use the disk as intermediary storage to process the image, since it effectively cannot use memory. As a result, it ends up consuming a huge amount of I/O throughput as it pages chunks in and out, causing the system to lock up.

Why are you setting the limits so low? In fact, why do you have them at all? Setting them higher will significantly reduce the amount of paging ImageMagick has to do. If you don't trust the images coming in and you are trying to keep them from consuming too many resources, put a cap on the size of the images you permit for conversion, set the memory limit to, say, 2-4x that cap, limit the number of simultaneous conversions, and for extra paranoia, limit the command's execution time to prevent an intentionally malformed image from cycling forever.
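
A rough sketch of that kind of guard rail, with placeholder numbers you would tune for your own workload (timeout is from GNU coreutils, and the disk limit is my addition):

timeout 300 /usr/bin/convert -limit memory 256MiB -limit map 512MiB -limit disk 1GiB subjectfile.jpg -colorspace Gray -depth 8 -resample 200x200 output.tif

That kills any conversion that runs longer than five minutes and keeps a single conversion from monopolising RAM or scratch space; capping the number of simultaneous conversions would then be handled by whatever queues the jobs.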

Scrivener