11

I'm writing real-time data to an empty spinning disk sequentially. (EDIT: It doesn't have to be sequential, as long as I can read it back as if it was sequential.) The data arrives at a rate of 100 MB/s and the disks have an average write speed of 120 MB/s.

Sometimes (especially as free space starts to decrease) the disk speed goes under 100 MB/s depending on where on the platter the disk is writing, and I have to drop vital data.

Is there any way to write to the disk in a pattern (or some other way) to ensure a constant write speed close to the average rate, regardless of how much data there currently is on the disk?

EDIT:

Some notes on why I think this should be possible.

When writing to the disk normally, it starts in the fast portion of the platter and then writes towards the slower parts. However, if I could write half the data to the fast part and half the data to the slow part (i.e. for 1 second it could write 50 MB to the fast part and 50 MB to the slow part), the two write positions should meet in the middle, and I could possibly achieve a constant rate?

As a programmer, I am not sure how I can decide where on the platter the data is written or even if the OS can achieve something similar.
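
To make the idea concrete, here is a rough sketch of the arithmetic (the 140 MB/s and 80 MB/s zone speeds are made-up assumptions, not measurements). Splitting each chunk in proportion to the zone speeds means both write positions advance for the same amount of time, and the effective rate becomes the average of the two zone speeds:

```cpp
// Minimal sketch of the "split each chunk between a fast and a slow zone" idea.
// The zone speeds below are hypothetical, not measured values.
#include <cstdio>

int main() {
    const double fast_mbps = 140.0;    // assumed outer-zone sequential write speed
    const double slow_mbps = 80.0;     // assumed inner-zone sequential write speed
    const double source_mbps = 100.0;  // incoming data rate from the question

    // Split each chunk in proportion to the zone speeds, so both write
    // positions spend the same amount of time per chunk.
    const double fast_share = fast_mbps / (fast_mbps + slow_mbps);
    const double slow_share = 1.0 - fast_share;

    // Disk time needed to store one second's worth of data, ignoring seeks:
    const double seconds_per_source_second =
        (fast_share * source_mbps) / fast_mbps +
        (slow_share * source_mbps) / slow_mbps;

    // The effective rate works out to the arithmetic mean of the two zone speeds.
    const double effective_mbps = source_mbps / seconds_per_source_second;

    std::printf("per second: %.1f MB to the fast zone, %.1f MB to the slow zone\n",
                fast_share * source_mbps, slow_share * source_mbps);
    std::printf("disk time used per second of data: %.3f s (effective %.1f MB/s)\n",
                seconds_per_source_second, effective_mbps);
}
```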

ronag
  • If the disk is almost full and the only empty portion is the slow portion, then it doesn't sound possible. – Mysticial Jun 19 '14 at 19:28
  • Yes, which is why you would want to avoid getting only the slow portion when it's getting full. – ronag Jun 19 '14 at 19:31
  • I.e. maybe writing to both slow and fast portion in an even manner all the way? – ronag Jun 19 '14 at 19:32
  • Unfortunately, trying to spread the writes over different parts of the disk will involve seeking (a lot). Seeking is slow enough that doing it (even close to) that much would drop your throughput substantially below the rated bandwidth. – Jerry Coffin Jun 19 '14 at 19:45
  • Just throw a raid 0 array at the problem. – Hans Passant Jun 19 '14 at 20:35
  • ...or a Solid State Disk. – DevSolar Jun 19 '14 at 20:44
  • @JerryCoffin You yourself decide how often you seek. It's a tradeoff between buffer sizes and throughput. The buffer size you need is normally much larger than the chunk written to the disk in one go, since the disk may have some worst-case latency for seeks due to dumping housekeeping data and doing relocations. – Kuba hasn't forgotten Monica Jun 19 '14 at 22:06
  • RAID 0 won't work due to chassis constraints, and SSDs are too expensive and too small. – ronag Jun 19 '14 at 23:51
  • What's the data source? Does it have to run on windows? Does processing of one piece affect the others? Amazon Kinesis dumping to a final S3 output might fit this, depending on the nature of the problem. – Daenyth Jun 20 '14 at 00:51
  • @ronag: Erm... what do you mean, SSD's are "too small"? – DevSolar Jun 20 '14 at 12:38
  • @DevSolar: The largest available SSD is 1 TB, which is both too small and too expensive. I need at least 2 TB. – ronag Jun 20 '14 at 13:18
  • @ronag: Ah... I was somehow fixed on the thought of "small" --> form factor. ;-) – DevSolar Jun 20 '14 at 14:12
  • Is there room for a second hard drive? This would double the output rate and after the capture is completed, you could merge the files. – rcgldr Jun 20 '14 at 16:49
  • @ronag Are you sure that spending days or weeks of programmer time to concoct a write-balancing solution that hasn't even been proven is cheaper than working with SSDs? You could even just use more hard disks and only utilize say 70% of the space (the portion that can keep up with the write rate) instead of the full disk. – Mark B Jun 20 '14 at 17:07
  • @MarkB: Valid point. However, it is a specific requirement that we must use a single disk with at least 2 TB of space. I could look into buying 2-4 TB disks, but there is no available information as to when and by how much the write speed degrades in relation to free space. – ronag Jun 20 '14 at 19:08
  • Almost all hard drives scan all the way across a platter before switching heads. This is because the tracks on the upper and lower surfaces of a platter are not aligned, so the drive has to perform a more complex seek when switching platter. While on a platter, a track to track step is optimized with the sectors skewed so the first sector of the next track is just before the read head after a track to track step. – rcgldr Jun 21 '14 at 01:05
  • To reduce the overhead of random access, use large I/Os. Assume a random access takes about 10 ms and the average I/O rate is 100 MB/s. If I/O is done 10 MB at a time, that will take 100 ms, so the seek would be a 10% overhead. If I/O is done 100 MB at a time, the seek overhead is almost insignificant. Using large I/Os is part of the strategy for a k-way merge sort with a single hard drive, where k can be 12 or more. – rcgldr Jun 21 '14 at 01:08
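
As a rough illustration of the arithmetic in that comment (the 10 ms seek time and 100 MB/s sequential rate are the comment's assumed figures, not measurements):

```cpp
// Seek overhead relative to transfer time for different write sizes.
// The seek time and sequential speed are assumptions from the comment above.
#include <cstdio>
#include <initializer_list>

int main() {
    const double seek_s = 0.010;    // assumed seek + rotational latency per access
    const double seq_mbps = 100.0;  // assumed sequential write speed

    for (double io_mb : {1.0, 10.0, 100.0}) {
        double transfer_s = io_mb / seq_mbps;
        double overhead = seek_s / transfer_s;  // seek time vs. transfer time
        std::printf("%6.0f MB per write: seek overhead %.1f%%\n",
                    io_mb, overhead * 100.0);
    }
}
```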

6 Answers

7

If I had to do this on a regular Windows system, I would use a device with a higher average write speed to give me more headroom. Expecting 100MB/s average write speed over the entire disk that is rated for 120MB/s is going to cause you trouble. Spinning hard disks don't have a constant write speed over the whole disk.

The usual solution to this problem is to buffer in RAM to cover up infrequent slow downs. The more RAM you use as a buffer, the longer the span of slowness you can handle. These are tradeoffs you have to make. If your problem is the known slowdown on the inside sectors of a rotating disk, then your device just isn't fast enough.
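
As an illustration of that buffering approach (not code from a real capture system; the chunk size, queue depth, and output file name are arbitrary placeholders), here is a minimal producer/consumer sketch where the capture thread pushes fixed-size chunks into a bounded RAM queue and a writer thread drains them, so short disk slowdowns grow the queue instead of losing data:

```cpp
// Minimal producer/consumer buffering sketch. Chunk size, queue depth, and the
// output file name are placeholders, not values from the question.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

using Chunk = std::vector<char>;

class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t max_chunks) : max_chunks_(max_chunks) {}

    // Returns false when the buffer is full, i.e. data would have to be dropped.
    bool try_push(Chunk c) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.size() >= max_chunks_) return false;
        q_.push_back(std::move(c));
        cv_.notify_one();
        return true;
    }

    Chunk pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Chunk c = std::move(q_.front());
        q_.pop_front();
        return c;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Chunk> q_;
    std::size_t max_chunks_;
};

int main() {
    const std::size_t chunk_bytes = 8u << 20;  // 8 MB per chunk
    BoundedQueue buffer(512);                  // ~4 GB of RAM buffering

    std::FILE* out = std::fopen("capture.bin", "wb");  // placeholder output
    if (!out) return 1;

    // Writer thread: drains the queue as fast as the disk allows.
    std::thread writer([&] {
        for (;;) {
            Chunk c = buffer.pop();
            std::fwrite(c.data(), 1, c.size(), out);
        }
    });

    // Capture loop: stand-in for the real 100 MB/s source (runs until killed).
    for (;;) {
        Chunk c(chunk_bytes);
        if (!buffer.try_push(std::move(c)))
            std::fprintf(stderr, "buffer full: data would be dropped\n");
        std::this_thread::sleep_for(std::chrono::milliseconds(80));  // ~100 MB/s pacing
    }
}
```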

Another thing that might help is to access the disk as directly as possible and ensure it isn't being shared by other parts of the system. Use a separate physical device, don't format it with a filesystem, and write directly to the partitioned space. Yes, you'll have to deal with some of the issues a filesystem solves for you, but you also skip a bunch of code you can't control. Even then, your app could run into scheduling issues with Windows. Windows is not an RTOS; there are no guarantees as far as timing. Again, this would help more with temporary slowdowns from filesystem cleanup, flushing dirty pages, etc. It probably won't help much with the "last 100GB writes at 80MB/s" problem.

If you really are stuck with a disk that goes from 120MB/s -> 80MB/s outside-to-inside (you should test with your own code and not trust the specs from the manufacturer so you know what you're dealing with), then you're going to have to play partitioning games like others have suggested. On a mechanical disk, that will introduce some serious head seeking, which may eat up your improvement. To minimize seeks, it would be even more important to ensure it's a dedicated disk the OS isn't using for anything else. Also, use large buffers and write many megabytes at a time before seeking to the end of the disk. Instead of partitioning, you could write directly to the block device and control which blocks you write to. I don't know how to do this in Windows.

To solve this on Linux, I would be tempted to test mdadm's raid0 across two partitions on the same drive and see if that works. If so, then the work is done and you don't have to write and test some complicated write mechanism.

kbyrd
  • Buffering will not work as I will run out of RAM since I am constantly writing. If the disk writes 80MB/s for the last 100GB, there is no way it will work. – ronag Jun 20 '14 at 12:10
  • Avoiding the file system is an interesting idea. Will look into it. – ronag Jun 20 '14 at 12:11
  • My suggestion about buffering needs to be read AFTER reading my first paragraph about needing more headroom. – kbyrd Jun 20 '14 at 13:08
  • Regarding avoiding the filesystem, this might help a bit, but mostly I believe it will help avoid temporary hiccups, not the slowdown caused by writing to inner vs outer tracks. If you are sure (you tested it yourself with your code, not just trusted the manufacturer's specs) that you can get sustained writes of 120MB/s on the outer portion of a disk for minutes at a time, then avoiding the filesystem isn't likely to help. I'll update my answer. – kbyrd Jun 20 '14 at 13:09
  • My problem is not infrequent slow downs. It is a constant slow down, the more is written to the disk the slower it gets until after about 70% is written it no longer keeps up with the source. Buffering will only help, as you write, for temporary slowdowns, which are not the problem here. – ronag Jun 20 '14 at 13:20
  • raid0 across the same drive is interesting. Cool idea. – ronag Jun 20 '14 at 13:21
5

Partition the disk into two equally sized partitions. Write a few seconds worth of data alternating between the partitions. That way you get almost all of the usual sequential speed, nicely averaged. One disk seek every few seconds eats up almost no time. One seek per second reduces the usable time from 1000ms to ~990ms which is a ~1% reduction in throughput. The more RAM you can dedicate to buffering the less you have to seek.

Use more partitions to increase the averaging effect.
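
A minimal sketch of the two-partition version, assuming both partitions are already formatted and mounted as separate drives (the paths and chunk size are placeholders); it just rotates between them every few seconds' worth of data:

```cpp
// Alternate large sequential writes between two output files, one per partition.
// File paths and the chunk size are hypothetical placeholders.
#include <cstdio>
#include <vector>

int main() {
    std::FILE* out[2] = {
        std::fopen("D:/capture_outer.bin", "wb"),  // partition on the fast (outer) half
        std::fopen("E:/capture_inner.bin", "wb"),  // partition on the slow (inner) half
    };
    if (!out[0] || !out[1]) return 1;

    const std::size_t chunk_bytes = 256u << 20;    // ~2.5 s of data at 100 MB/s
    std::vector<char> chunk(chunk_bytes);
    int target = 0;

    for (;;) {
        // Fill `chunk` from the real-time source here (stubbed out).
        if (std::fwrite(chunk.data(), 1, chunk.size(), out[target]) != chunk.size())
            break;                                 // disk full or write error
        target ^= 1;                               // switch partitions: one seek per chunk
    }

    std::fclose(out[0]);
    std::fclose(out[1]);
}
```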

usr
  • Why was this downvoted? This is actually a pretty creative possible solution. It at least deserved a comment explaining the downvote. – rjp Jun 19 '14 at 20:23
  • The seek from one partition to the other and back will easily eat up the benefit. (I know this statement is just as unfounded as usr's statement that his solution might work.) Not my downvote though. – DevSolar Jun 19 '14 at 20:47
  • @DevSolar if you have one seek every few seconds it will eat up almost nothing. I have clarified. I understand the downvote now. The number seemed made up. – usr Jun 19 '14 at 21:26
  • This might be a good idea. Though, how will the partitions be mapped on the disk? – ronag Jun 20 '14 at 12:14
  • @ronag you can decide the exact partitioning layout when creating them. – usr Jun 20 '14 at 12:15
  • Yes, but how does the partitioning layout map to the actual platters? – ronag Jun 20 '14 at 12:16
  • If it uses 2 platters for the first partition and 2 platters for the second (instead of splitting the partitions within the platters), I am still left with the same problem. – ronag Jun 20 '14 at 12:16
  • The platters in the physical disk are not important. From benchmark tools I know that speed decreases monotonically from sector 0 to the last sector. Create one partition from 0 to N/2 and the 2nd partition to fill the remaining space. That gives you the speed profile that you need. (Again: More partitions will smooth out performance even more. So make your program be able to deal with an arbitrary number of output files just in case.) – usr Jun 20 '14 at 12:20
  • Btw, you can use the Windows defragmentation APIs to move files to different parts of the disk. Or, you can create a single file that spans the whole disk. That way you might be able to avoid using partitions entirely. Your choice. Do what's less work. – usr Jun 20 '14 at 12:23
  • @usr - ronag is correct. Almost all hard drives will scan across a surface before going to the next surface. – rcgldr Jun 21 '14 at 09:04
  • @rcgldr not sure what you mean. What the hard disk does doesn't matter. Only perceived performance does. And that decreases monotonically without jumps. You can confirm this with any benchmarking software that shows performance over disk position (HD Tune does).; Also, why would the disk scan platters one by one? I'd expect it to "RAID" the platters to increase speed and scan them at the same time. The heads are all at the same position anyway. – usr Jun 21 '14 at 09:10
  • @usr hard drives scan surfaces one at a time because the formatting process is done on each surface independently and the tracks of a cylinder are not aligned vertically. When switching surfaces, a seek of several tracks has to be done. The hard drives are optimized to step one track inwards or outwards at a time, with the sectors staggered (skewed) so that the first sector of a track just after a track step will be just before the read / write head. – rcgldr Jun 21 '14 at 09:59
  • @rcgldr that is interesting to know. This should indeed cause tiny jumps in throughput. So tiny that I have not ever noticed them or considered them to be noise/concurrent activity. Shouldn't matter for the OP. – usr Jun 21 '14 at 10:02
5

I fear this may be more difficult than you realize:

  • If your average 120 MB/s write speed is the manufacturer's value then it is most likely "optimistic" at best.
  • Even a benchmarked write speed is usually done on a non-partitioned/formatted drive and will be higher than what you'd typically see in actual use (how much higher is a good question).
  • A more important value is the drive's minimum write speed. For example, from Tom's Hardware 2013 HDD Benchmarks a drive with a 120 MB/s average has a 76 MB/s minimum.
  • A drive that is being used by other applications at the same time (e.g., Windows) will have a much lower write speed.
  • An even more important value is the drive's actual measured performance. I would make a simple application similar to your use case that writes data to the drive as fast as possible until it fills the drive. Do this a few (dozen) times to get a more realistic average/minimum/maximum write speed value... it will likely be lower than you'd expect.
  • As you noted, even if your "real" average write speed is higher than 100 MB/s, you will still have issues if you hit slow write speeds just before the disk fills up, assuming you don't have somewhere else to write the data to. Using a buffer doesn't help in this case.
  • I'm not sure if you can actually specify a physical location to write to on the hard drive these days without getting into the drive's firmware. Even if you could this would be my last choice for a solution.

A few specific things I would look at to solve your problem:

  • Measure the "real" write performance of the drive to see if it's fast enough (see the benchmarking sketch after this list). This gives you an idea of how far behind you actually are.
  • Put the OS on a separate drive to ensure the data drive is not being used by anything other than your application.
  • Get faster drives (either HDD or SSD). It is fine to use the manufacturer's write speeds as an initial guide, but test them thoroughly as well.
  • Get more drives and put them into a RAID0 (or similar) configuration for faster write access. You'll again want to actually test this to confirm it works for you.
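
For the measurement point above, a minimal benchmarking sketch (the output path and chunk size are arbitrary placeholders); it fills the drive with large sequential writes and prints the throughput of each chunk, so you can see how the speed changes with disk position. Note the OS page cache still sits in between; for stricter numbers use unbuffered/direct I/O:

```cpp
// Fill the target drive with large sequential writes and log per-chunk throughput.
// The output path and chunk size are placeholders; run this on an otherwise idle
// data drive.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    std::FILE* out = std::fopen("D:/fill_test.bin", "wb");  // hypothetical data drive
    if (!out) return 1;

    const std::size_t chunk_bytes = 1u << 30;                // 1 GiB per measurement
    std::vector<char> chunk(chunk_bytes, 0);

    for (std::size_t i = 0; ; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::size_t written = std::fwrite(chunk.data(), 1, chunk.size(), out);
        std::fflush(out);                                    // push data to the OS
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("chunk %zu: %.1f MB/s\n", i, written / 1e6 / secs);

        if (written != chunk.size()) break;                  // disk full or error
    }
    std::fclose(out);
}
```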
uesp
3

You could implement the strategy of alternating writes between the inside and the outside by directly controlling the disk write locations. Under Windows you can open a disk like "\\.\PhysicalDriveX" and control where it writes. For more info see

http://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
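
A hedged sketch of what that looks like (the drive number and offset are placeholders; administrator rights are required, unbuffered I/O needs sector-aligned offsets and buffers, and on newer Windows versions regions covered by a mounted volume must be locked or dismounted first):

```cpp
// Write directly to a physical drive at a chosen byte offset (Windows).
// PhysicalDrive1 and the offset are placeholders; this bypasses any filesystem
// and will overwrite whatever is on the target drive.
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE h = CreateFileW(L"\\\\.\\PhysicalDrive1",
                           GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           nullptr, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           nullptr);
    if (h == INVALID_HANDLE_VALUE) {
        std::fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    // With FILE_FLAG_NO_BUFFERING, both the buffer and the file offset must be
    // multiples of the sector size. VirtualAlloc returns page-aligned memory.
    const DWORD buf_bytes = 16u << 20;  // 16 MB per write
    void* buf = VirtualAlloc(nullptr, buf_bytes, MEM_COMMIT, PAGE_READWRITE);
    if (!buf) { CloseHandle(h); return 1; }

    LARGE_INTEGER offset;
    offset.QuadPart = 0;                // e.g. the start of the fast (outer) region
    SetFilePointerEx(h, offset, nullptr, FILE_BEGIN);

    DWORD written = 0;
    if (!WriteFile(h, buf, buf_bytes, &written, nullptr))
        std::fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());

    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
}
```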

ScottMcP-MVP
3

First of all, I hope you are using raw disks and not a filesystem. If you're using a filesystem, you must:

  1. Create an empty, non-sparse file that's as large as the filesystem can hold.

  2. Obtain a mapping from the logical file positions to disk blocks (see the sketch after this list for one way to query this on Windows).

  3. Reverse this mapping, so that you can map from disk blocks to logical file positions. Of course, some blocks will be unavailable due to the filesystem's own use.
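
One way to obtain the mapping in step 2 on Windows is FSCTL_GET_RETRIEVAL_POINTERS; a minimal sketch (the file path is a placeholder, and a full implementation would keep querying until all extents have been returned):

```cpp
// Query which disk clusters (LCNs) back a file's logical clusters (VCNs) on NTFS.
// The file path is a placeholder; real code would loop when the extent buffer
// is too small (ERROR_MORE_DATA) by restarting from the last NextVcn.
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int main() {
    HANDLE h = CreateFileW(L"D:\\capture.bin", GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                           OPEN_EXISTING, 0, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 1;

    STARTING_VCN_INPUT_BUFFER in = {};
    in.StartingVcn.QuadPart = 0;

    BYTE outBuf[4096] = {};   // room for a handful of extents
    DWORD bytes = 0;
    if (DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS,
                        &in, sizeof(in), outBuf, sizeof(outBuf), &bytes, nullptr)) {
        auto* rp = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(outBuf);
        LONGLONG vcn = rp->StartingVcn.QuadPart;
        for (DWORD i = 0; i < rp->ExtentCount; ++i) {
            std::printf("VCN %lld..%lld -> LCN %lld\n",
                        vcn, rp->Extents[i].NextVcn.QuadPart - 1,
                        rp->Extents[i].Lcn.QuadPart);
            vcn = rp->Extents[i].NextVcn.QuadPart;
        }
    }
    CloseHandle(h);
}
```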

At this point, the disk looks like a raw disk that you access by disk block. It's a valid assumption that this block addressing is mostly monotonic with respect to the physical cylinder number. IOW, if you increase the disk block number, the cylinder number will never decrease (or never increase -- depending on the drive's LBA-to-physical mapping order).

Also, note that a disk's average write speed may be given per cylinder or per unit of storage. How would you know? You need the latter number, and the only sure way to get it is to benchmark it yourself. You need to fill the entire disk with data, by repeatedly writing a zero page to the disk, going block by block, and divide the total amount of data written by the time it took. You need to be accessing the disk or the file in direct mode. This should disable the OS buffering for the file data, but not for the filesystem metadata (if not using a raw disk).

At this point, all you need to do is to write data blocks of sensible sizes at the two extremes of the block numbers: you need to fill the disk from both ends inwards. The size of the data blocks depends on the bandwidth wastage you can allow for seeks. You should also assume that the hard drive might seek once in a while to update its housekeeping data. Assuming a worst-case seek taking 15ms, you waste 1.5% of per-second bandwidth for each seek. Assuming you can spare no more than 5% of bandwidth, with 1 seek/s on average for the drive itself, you can seek twice per second. Thus your block size needs to be your_bandwidth_per_second/2. This bandwidth is not the disk bandwidth, but the bandwidth of your data source.

Alas, if only things were this easy. It generally turns out that the bandwidth at the middle of the disk is not the average bandwidth. During your benchmark you must also take note of the write speed over smaller sections of the disk, say every 1% of the disk. This way, when writing into each section of the disk, you can figure out how to split the data between the "low" and the "high" section that you're writing to. Suppose that you're starting out at the 0% and 99% positions on the disk, and the low position has a bandwidth of mean*1.5, and the high position has a bandwidth of mean*0.8, where mean is your desired mean bandwidth. You'll then need to write 100% * 1.5/(0.8+1.5) of the data into the low position, and the remainder (100% * 0.8/(0.8+1.5)) into the slower high position.
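
As a small worked instance of that split (the 1.5 and 0.8 factors are the example figures from the paragraph above, not measurements):

```cpp
// Split incoming data between the current "low" and "high" sections so that both
// write positions consume the same amount of time. The relative bandwidths below
// are the example figures from the text, not measured values.
#include <cstdio>

int main() {
    const double low_rel  = 1.5;  // low section writes at 1.5x the desired mean
    const double high_rel = 0.8;  // high section writes at 0.8x the desired mean

    const double to_low  = low_rel  / (low_rel + high_rel);
    const double to_high = high_rel / (low_rel + high_rel);

    std::printf("write %.1f%% of each chunk to the low section, %.1f%% to the high\n",
                to_low * 100.0, to_high * 100.0);
}
```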

The size of your buffer needs to be larger than just the block size, since you must assume some worst-case latency for the hard drive if it hits bad blocks and needs to relocate data, etc. I'd say a 3 second buffer may be reasonable. Optionally it can grow by itself if the latencies you measure while your software runs turn out to be higher. This buffer must be locked ("pinned") to physical memory so that it's not subject to swapping.
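
On Windows, one way to pin such a buffer is VirtualLock; a minimal sketch (the buffer size is a placeholder, and the process working-set quota usually has to be raised first for a lock this large to succeed):

```cpp
// Pin the staging buffer in physical RAM so it cannot be paged out (Windows).
// The buffer size is a placeholder, roughly 5 s of data at 100 MB/s.
#include <windows.h>
#include <cstdio>

int main() {
    const SIZE_T buf_bytes = 512u << 20;

    // Raise the working-set quota so VirtualLock can succeed for a buffer this large.
    SetProcessWorkingSetSize(GetCurrentProcess(),
                             buf_bytes + (64u << 20),
                             buf_bytes + (128u << 20));

    void* buf = VirtualAlloc(nullptr, buf_bytes, MEM_COMMIT, PAGE_READWRITE);
    if (!buf || !VirtualLock(buf, buf_bytes)) {
        std::fprintf(stderr, "failed to allocate/lock buffer: %lu\n", GetLastError());
        return 1;
    }

    // ... use `buf` as the capture staging buffer here ...

    VirtualUnlock(buf, buf_bytes);
    VirtualFree(buf, 0, MEM_RELEASE);
}
```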

Kuba hasn't forgotten Monica
1

Another possible option is to destroke (or short-stroke) the hard drive. If you start with a 4 TB or larger drive and destroke it to 2 TB, only the outer portions of the platters will be used, resulting in a faster throughput rate. The issue would be getting software that can issue the vendor-unique commands needed to destroke the drive.

rcgldr