
Let's assume that on a hard drive I have a very large data file consisting of a sequence of characters:

ABRDZ....

My question is as follows: if the head is positioned at the beginning of the file, and I need 5 characters at every 1000-position interval, would it be better to do a seek (since I know where to look), or simply to use a large buffer that reads sequentially and then do the job in memory?
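In code, the two options would look roughly like this (a Python sketch; the file path and function names are placeholders I made up for illustration):

    import os

    RECORD = 5     # bytes needed at each position
    STEP = 1000    # interval between record starts

    # Option A: seek to each record individually
    def read_with_seeks(path):
        records = []
        size = os.path.getsize(path)
        with open(path, "rb") as f:
            for offset in range(0, size, STEP):
                f.seek(offset)                 # jump straight to the record
                records.append(f.read(RECORD))
        return records

    # Option B: one big sequential read, then slice in memory
    def read_sequentially(path):
        with open(path, "rb") as f:
            data = f.read()   # whole file into one buffer (only if it fits in RAM)
        return [data[i:i + RECORD] for i in range(0, len(data), STEP)]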

Naively, I'd have answered that reading 'A' and then seeking to read 'V' is faster than reading the whole file up to, say, position 200 (the position of 'V'). OK, this is just an example, since the smallest I/O is 512 bytes.

Edit: my previous naive self-answer is partly justified by the following case: given a 100 GB file, I need the first and the last characters; here I obviously would do a seek .... right?
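(In Python that would be something like the following; the file name is hypothetical:)

    import os

    with open("huge.bin", "rb") as f:    # hypothetical 100 GB file
        first = f.read(1)                # read the first character
        f.seek(-1, os.SEEK_END)          # jump straight to the end
        last = f.read(1)                 # read the last character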

Maybe there is a trade-off between how "long" the seek is vs. how much data there is to retrieve?

Can someone clarify this to me?

DED
  • Huge assumption that the file is and will remain contiguous! – Tony Hopkinson Jun 11 '12 at 15:40
  • True, but there should be ways to ensure that, shouldn't there? Even more, fragmentation would hurt sequential reads more than it would hurt a seek. – DED Jun 11 '12 at 16:34
  • Ensuring contiguity isn't free. Modeling fragmented files is less simple. I would have thought it had pretty much the same impact on serial read and seek. Horribly so with an interval that was a block or greater. – Tony Hopkinson Jun 11 '12 at 20:59
  • I had the same question. I need to store an often-accessed piece of data in a very large file. It is more convenient for me to store it at the end of the file. I wanted to consider the effects on performance: would seeking to near the end of the file affect performance much? Given that on a 4K-block fragmented file, the file system somehow needs to read and navigate the linked list of blocks to get to the sought location! Would the seek time be equivalent to read time? Is the list of allocated blocks stored somewhere else in a contiguous section, which would improve seek time? – Philibert Perusse Jan 15 '13 at 16:45
  • @PhilibertPerusse I'd say: put your frequently accessed data at the beginning of the file, load it into memory, and keep it there. – DED Jan 29 '13 at 09:45

1 Answer


[UPDATE] Generally, from your original numbers of 5 out of every 1000 (I'll assume that the 5 bytes are part of the 1000, making your step count 1000): if your step count is less than 2x your block size, then my original answer is a pretty good explanation. It does get a bit trickier once you get past 2x your HD block size, because at that point you would easily be wasting read time, when you could be speeding things up by seeking past unused (or, for that matter, unnecessary) HD blocks.
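To make that cutoff concrete, here is a small Python sketch (the 4 KB block size is an assumption, and blocks_touched is a made-up helper) that counts how many distinct disk blocks actually contain requested bytes versus how many a pure sequential read would pull in:

    BLOCK = 4096   # assumed HD/filesystem block size

    def blocks_touched(step, n_records, record=5):
        # distinct blocks that contain at least one requested byte
        needed = set()
        for i in range(n_records):
            start = i * step
            for b in range(start // BLOCK, (start + record - 1) // BLOCK + 1):
                needed.add(b)
        # blocks a sequential read would cover over the same span
        total = ((n_records - 1) * step + record + BLOCK - 1) // BLOCK
        return len(needed), total

    print(blocks_touched(1000, 100))    # (25, 25): step < block, nothing is wasted
    print(blocks_touched(16384, 100))   # (100, 397): step = 4 blocks, ~3 of 4 blocks wasted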

[ORIGINAL] Well, this is an extremely interesting question, with what I believe to be an equally interesting answer (also somewhat complex). I think that it actually comes down to a couple of other questions, like how big the block size is on your drive (or the drive your software is going to run on). If your block size is 4KB, then the (true) minimum your hard drive will get for you at a time is 4096 bytes. In your case, if you truly need 5 chars every 1000, and you did this with ALL disk IO, then you would essentially be re-reading the same block 4 times, with 3 seeks in between (REALLY NOT EFFICIENT).

My personal belief is that you could (if you wanted to be drive-efficient) try, in your code, to find out the block size of the drive you are using, and then use that number to decide how many bytes at a time to bring into RAM. This way you wouldn't have to have a HUGE RAM buffer, but at the same time you wouldn't really have to SEEK, nor would you be wasting (or performing) any extra reads.
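Something along these lines (a hypothetical Python sketch; the chunk size is an assumed multiple of a 4 KB block, and read_chunked is a name I made up):

    BLOCK = 4096            # assumed block size; query the real value if you can
    CHUNK = 256 * BLOCK     # 1 MB of whole blocks per read: modest buffer, still sequential
    STEP, RECORD = 1000, 5

    def read_chunked(path):
        records = []
        next_off = 0                  # absolute offset of the next record
        base = 0                      # absolute offset of buf[0]
        buf = b""
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                buf += chunk
                # emit every record fully contained in the current buffer
                while next_off + RECORD <= base + len(buf):
                    i = next_off - base
                    records.append(buf[i:i + RECORD])
                    next_off += STEP
                # drop consumed bytes; keep any partial record for the next chunk
                keep_from = min(next_off - base, len(buf))
                base += keep_from
                buf = buf[keep_from:]
        return records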

IS THIS THE MOST EFFICIENT? I don't think it is the most efficient, but it may be good enough for the performance you need, who knows. I do think that even if the read head is where you want it to be, performing algorithmic work in the middle of each block read, rather than reading the whole file all at once, will lose you time waiting for the next rotation of the drive platters. Whereas, if you were to read it all at once, the drive should be able to perform a sequential read of all parts of the file in one pass. Again, it's not that simple though: if your file truly spans more than one block on a rotational drive, you may suffer IF your drive has not been defragmented, as it may have to perform random seeks just to get to the next block.

Sorry for the long-winded answer, but as per usual, there is no simple answer in your case.

I do think that overall performance would PROBABLY be better if you simply read the whole file at once. There is no way to guarantee this, as each system is going to have inherently different drive parameters, etc...

trumpetlicks
  • Aha! Thank you, you touched right on my issue with "if your step count is less than 2x your block size". This looks like a criterion for when seeking is better. Do you have a reference for that? – DED Jun 11 '12 at 16:31
  • Unfortunately, I don't have a reference for that; it's from my own experience :-) Sorry.... – trumpetlicks Jun 11 '12 at 18:42