
I have a question that is confusing me: my task is to work out file fragmentation.

stat() for a file:
st_size = 10520
st_blksize = 4096
st_blocks = 24

I have read in some places that st_blksize is the preferred I/O block size of the file system, which in this case is 4096, and the file would fit into 3 such blocks (12288 bytes). st_blocks counts 512-byte units: 10520 / 512 ≈ 20.5, so roughly 3.5 of the 24 allocated units are unused, even though they are allocated. Does this mean that there are 12288 - 10520 = 1768 unused bytes in this file (internal fragmentation)?

As I have mentioned, I have read into this a fair bit and come across a lot of contradictory explanations; I would like someone to clear this up once and for all!

IronMan007
Charlie
  • Short answer: bookkeeping. Files aren't necessarily contiguous so there have to be pointers from block to block. st_blocks includes space allocated for pointer blocks. See inodes and such for the down and dirty. Also allocated space can actually be less than the file size since the file may contain file holes. Hopefully someone has time for a more complete answer or references. – Duck Dec 11 '11 at 02:25
  • ok but am I on the right track to working out file fragmentation or am I way off? – Charlie Dec 11 '11 at 02:34
  • You aren't way off but there is more to it. Sorry I thought this was tagged linux. If so, definitely read some good material on linux file systems, VFS, etc. If you are headed where I suspect it is going to be really easy to trash your file system if you aren't careful. – Duck Dec 11 '11 at 02:55
  • I'm just working on a uni project where we have to recursively list all files in a directory along with the size and path, and then at the end of the output give the % of file fragmentation. It is proving much harder than I thought it would be, but I think I may be starting to understand it more and more – Charlie Dec 11 '11 at 03:01

2 Answers


I don't think your project is really solvable at the stat(2) API layer. Consider the case of a file 4096 bytes long. Presume it was created by iteratively appending 512 byte blocks over and over again. Presume that the filesystem was completely full, except for one 512 byte block, for each and every write. Presume that the 512 byte block available for each write was located in a randomly available spot on the disk.

This file is 100% fragmented -- no two blocks are near each other.

And yet, a measure based solely on the stat(2) variables might well show that there are no wasted blocks anywhere in the file.

When trying to track down an answer to your actual question, I got as far as ext3_write_begin() before being called away -- hope this is a useful starting point for your search.

Update

If you're interested in finding fragmentation, I think the place to start is the bmap command from the debugfs(8) program:

debugfs:  bmap sars_first_radio_show.zip 0
94441752
debugfs:  bmap sars_first_radio_show.zip 1
94441781
debugfs:  bmap sars_first_radio_show.zip 2
94441782
debugfs:  bmap sars_first_radio_show.zip 3
94441783
debugfs:  bmap sars_first_radio_show.zip 4
94441784
debugfs:  bmap sars_first_radio_show.zip 5
94459905
debugfs:  bmap sars_first_radio_show.zip 6
95126019
debugfs:  bmap sars_first_radio_show.zip 7
95126020
debugfs:  bmap sars_first_radio_show.zip 8
95126021
debugfs:  bmap sars_first_radio_show.zip 9
95126022
debugfs:  

This shows the first ten blocks for the file sars_first_radio_show.zip; you can see that the blocks aren't all contiguous: 944417{52,81,82,83,84}, 94459905, 951260{19,20,21,22}.

You could either script an answer around debugfs(8) or you could use the libext2fs library routines yourself. It would be a significant step up in complexity compared to the stat(2) exercises you were going through -- but the answers would mean something, rather than just be a vague guess.

sarnold
  • Ok so I have done more research into this. I have found out that internal fragmentation will only occur in the last block, and its location can be worked out from the inode number (not exactly sure how). Does this sound about right? – Charlie Dec 11 '11 at 18:44
  • The "fragmentation" of a last block would actually be due to [inefficient tail storage](http://en.wikipedia.org/wiki/Block_suballocation) -- something very different from usual fragmentation. But perhaps still interesting to look into -- depending upon how a filesystem is used, it can have either nearly no effect (storage of movies, photos, etc.) or huge effects (one-email-address per file, maildir storage for small emails, etc.) – sarnold Dec 11 '11 at 23:46
  • Ok thanks for the advice, I'm going to look into both now. Here is the spec I'm working from: "3. The next step would be to work out how you can traverse through all of the directories and access each file’s starting i-node. From that point, you can then identify the last block of the file and work out how much space is left within the block." I feel like this is a complete lie :-(! – Charlie Dec 12 '11 at 00:43
  • Well, it's a little bit of a lie ("starting i-node"? each file has exactly _one_ inode) and some extraneous information (you don't need to actually _identify_ the last block of the file to figure out how much of its final block is wasted -- and there is no portable, standard Unix API for you to discover the block numbers anyway). I feel like your best bet is simply to take `st_size % 512`, because all the rest of this feels far outside the range of _usual_ OS courses. – sarnold Dec 12 '11 at 01:37
  • yeah, it's a systems software module and we are using Minix, which doesn't work in the usual way of direct, indirect, double indirect and triple indirect; it uses 7 direct, 2 indirect and one that is unused. The only block that would have internal slack space would be the final block, though. I've now spent about 30 hours googling the same things and getting nowhere! I feel like st_size is the only option but I know there must be another way to do it! – Charlie Dec 12 '11 at 01:42

IIRC, st_blocks is reported in 512-byte units so that various Unix programs (du, for example) work properly. It doesn't necessarily correspond to st_size / 512, nor tell you how the file system actually lays its blocks out on disk. Furthermore, st_blksize only tells higher-level applications what size of reads and writes to send down to the syscalls for optimal performance. Once more, this does not necessarily mean the file system is actually storing things in blocks of that size.

The real answer to your questions regarding file fragmentation will be highly dependent on the FS you're working with. I recommend starting your reading at a lower level.

sehafoc