2

For example, if I have lots of data entries stored in a file, each with different sizes, and I have 1000 entries which makes the file around 100 MB large, and I then want to remove an entry in the middle of the file which is 50 KB in size, how can I remove those 50 KB of bytes from the file without moving all the bytes after it up to fill the gap?

I am using WinAPI functions such as these for the file management:

CreateFile, WriteFile, ReadFile and SetFilePointerEx

Kaije
  • 2,631
  • 6
  • 38
  • 40
    Your question doesn't make sense; you're asking how to remove some data in a file without making it shorter. – maerics Aug 22 '11 at 13:38
  • I do want to make it shorter; I mean, how can I do it without rewriting the file every time I delete an entry? E.g. in Access, if you have a massive database of, say, 1 GB, and all you do is edit/delete one entry and save it, it doesn't take half an hour to rewrite all the entries back to the file. – Kaije Aug 22 '11 at 13:43
  • There is no way to do what you asked for without rewriting everything past that point ("moving it up"). – R.. GitHub STOP HELPING ICE Aug 22 '11 at 13:46
  • Normal file systems do not provide insertion or deletion in the middle of files. (You'd have to use record-based file systems, typically only found on mainframes, to do something like that, or build your own on top of normal files.) – nos Aug 22 '11 at 14:02

3 Answers

7

If you really want to do that, set a flag in your entry. When you want to remove an entry from your file, simply invalidate that flag (a logical removal) without deleting it physically. The next time you add an entry, just go through the file, look for the first invalidated entry, and overwrite it; if all entries are valid, append to the end. This takes O(1) time to remove an entry and O(n) to add a new one, assuming that reading/writing a single entry from/to disk is the basic operation.
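Here is a minimal sketch of that idea, assuming fixed-size entries and the WinAPI calls mentioned in the question; Entry, RemoveEntry and AddEntry are made-up names for this example, not an existing API:

    #include <windows.h>

    // One fixed-size record; the first field doubles as the "is this slot used" flag.
    struct Entry {
        BOOL valid;         // FALSE marks the slot as logically removed
        char payload[508];  // fixed-size record body (example size)
    };

    // Overwrite just the flag of entry `index` -- O(1) disk work.
    BOOL RemoveEntry(HANDLE hFile, LONGLONG index)
    {
        LARGE_INTEGER pos;
        pos.QuadPart = index * (LONGLONG)sizeof(Entry);  // flag sits at the start of the record
        if (!SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN)) return FALSE;

        BOOL invalid = FALSE;
        DWORD written = 0;
        return WriteFile(hFile, &invalid, sizeof(invalid), &written, NULL);
    }

    // Scan for the first invalidated slot and overwrite it; append if there is none -- O(n).
    // The caller is expected to pass an entry with e->valid == TRUE.
    BOOL AddEntry(HANDLE hFile, const Entry* e)
    {
        LARGE_INTEGER pos; pos.QuadPart = 0;
        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);

        Entry cur;
        DWORD got = 0, written = 0;
        LONGLONG index = 0;
        while (ReadFile(hFile, &cur, sizeof(cur), &got, NULL) && got == sizeof(cur)) {
            if (!cur.valid) {                            // found a reusable hole
                pos.QuadPart = index * (LONGLONG)sizeof(Entry);
                SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
                return WriteFile(hFile, e, sizeof(*e), &written, NULL);
            }
            ++index;
        }
        pos.QuadPart = 0;                                // no hole found: append at the end
        SetFilePointerEx(hFile, pos, NULL, FILE_END);
        return WriteFile(hFile, e, sizeof(*e), &written, NULL);
    }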

You can even optimize it further. At the beginning of the file, store a bitmap (1 for invalidated). E.g., 0001000... means that the 4th entry in your file is invalidated. When you add an entry, search for the first 1 in the bitmap and use random file I/O (in contrast with sequential file I/O) to move the file pointer directly to that entry and overwrite it. Adding in this way only takes O(1) time.
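A sketch of that header-bitmap variant might look like the following; the bitmap capacity, the entry size and the FindFreeSlot/ReuseSlot helpers are assumptions made for this example:

    #include <windows.h>

    #define MAX_SLOTS    4096                 // capacity covered by the header bitmap
    #define BITMAP_BYTES (MAX_SLOTS / 8)      // 512-byte header, one bit per slot
    #define ENTRY_SIZE   512                  // fixed record size

    // Return the index of the first invalidated (free) slot, or -1 if none.
    LONGLONG FindFreeSlot(const BYTE* bitmap)
    {
        for (DWORD byte = 0; byte < BITMAP_BYTES; ++byte)
            if (bitmap[byte] != 0)                       // at least one 1-bit in this byte
                for (int bit = 0; bit < 8; ++bit)
                    if (bitmap[byte] & (1 << bit))
                        return (LONGLONG)byte * 8 + bit;
        return -1;
    }

    // Overwrite the first invalidated slot with `data` and clear its bit again.
    BOOL ReuseSlot(HANDLE hFile, const void* data)
    {
        BYTE bitmap[BITMAP_BYTES];
        DWORD got = 0, written = 0;
        LARGE_INTEGER pos; pos.QuadPart = 0;

        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
        if (!ReadFile(hFile, bitmap, BITMAP_BYTES, &got, NULL)) return FALSE;

        LONGLONG slot = FindFreeSlot(bitmap);
        if (slot < 0) return FALSE;                      // no hole: caller appends instead

        // Jump straight to the slot (random file I/O) and overwrite it.
        pos.QuadPart = BITMAP_BYTES + slot * ENTRY_SIZE;
        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
        if (!WriteFile(hFile, data, ENTRY_SIZE, &written, NULL)) return FALSE;

        // Mark the slot as used and rewrite just that one header byte.
        bitmap[slot / 8] &= (BYTE)~(1 << (slot % 8));
        pos.QuadPart = slot / 8;
        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
        return WriteFile(hFile, &bitmap[slot / 8], 1, &written, NULL);
    }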

Oh, I notice your comment. If you want to remove the entry physically and still do it efficiently, a simple way is to swap the entry-to-remove with the very last one in your file and then cut the last one off, assuming your entries are not sorted. The time is also good: O(1) for both adding and removing.
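Something along these lines, assuming fixed-size, unsorted entries (RemoveBySwap is a hypothetical helper):

    #include <windows.h>

    // Physically remove entry `index` by overwriting it with the last entry,
    // then truncating the file by one entry.
    BOOL RemoveBySwap(HANDLE hFile, LONGLONG index, DWORD entrySize)
    {
        LARGE_INTEGER size, pos;
        if (!GetFileSizeEx(hFile, &size)) return FALSE;

        LONGLONG count = size.QuadPart / entrySize;
        if (index < 0 || index >= count) return FALSE;

        if (index != count - 1) {
            BYTE* buf = (BYTE*)HeapAlloc(GetProcessHeap(), 0, entrySize);
            if (!buf) return FALSE;
            DWORD got = 0, written = 0;

            // Read the last entry...
            pos.QuadPart = (count - 1) * entrySize;
            SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
            ReadFile(hFile, buf, entrySize, &got, NULL);

            // ...and write it over the entry being removed.
            pos.QuadPart = index * entrySize;
            SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
            WriteFile(hFile, buf, entrySize, &written, NULL);

            HeapFree(GetProcessHeap(), 0, buf);
        }

        // Chop the (now duplicated) last entry off the end of the file.
        pos.QuadPart = (count - 1) * entrySize;
        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
        return SetEndOfFile(hFile);
    }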

Edit: Just as Joe mentioned, this requires that all of your entries have the same size. You can implement a variant with variable-length entries, but that will be more complicated than the scheme discussed here.

Eric Z
  • 14,327
  • 7
  • 45
  • 69
  • That's right. So is the disk seeking time for random file I/O ;) – Eric Z Aug 22 '11 at 14:44
  • How large would that bitmap have to be? Besides, even with this bitmap, insertion would still be O(n). Likely faster than the O(n) for walking the objects themselves, but not O(1). – IInspectable Feb 28 '17 at 11:38
1

Let A = start of file, B = start of block to remove, C = end of block to remove

CreateFile with flag FILE_FLAG_RANDOM_ACCESS

SetFilePointerEx to position C, then read everything up to EOF into a buffer (this may be a large buffer given your file size; be careful with gigantic records, because any file I/O operation now has to allocate a buffer of that size just to do a simple operation such as a move).

Copy buffer to position B in file

The file pointer should now be at position B plus the size of the block you read from C. Call SetEndOfFile to truncate the file at that position, then close it.
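A compact sketch of these steps, assuming the tail of the file (everything after C) fits in a single in-memory buffer; RemoveRange is a hypothetical helper name:

    #include <windows.h>

    // Remove the bytes in [B, C) by copying the tail down and truncating.
    // Reads the whole tail into one buffer, so it assumes that buffer fits in
    // memory (a single ReadFile call also caps the tail at ~4 GB in this sketch).
    BOOL RemoveRange(HANDLE hFile, LONGLONG B, LONGLONG C)
    {
        LARGE_INTEGER size, pos;
        if (!GetFileSizeEx(hFile, &size) || B > C || C > size.QuadPart) return FALSE;

        SIZE_T tailLen = (SIZE_T)(size.QuadPart - C);
        BYTE* tail = (BYTE*)HeapAlloc(GetProcessHeap(), 0, tailLen ? tailLen : 1);
        if (!tail) return FALSE;

        DWORD got = 0, written = 0;

        // SetFilePointerEx to C, read to EOF into the buffer.
        pos.QuadPart = C;
        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
        if (tailLen) ReadFile(hFile, tail, (DWORD)tailLen, &got, NULL);

        // Write the buffer back at position B.
        pos.QuadPart = B;
        SetFilePointerEx(hFile, pos, NULL, FILE_BEGIN);
        if (tailLen) WriteFile(hFile, tail, (DWORD)tailLen, &written, NULL);

        // File pointer is now at B + tail length; truncate there.
        BOOL ok = SetEndOfFile(hFile);
        HeapFree(GetProcessHeap(), 0, tail);
        return ok;
    }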

Note that this could be done far more easily with the memmove function. However, that requires you to map the entire file into memory, do the move, and write it back out. This is fine for small files, but for files larger than 50-100 MB I would caution you about having enough contiguous virtual address space available.
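For completeness, here is roughly what that memmove route looks like with a file mapping; it is only a sketch, it assumes the whole file fits in your available address space, and RemoveRangeMapped is a made-up name:

    #include <windows.h>
    #include <string.h>

    // Remove the bytes in [B, C) via a file mapping and memmove, then shrink the file.
    BOOL RemoveRangeMapped(HANDLE hFile, SIZE_T B, SIZE_T C)
    {
        LARGE_INTEGER size;
        if (!GetFileSizeEx(hFile, &size) || B > C || (LONGLONG)C > size.QuadPart)
            return FALSE;

        // Map the whole file read/write (requires GENERIC_READ | GENERIC_WRITE on hFile).
        HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
        if (!hMap) return FALSE;

        BYTE* view = (BYTE*)MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
        if (!view) { CloseHandle(hMap); return FALSE; }

        // Slide everything after C down over the removed block.
        memmove(view + B, view + C, (SIZE_T)(size.QuadPart - (LONGLONG)C));

        FlushViewOfFile(view, 0);
        UnmapViewOfFile(view);
        CloseHandle(hMap);           // the view must be gone before the file can shrink

        // Truncate the file by the removed amount.
        LARGE_INTEGER newEnd;
        newEnd.QuadPart = size.QuadPart - (LONGLONG)(C - B);
        SetFilePointerEx(hFile, newEnd, NULL, FILE_BEGIN);
        return SetEndOfFile(hFile);
    }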

mattypiper
  • 1,222
  • 8
  • 8
1

You can simply keep flagging the unused space, and when the internal fragmentation exceeds a certain ratio, run a routine that compacts the file. With this scheme removals are fast, but some periodic reorganization is needed. If you have a separate file-handling layer, you can divide the file into chunks, keep track of the free chunks, mark a chunk as unused when deleting, and reuse it on a later insertion. Which scheme fits depends on the type of records in your file, fixed-length or variable-length.
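The bookkeeping for the chunk scheme could be as simple as the following in-memory free list; FreeChunk, FreeList, MarkFree, Allocate, NeedsCompaction and the 25% threshold are just assumptions for illustration:

    #include <windows.h>

    #define MAX_HOLES 1024

    // In-memory bookkeeping only: a free list of {offset, size} holes in the file.
    typedef struct {
        ULONGLONG offset;
        ULONGLONG size;
    } FreeChunk;

    typedef struct {
        FreeChunk holes[MAX_HOLES];
        DWORD     holeCount;
        ULONGLONG fileSize;
    } FreeList;

    // Record a deleted record's region as reusable space.
    void MarkFree(FreeList* fl, ULONGLONG offset, ULONGLONG size)
    {
        if (fl->holeCount < MAX_HOLES) {
            fl->holes[fl->holeCount].offset = offset;
            fl->holes[fl->holeCount].size   = size;
            fl->holeCount++;
        }
    }

    // First-fit search: return an offset to overwrite, or the end of the file to append at.
    ULONGLONG Allocate(FreeList* fl, ULONGLONG size)
    {
        for (DWORD i = 0; i < fl->holeCount; ++i) {
            if (fl->holes[i].size >= size) {
                ULONGLONG off = fl->holes[i].offset;
                fl->holes[i].offset += size;          // shrink the hole
                fl->holes[i].size   -= size;
                return off;
            }
        }
        ULONGLONG off = fl->fileSize;                 // nothing fits: append
        fl->fileSize += size;
        return off;
    }

    // When dead space exceeds (say) 25% of the file, it is time to compact/rewrite it.
    BOOL NeedsCompaction(const FreeList* fl)
    {
        ULONGLONG dead = 0;
        for (DWORD i = 0; i < fl->holeCount; ++i) dead += fl->holes[i].size;
        return fl->fileSize > 0 && dead * 4 > fl->fileSize;
    }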

phoxis
  • 60,131
  • 14
  • 81
  • 117