
I have a big file (several gigabytes), and I want to update a small section in it (overwrite some bytes with a new value). This must be done atomically (either the operation succeeds, or the file is left unchanged). How can I do that?

The purpose is to store progress information in a file that takes a lot of time to generate/upload (it can be on a remote file system). There will probably be times when I need to write at different locations in the file (and commit all changes at once), but if needed I can rewrite the whole index, which is a contiguous block and relatively small compared to the rest of the file. Only one process and thread writes to the file at any given time.

youen
  • You're planning on using node.js and expecting atomic updates to remote filesystems, up to and including "write at different locations in the file (and commit all changes at once)"? Even for local filesystems, node.js isn't able to provide [those kinds of guarantees](https://en.wikipedia.org/wiki/ACID_(computer_science)). – Andrew Henle Aug 27 '18 at 11:13
  • Can you do all reads before all writes? – m1ch4ls Aug 27 '18 at 17:38

2 Answers


Normal disks are not transactional, and don't provide atomicity guarantees. If the underlying file system doesn't provide atomic writes (and most of them don't), then you'll need to create atomicity in your own application/data structure. This could be done via journaling (as many file systems and databases do), copy-on-write techniques, etc.

On Windows, Transactional NTFS (TxF) does exactly what you need, but your application will need to explicitly use the Win32 transactional file I/O APIs to do that. Note that Microsoft has since deprecated TxF and discourages its use in new code.

M.A. Hanin
  • Indeed, I've come to the conclusion that writing my own transaction log (journaling system) is the best solution for my case. I've opted for a fixed-size generic update log, coupled with two integrity hash codes: one for the actual data, and one for the transaction log. The file is valid if at least one hash is (which is guaranteed by the fact that I finish writing one before writing the other). If the actual data does not pass the hash check, I read the log to restore the previous version (which must then pass the hash check). I accept your answer since the other suggestions were "only" comments. – youen Aug 31 '18 at 12:42

I think a simple lockfile should be enough...

For example, with proper-lockfile:

const lockfile = require('proper-lockfile');

// Acquire the lock, do the work, then always release it.
lockfile.lock('some/file')
  .then(() => doStuff())
  .finally(() => lockfile.unlock('some/file'));

Note that any logic working with some/file has to respect the lockfile.

m1ch4ls
  • Thanks for the answer. I don't have a locking problem, I have an atomicity problem. I want to prevent my file from becoming corrupt when I update it (each atomic update must result in a valid file). Potential problems would be a crash (of the process or the whole server), a reboot, network issue for remote file systems, etc. On the other hand, I *don't* have concurrent processes trying to read or write to the file. – youen Aug 25 '18 at 17:34
  • I see, this is more complicated then... I don't know how to ensure this on the system level - I don't think it's possible. You will probably have to write [transaction log](https://en.wikipedia.org/wiki/Transaction_log) and implement rollbacks yourself. I'll leave the answer here as it is and I'll think about the issue some more - if I come up with something, I'll post an update. – m1ch4ls Aug 25 '18 at 17:43
  • Thanks for the transaction log pointer; it does indeed seem to be a way to solve this problem, but maybe too complicated in my case. In the end maybe I'll drop the idea of a single file, put my progress data in a separate file, and overwrite it completely when needed (with an atomic file move). – youen Aug 25 '18 at 20:13
  • (Write to temp file + rename) is the easiest way of doing this. You can use a reflinked file for the temporary copy that you want to update, if your file system supports reflinks (XFS/BTRFS). That can save you some disk space. – itisravi Aug 27 '18 at 04:31
  • Thanks for the reflink pointer. I did know about copy-on-write, and that it could solve my problem, but didn't search yet if that was possible. At this time I'm using sshfs to access the file, but I'd like my solution to stay as generic as possible, so I'm not exploring this possibility right now. I've started to work on a simple transaction log system, I think it'll be more portable. – youen Aug 31 '18 at 12:35