AWS EBS snapshot: is filesystem consistency REALLY required?

Question

I've been reading a lot about AWS EBS, and many people seem to recommend freezing the filesystem during a snapshot. However, this piece of Amazon documentation begs to differ:

While it is completing, an in-progress snapshot is not affected by ongoing reads and writes to the volume.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-snapshot.html

Why do so many people freeze the filesystem during the snapshot when the AWS documentation says that the snapshot is not affected by I/O?

What if my filesystem is used for an FTP server?

What if my filesystem is used for a database?

John Hanley · Answer 1 · 2018-02-16T19:06:42.057

Amazon snapshots by themselves are not safe if taken while a system is running. They are safe if you shut down the system before creating the snapshot. Any file system data that is cached in the operating system's buffers or in an application's buffers (e.g. databases) will not be part of the snapshot. This can lead to unrecoverable corruption.

Both Linux and Windows have OS-provided mechanisms to freeze the system (inform the applications to flush their data to disk). Once this is complete, a thaw is performed, allowing the applications to continue. In between the freeze and thaw, the snapshot is taken. Note: most applications do not support freeze / thaw and a few implement it wrong. Review your vendor's documentation carefully.

Another important point is to review where your applications store their data. Databases, under best-practice design, store their data, logs, etc. on different file systems. This means that the snapshot of one volume may start at a different time from the snapshot of another volume, even though both may be required as a set to successfully restore the application and its data.
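For the multi-volume case, the EC2 API has a CreateSnapshots call (exposed in boto3 as `create_snapshots`) that requests snapshots of all volumes attached to an instance in one operation. The sketch below is only an illustration: the helper name `snapshot_all_volumes` is made up, and it assumes you pass in a boto3 EC2 client that you have already configured. The resulting set is crash-consistent across volumes, not application-consistent.

```python
def snapshot_all_volumes(ec2_client, instance_id, description="multi-volume snapshot"):
    """Request one snapshot per EBS volume attached to an instance, as a single set.

    ec2_client is assumed to be a boto3 EC2 client (boto3.client("ec2")).
    Returns a dict mapping volume ID -> snapshot ID.
    """
    response = ec2_client.create_snapshots(
        Description=description,
        InstanceSpecification={
            "InstanceId": instance_id,
            "ExcludeBootVolume": False,  # include the root volume in the set
        },
    )
    return {snap["VolumeId"]: snap["SnapshotId"] for snap in response["Snapshots"]}
```

The client is passed in as a parameter so that the function can be exercised with a stub before being pointed at a real account.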

The key is to understand the difference between Crash-Consistent versus Application-Consistent snapshots.

Here is an article on EBS snapshots that explains the differences.

EBS Snapshots: crash-consistent vs. application-consistent

[Update after Michael's comments]

Snapshots implement Copy-on-Write (COW). Once a snapshot has been started, the file system can be modified. If the file system writes to a disk block, the COW subsystem will copy the original block to its internal cache so that the file system can be modified during the snapshot.
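To make the copy-on-write idea concrete, here is a toy in-memory model. It is not how EBS is implemented internally (AWS does not document that); `ToyVolume` and `ToySnapshot` are invented names, and the only point is to show why writes made after the snapshot has started do not change what the snapshot sees.

```python
class ToySnapshot:
    """Point-in-time view of a ToyVolume, filled in lazily via copy-on-write."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> original contents at snapshot time

    def preserve(self, index, original):
        # Keep only the first original; later overwrites must not replace it.
        self.saved.setdefault(index, original)

    def read_all(self):
        # Preserved blocks come from the snapshot, untouched ones from the volume.
        return [self.saved.get(i, block) for i, block in enumerate(self.volume.blocks)]


class ToyVolume:
    """A list of blocks that supports copy-on-write snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def start_snapshot(self):
        snap = ToySnapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy the old block aside for every active snapshot before overwriting it.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


vol = ToyVolume(["a", "b", "c"])
snap = vol.start_snapshot()
vol.write(1, "B")    # happens "during" the snapshot
vol.write(1, "BB")   # second overwrite of the same block
print(vol.blocks)       # ['a', 'BB', 'c']
print(snap.read_all())  # ['a', 'b', 'c']
```

Note that the snapshot only captures what was on the "disk" when it started. Anything still sitting in OS or application buffers at that moment is missing, which is the whole reason for flushing first.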

It is not necessary to keep the file system frozen for the duration of the snapshot. During snapshot creation, the necessary volume data structures are created / copied, so holding the freeze is not necessary. Depending on the system, the amount of data cached in memory, the size of the OS and application journals, etc., the freeze / snapshot / thaw cycle can be as quick as a couple of seconds.
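The freeze / snapshot / thaw cycle on Linux could be sketched roughly as follows, using the `fsfreeze` utility from util-linux (requires root). The function name and the `start_snapshot` callable are hypothetical: `start_snapshot` stands for whatever call you use to begin the snapshot and is expected to return as soon as the snapshot has been started, not when it completes.

```python
import subprocess


def snapshot_with_brief_freeze(mount_point, start_snapshot, run=subprocess.run):
    """Freeze a filesystem, start a snapshot, then thaw immediately.

    start_snapshot: zero-argument callable (an assumption of this sketch)
    that kicks off the snapshot and returns without waiting for completion.
    run: injectable for testing; defaults to subprocess.run.
    """
    # Flush dirty data to disk and block new writes to the filesystem.
    run(["fsfreeze", "--freeze", mount_point], check=True)
    try:
        return start_snapshot()
    finally:
        # Always thaw, even if starting the snapshot raised an exception.
        run(["fsfreeze", "--unfreeze", mount_point], check=True)
```

With boto3 the callable might be something like `lambda: ec2.create_snapshot(VolumeId=volume_id)`, where `ec2` and `volume_id` are your own client and volume. The `try`/`finally` matters: a filesystem left frozen will hang every process that writes to it.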

Here is an article on various snapshot technologies that include an explanation of Copy-on-Write:

Using different types of storage snapshot technologies for data protection

I think the question may be asking why some people seem to believe it is necessary to keep the filesystem frozen for the *duration* of the snapshot process. According to the docs, once a snapshot has started (though it isn't precisely clear whether that means once the initial API request returns or something slightly later), EBS has some kind of internal copy-on-write (?) implementation that allows it to snapshot the data as it stood on disk when the snapshot began... meaning that it is safe to unfreeze immediately after starting the snapshot, no need to wait until the end. — Michael - sqlbot, Feb 16 '18 at 12:42

shodanshok · Accepted Answer · 2018-02-22T21:51:44.777

Short answer: it really depends on the kind of application you are running on your instance.

Long answer: basically, taking an unquiesced snapshot of a running machine is similar to pulling the power plug, i.e. a sudden, immediate, unexpected crash.

When running with I/O barriers enabled, a modern journaled filesystem should remain consistent in spite of any crash. This does not mean that in-memory data will not be lost; rather, it means that committed data are guaranteed to be stored on persistent storage (i.e. disk).

This really applies to any properly journaled application, especially ACID-compliant databases (a non-exhaustive list: MSSQL, InnoDB, PostgreSQL, Oracle, IBM DB2, etc.). Again, this does not mean that a sudden power loss (or a restored, unquiesced snapshot) will not lead to any data loss; rather, it means that when a (possibly implicit) COMMIT returns, any relevant data are on stable storage.

With such journaled applications, you don't strictly need to quiesce the filesystem. On the first boot after restoring a snapshot, the system will replay its journals (filesystem and databases) and a consistent state will be reached.
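As a rough illustration of what replaying a journal means, here is a toy write-ahead log. It is a deliberately simplified model with an invented record format, not the on-disk format of any real filesystem or database. Writes are logged first, and only transactions whose commit record made it to "disk" before the crash are applied on recovery.

```python
def replay_journal(journal):
    """Rebuild state from a journal, applying only committed transactions.

    Records are tuples: ("write", txid, key, value) or ("commit", txid).
    """
    state = {}
    pending = {}  # txid -> writes logged but not yet committed
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "write":
            key, value = record[2], record[3]
            pending.setdefault(txid, []).append((key, value))
        elif kind == "commit":
            # The commit record is on stable storage, so its writes are durable.
            for key, value in pending.pop(txid, []):
                state[key] = value
    # Anything left in pending never committed and is discarded.
    return state


# Journal as found in a snapshot taken mid-transaction (or after a power loss):
journal = [
    ("write", 1, "balance", 100),
    ("commit", 1),
    ("write", 2, "balance", 50),  # transaction 2 never committed
]
print(replay_journal(journal))  # {'balance': 100}
```

Transaction 2 is lost, but the result is consistent: the state reflects exactly the set of transactions that were acknowledged as committed.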

However, there are many applications that do not properly journal their updates, and which require the equivalent of an fsck to return to a consistent state. The main example is MySQL+MyISAM: this (very common) database engine is not ACID-compliant, as its great write speed is achieved by batching unrelated I/O operations with little regard for regular I/O barriers. An improperly shut down MyISAM database (i.e. after a power loss, a system or mysql crash, or an unquiesced snapshot) can be inoperable until a mysqlcheck/mysqlrepair is performed.

The various guides that recommend quiescing the filesystem before a snapshot do so for this exact reason: some "unprepared" applications (read: MyISAM) can be somewhat damaged by the sudden shutdown and subsequent restore, requiring a consistency check.

Bottom line: if you use a journaled filesystem with I/O barriers enabled (the default on ext4 and XFS) and an ACID-compliant database, you should be safe taking unquiesced snapshots. At worst you may see some non-fatal errors or warnings when mounting the snapshot, but journal replay will bring the system to a consistent state. If you use MyISAM, however, it is better to freeze/quiesce your filesystem before taking a snapshot.

