4

I'm currently developing an upgrade to our media storage (used to store video/audio/metadata) for a surveillance system, and I'm redesigning the recording structure to be a more robust solution.

I need to create some index data for the data stored in the data files, so I'm designing an index file structure, but I'm concerned about hard disk failure. (Imagine the power being cut during a write to the index file: the file will become corrupt, since the data will most likely be only half written.) I have already designed how the index will be stored; my concern is data corruption on power failure or disk failure.

So, does anyone know techniques to avoid data corruption when writing?

I already searched a little and found no good solutions. One solution was to keep a log of everything that is written to the file, but then I would have many more I/Os per second (I'm concerned with the number of I/Os per second as well; the system should perform as few I/O operations as possible).
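Roughly, such a log-then-apply scheme would look like the sketch below (a minimal sketch only; the journal file name and record layout are placeholders of mine, not part of the real design). The extra flushed journal append per update is exactly the I/O cost I'm worried about:

```python
# Minimal "log everything first" sketch (illustrative only; the journal
# name and record layout are placeholders, not part of the real design).
import os
import struct
import zlib

JOURNAL_PATH = 'index.journal'   # hypothetical side file

def log_then_apply(index_path: str, offset: int, payload: bytes) -> None:
    # Build a journal entry: CRC32 + (offset, length) header + payload.
    record = struct.pack('<QI', offset, len(payload)) + payload
    entry = struct.pack('<I', zlib.crc32(record)) + record

    # Extra I/O #1: append the intent to the journal and force it to disk.
    with open(JOURNAL_PATH, 'ab') as j:
        j.write(entry)
        j.flush()
        os.fsync(j.fileno())

    # I/O #2: only now perform the real write in the (pre-existing) index file.
    with open(index_path, 'r+b') as f:
        f.seek(offset)
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
```

On startup you would scan the journal and distrust (or re-apply) any entries whose index write may not have completed, which is essentially what a journaling filesystem does for its own metadata.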

What I came up with was to duplicate the sensitive data in the index file, along with timestamp and checksum fields. For example:

Field1 Field2 Field3 Timestamp Checksum

Field1 Field2 Field3 Timestamp Checksum

So the data is written twice. When I read the file, if the first set of fields is corrupted (the checksum doesn't match), I still have the second set of fields, which should be OK. I believe corruption happens when the write is stopped in the middle; so, for example, if the power fails while the software is writing the first set of fields, the second set is still intact, and if the power fails while the second set is being written, the first one is already intact.
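To make the layout concrete, here is a minimal sketch of the duplicated record (three 8-byte fields, a Unix timestamp and CRC32 are my assumptions, standing in for the real fields):

```python
# Sketch of the "write every record twice" idea (the three 8-byte fields,
# the Unix timestamp and CRC32 are assumptions standing in for the real layout).
import struct
import time
import zlib

COPY_FMT = '<3qdI'                     # Field1..3, Timestamp, Checksum
COPY_SIZE = struct.calcsize(COPY_FMT)  # 36 bytes per copy

def pack_copy(field1: int, field2: int, field3: int) -> bytes:
    body = struct.pack('<3qd', field1, field2, field3, time.time())
    return body + struct.pack('<I', zlib.crc32(body))

def write_record(f, field1: int, field2: int, field3: int) -> None:
    copy = pack_copy(field1, field2, field3)
    f.write(copy)   # first set of fields
    f.write(copy)   # identical second set, used as the fallback

def read_record(f):
    raw = f.read(2 * COPY_SIZE)
    for copy in (raw[:COPY_SIZE], raw[COPY_SIZE:]):
        if len(copy) < COPY_SIZE:
            continue
        f1, f2, f3, ts, crc = struct.unpack(COPY_FMT, copy)
        if zlib.crc32(copy[:-4]) == crc:       # checksum matches, trust it
            return f1, f2, f3, ts
    raise IOError('both copies of the record are corrupted')
```

One caveat: if both copies fall into the same disk sector, a single torn write could still damage both, so it may be worth placing the copies apart, or using the timestamp as a tiebreaker when both copies validate.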

What do you guys think of this solution? Does it avoid data corruption?

BTW, I can't use any kind of database for this kind of storage, nor Transactional NTFS, because of the restrictions on deploying a system that depends on Transactional NTFS.

Any ideas are welcome, thanks!

Eric
  • Use a database rather than reinventing the wheel. You say you can't use a database but that sounds just bogus. Why on earth not? – David Heffernan Apr 17 '12 at 16:51
  • I'm with David, using a database is the more reliable way; check Firebird: [How much time does it take to recover a Firebird database after a power failure?](http://www.firebirdfaq.org/faq43/) – RRUZ Apr 17 '12 at 16:52
  • Well, in surveillance industry software, a database is never used to store video/audio data; it is rather used to store configurations and logs, but not data related to video/audio. Video/audio files must be independent, just like a regular video file (.AVI, .WMV...). A database system also has too much control and overhead for the performance that we need to achieve. If I needed to store regular data I would definitely go with a database. – Eric Apr 17 '12 at 17:10
  • The "log of everything written" is the same as a journaling filesystem: it tells you if there were any incomplete writes. it doesn't mean duplicating everything, in fact, it can simply mean keeping a small second file. but...are you CPU bound? because Reed-Solomon-type algorithms are your best bet if you're worried strictly about I/O and recoverability in case of certain errors – std''OrgnlDave Apr 17 '12 at 17:14
  • Well, the metadata is a perfect candidate for a database. I know several systems that work that way (metadata in a DB, content on a fault-tolerant SAN) – whosrdaddy Apr 17 '12 at 17:15
  • Whoa there! Where do video files come into this? You talked about an index and not about video files. Put the index into a lightweight database and then you have resilience against power outage corruption. Store the video files to disk as plain files. – David Heffernan Apr 17 '12 at 17:18
  • David, that was one of the solutions as well... the video files are never stored in a database anyway. Typically we have several different video files (each video file has 50MB of preallocated space) and a group of video files forms 1 hour of recording; I need to index each and every frame inside a video file. About the database: is Firebird totally resilient against power failure? What happens when the power is cut during a write operation in the database? I have already seen some file corruption in Firebird databases in the past, so I'm just worried about losing everything to file corruption – Eric Apr 17 '12 at 17:22
  • Indeed, metadata is a perfect candidate for a database. I will consider using a database... but let's just imagine that I can't use one. This application is totally I/O bound and the disk usage is really high (there are customers that need to write 800 megabits (100 MB) of video per second), and all of that information must be quickly indexed and stored, using the minimum number of I/O operations possible. – Eric Apr 17 '12 at 17:26
  • I understand your performance concerns, but I would still take the advice of those above and research database systems first, before reinventing the wheel. Consider the costs of those that will have to support your creation long after you move on :) – John Easley Apr 17 '12 at 17:33
  • Any good DB is resilient to power failure. If an index of files isn't an ideal candidate for a db then I don't know what is. – David Heffernan Apr 17 '12 at 17:38
  • David, OK, I will research the use of a database file for some of the indexes I need... but I still have a problem that can't be solved with a database system: I also need to index the data inside the recording file (each video frame), and since I can have, for example, 5,000 frames per second (and this is not too much), I don't think a simple database solution can store up to 5,000 records per second with low processor usage and few I/O operations. So I also need a specialized index structure inside the recording file in order to point to the correct location of the frame – Eric Apr 17 '12 at 17:44
  • inside the recording file, and I already have this index structure designed. So, my question is about avoiding the corruption of this index data inside the recording file: since some portions of the index (inside the recording file) will be constantly updated, I need to avoid corrupting that data on a power failure. So, do you think that if I duplicate the data I can achieve some sort of protection? – Eric Apr 17 '12 at 17:46
  • Another thing: using a database means that I need to deploy the database system to the computers in order to be able to read the database. When our customer needs to send video footage to the police, for example, a video player is included; if I use a database system to read the video files, I need to include the database system in a simple video player application, which is not good, even with an embedded database (which limits me to just 1 connection) – Eric Apr 17 '12 at 17:58
  • 5000 records per second? That's a trivial requirement. Anyway, you want to do this yourself and I've made my point, so I guess we agree to disagree. – David Heffernan Apr 17 '12 at 20:01
  • There are embedded database systems you can use that won't require database deployment, such as DBISAM. – John Easley Apr 17 '12 at 20:02
  • FWIW, I use DBISAM to index data held in external files. In the event that it all goes wrong, I can just wipe the database files and the app will recreate the index again. – mj2008 Apr 18 '12 at 08:26

3 Answers

2

Ignoring the part of your question around not being able to use a database :)

You might find SQL Server 2012's FileTables of interest. You can store the files outside of the database in a folder but still access them as if they were inside the database. You can use the database to insert new files into that directory, or simply copy a file into the folder. Your database won't get really fat with the video files, nor will they become inaccessible if the DB server software goes down. Your frame indexing could be individual .jpg files (or whatever), and those, too, could be referenced by a FileTable and linked, via a foreign key, to the main video file. The frame index table is then very straightforward.

So you eliminate the DB overhead of writing the file and maintaining the log to see if there was a failure. If the OS can't write the file because of a power failure, then the database won't stand a chance either. You can do directory comparisons and use a robust utility to move the files around that does not remove the source file if any part of the write fails.
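The "robust utility" part could be as simple as copy, verify, and only then delete the source; a rough sketch (the paths and the SHA-256 verification are my assumptions, not part of FileTables):

```python
# Rough "copy, verify, only then remove the source" move; paths and the
# SHA-256 verification are assumptions for illustration.
import hashlib
import os
import shutil

def _digest(path: str) -> str:
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def safe_move(src: str, dst: str) -> None:
    tmp = dst + '.partial'
    shutil.copyfile(src, tmp)               # write to a temporary name first
    if _digest(tmp) != _digest(src):
        os.remove(tmp)
        raise IOError('copy verification failed; source left untouched')
    os.replace(tmp, dst)                    # atomic rename on the same volume
    os.remove(src)                          # the source goes only after success
```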

2

It does not avoid data corruption, since corruption can happen in either set of fields, or both.

I think you are better off not duplicating the "sensitive data" but writing that data in two steps: on the first step, write the data with the "checksum" field empty, and on a second step, update the checksum so that it matches the data. This checksum is then used as a "transaction committed" flag and to ensure data integrity.

When you read the data, you ignore all sets of the index that are not committed, I mean those where the checksum doesn't match.
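A minimal sketch of that two-step write, assuming fixed-size records with the last 4 bytes reserved for a CRC32 (the sizes are placeholders): a zeroed checksum means "not committed", and the checksum is only written, and flushed, after the data itself is safely on disk.

```python
# Sketch of "write the data first, commit it by writing the checksum last".
# Record size and layout (last 4 bytes = CRC32) are assumptions.
import os
import struct
import zlib

RECORD_SIZE = 64
DATA_SIZE = RECORD_SIZE - 4

def write_record(f, slot: int, data: bytes) -> None:
    data = data.ljust(DATA_SIZE, b'\x00')[:DATA_SIZE]
    base = slot * RECORD_SIZE

    # Step 1: write the data with the checksum slot zeroed and flush it.
    f.seek(base)
    f.write(data + b'\x00\x00\x00\x00')
    f.flush()
    os.fsync(f.fileno())

    # Step 2: "commit" by writing the matching checksum and flushing again.
    f.seek(base + DATA_SIZE)
    f.write(struct.pack('<I', zlib.crc32(data)))
    f.flush()
    os.fsync(f.fileno())

def read_record(f, slot: int):
    f.seek(slot * RECORD_SIZE)
    raw = f.read(RECORD_SIZE)
    if len(raw) < RECORD_SIZE:
        return None
    data, crc = raw[:DATA_SIZE], struct.unpack('<I', raw[DATA_SIZE:])[0]
    return data if zlib.crc32(data) == crc else None   # None = not committed
```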

Then do a lot of testing and fine tuning: force data corruption at every step of the process, and also save random data. I personally think the testing needs a lot of work, since failure is random; that's why people recommend you use databases that have been tested for years.

Note that while this adds some protection against some kinds of data corruption, it is not perfect, and you may want to add other layers of protection for your data, including data replication, integrity checks, and external measures such as uninterruptible power supplies, RAID systems, and periodic backups.

There is a great deal of theory around "transactions".

Search for "atomic transaction algorithms" for more detail.

Reconsider using a database, reconsider using a log, and even reconsider using the file system to store your info.

CesarC
  • Thank you for your thoughts. I have already considered using a database, but for the structure that I need, the database overhead, the limitations on the number of records it can add, and the redistribution of a database system with a surveillance application make it difficult. If a database were the best solution for storing this kind of data, all our competitors would use one, and it is exactly the opposite: all competitor software in this area implements its own recording structure, because that is the way to go for the performance we need – Eric Apr 17 '12 at 19:19
  • About your solution, the problem is that new records are not simply appended to the end of the file... They are stored in a fixed structure and they are constantly modified; my issue is when the data of these indices is modified and we have a power failure during the write. So, if I update a field and the power fails during this update, I would lose the data; if I write it twice, I may have a backup of the old state, which for our purposes is all that I need – Eric Apr 17 '12 at 19:22
  • Reading your post again, I just want to make it clear that I know this will not avoid corruption; the data will be corrupted anyway. I cannot avoid corruption, I just need a way to easily recover the corrupted data – Eric Apr 17 '12 at 19:31
0

You can use some sort of transaction logic. Create the index in small chunks, writing each one to a temporary file first. When you finish one chunk (file), check it for integrity and copy it over as the actual index file if it passes the test. At this point you can also distribute a few copies of the verified chunk.
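Something along these lines, as a rough sketch (the trailing CRC32 footer and the '.tmp' naming are assumptions); the key point is that the rename at the end is atomic on the same volume, so readers only ever see a complete old chunk or a complete new one:

```python
# Sketch of "build the chunk in a temp file, verify it, then publish it".
# The trailing CRC32 footer and the '.tmp' naming are assumptions.
import os
import struct
import zlib

def publish_chunk(chunk_path: str, payload: bytes) -> None:
    tmp = chunk_path + '.tmp'
    with open(tmp, 'wb') as f:
        f.write(payload)
        f.write(struct.pack('<I', zlib.crc32(payload)))   # integrity footer
        f.flush()
        os.fsync(f.fileno())

    # Re-read and verify before making the chunk visible to readers.
    with open(tmp, 'rb') as f:
        data = f.read()
    body, crc = data[:-4], struct.unpack('<I', data[-4:])[0]
    if zlib.crc32(body) != crc:
        os.remove(tmp)
        raise IOError('chunk failed integrity check')

    # Atomic on the same volume: readers see the old file or the new one,
    # never a half-written chunk.
    os.replace(tmp, chunk_path)
```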

perreal