
Imagine you have a filesystem tree:

root/AA/aadata
root/AA/aafile
root/AA/aatext
root/AB/abinput
root/AB/aboutput
root/AC/acinput
...

In total there are around 10 million files, each around 10 KB in size. They essentially form a key-value store, split into folders only to keep lookups fast (the filesystem would choke if I put 5 million files in a single folder).

Now we need to:

  1. archive this tree into a single big file (archiving must be relatively fast but still give a good compression ratio - 7z, for example, is too slow)

  2. seek within the resulting big file very quickly - when I need the content of "root/AB/aboutput", I should be able to read it very quickly.

I won't use Redis because the number of files might grow in the future and they would no longer fit in RAM. On the other hand, I can use SSD-backed servers, so data access will be relatively fast (compared to HDD).

Also, it should not require any exotic filesystem such as squashfs or similar; it should work on an ordinary EXT3, EXT4, or NTFS filesystem.

I also thought about storing the files as simple zlib-compressed strings, remembering the offset of each string, and keeping something like a map of those offsets in RAM. Each time I need a file, I would look up its offset in the map and then read the content from the big file at that offset. But maybe there is something simpler, or already implemented?
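
For what it's worth, a minimal sketch of that idea in Python (the function names, the pickled index, and the "tree.dat"/"tree.idx" file names below are just illustrative assumptions, not an existing tool):

    import os
    import pickle
    import zlib

    def pack_tree(root, data_path, index_path):
        """Append each file under `root` to `data_path` as a zlib-compressed
        blob and record relative path -> (offset, length) in a pickled dict."""
        index = {}
        with open(data_path, "wb") as out:
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    full = os.path.join(dirpath, name)
                    key = os.path.relpath(full, root)      # e.g. "AB/aboutput"
                    with open(full, "rb") as f:
                        blob = zlib.compress(f.read())
                    index[key] = (out.tell(), len(blob))   # position before the write
                    out.write(blob)
        with open(index_path, "wb") as f:
            pickle.dump(index, f)

    def read_file(key, data_path, index):
        """Seek straight to the stored blob and decompress it."""
        offset, length = index[key]
        with open(data_path, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(length))

    # Build once, then keep the index dict in RAM for lookups:
    # pack_tree("root", "tree.dat", "tree.idx")
    # with open("tree.idx", "rb") as f:
    #     index = pickle.load(f)
    # content = read_file("AB/aboutput", "tree.dat", index)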

  • How is this programming related? – Robert Harvey Mar 10 '14 at 17:11
  • @RobertHarvey I'm looking for 1) the best solution for Python, 2) an algorithm/solution to store these files efficiently. In other words, I don't just need a file format to store everything - I also need a way to read the files back. – Spaceman Mar 10 '14 at 17:13
  • How often will the files be changed? Are the files all of different file lengths -- what is the full range of file sizes? What other characteristics can you suggest that will help us figure out how to help you? What is the compression rate average for the files using Winzip or 7z? – ErstwhileIII Mar 10 '14 at 17:15
  • @RobertHarvey: This question is at least about "a software algorithm" and "software tools commonly used by programmers", but touches other fields listed in http://stackoverflow.com/help/on-topic. – Sven Marnach Mar 10 '14 at 17:16
  • @ErstwhileIII this is a read-only 'file'... It's not supposed to be changed, only rebuilt from scratch when needed (probably twice a month). File sizes vary from just a few bytes to - let's say - 100 KB, but the average is around 10 KB. This is mostly text data, not binary. – Spaceman Mar 10 '14 at 17:17
  • Related: http://stackoverflow.com/q/1257415 – Robert Harvey Mar 10 '14 at 17:17
  • Related: http://stackoverflow.com/q/1148122. Both posts validate your current approach, which is to break up your file pile into several folders, and write code to retrieve each file from the correct folder. What prevents you from continuing to do that? – Robert Harvey Mar 10 '14 at 17:18
  • What is the program that accesses these files (web w/python; standalone application, ??) You also have some concern about the amount of storage you are using for all the data? (Storage is often the least cost element of a solution.) – ErstwhileIII Mar 10 '14 at 17:20
  • I guess the question is, why do you need the archive, when you already have a perfectly sensible way to retrieve the data now? Do you need to be able to transport the archive from one machine to another? Is this about space concerns? NTFS allows you to compress folders. – Robert Harvey Mar 10 '14 at 17:22
  • The archive will be created on an SSD machine, but later it will be transferred to a server with an HDD (they are much cheaper). After some tests, I realized that storing more than 100 million files on a plain filesystem is a bad idea: big overhead, low speed, hard to back up, and an increased error rate (it's also hard to validate each file - you have to open and read every one). So a single big file looks better. – Spaceman Mar 11 '14 at 02:20
  • ok guys sorry for the typos and mistakes - 'English is not my native... blabla', and I was in a hurry when I was writing the post =) Also - thanks to @RobertHarvey for changing the topic title, probably it's more precise now – Spaceman Mar 11 '14 at 14:15
  • @ErstwhileIII this is a web app (for reading) and a console worker script for writing (~creating). The storage is not a big problem, terabytes of HDD space are relatively cheap now. – Spaceman Mar 11 '14 at 14:20

1 Answer


Assumptions taken from the information in the comments. You might use the following strategy: use two files, one for the "index" and a second for the actual content. For simplicity, make the second file a set of "blocks" (say, 8196 bytes each). To build the archive, read the files into a programmatic structure that maps each file name (the key) to the block number in the second file where its content begins. Write the file content to the second file (compressing it if storage space is at a premium), and save the index information.

To retrieve, read the index file into memory and store it as a binary tree. If search times are a problem, you might instead hash the keys into a table and handle collisions by simply moving to the next available slot. To fetch a file's content, get the block number (and length) from the index lookup, then read the content from the second file (decompressing it if you compressed it).
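
A rough Python sketch of this two-file layout, assuming the 8196-byte blocks mentioned above and a plain pickled dict standing in for the binary tree or hash table (all names here are illustrative):

    import os
    import pickle
    import zlib

    BLOCK_SIZE = 8196  # block size suggested above; any fixed size works

    def pack(root, content_path, index_path):
        """Start every file's compressed content on a block boundary of the
        content file; the index maps key -> (first block number, length)."""
        index = {}
        block = 0
        with open(content_path, "wb") as out:
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    full = os.path.join(dirpath, name)
                    key = os.path.relpath(full, root)
                    with open(full, "rb") as f:
                        data = zlib.compress(f.read())
                    index[key] = (block, len(data))
                    out.write(data)
                    used = len(data) % BLOCK_SIZE
                    if used:                               # pad to the next block boundary
                        out.write(b"\0" * (BLOCK_SIZE - used))
                    block += -(-len(data) // BLOCK_SIZE)   # ceiling division
        with open(index_path, "wb") as f:
            pickle.dump(index, f)

    def retrieve(key, content_path, index):
        """Jump to the file's first block and decompress its content."""
        block, length = index[key]
        with open(content_path, "rb") as f:
            f.seek(block * BLOCK_SIZE)
            return zlib.decompress(f.read(length))

Padding each entry to a block boundary wastes a little space, but it keeps the index entries small (a block number rather than a full byte offset).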

ErstwhileIII
  • well, what you have described is not really simple - it looks more like a small DB engine than a simple solution... Yes, your idea is nice and I would probably agree with you, but I hope there's another solution around that is more convenient and simpler to work with. – Spaceman Mar 11 '14 at 14:18