We generate about 20,000,000 text files every year; the average size is roughly 250 KB each (35 KB zipped).
We must keep these files in some kind of archive for 10 years. We don't need to search inside the files, but we must be able to find a single text file by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
I'm considering zipping each file and storing it in a SQL Server table with 5-10 searchable (indexed) metadata columns and a varbinary(MAX) column for the zipped file data.
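To make this concrete, here is a rough sketch of the table I have in mind; the column names and types are just placeholders, not our real metadata fields:

```sql
-- Sketch only: column names/types are illustrative.
CREATE TABLE dbo.ArchivedFile
(
    FileId       BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ProductName  NVARCHAR(100)  NOT NULL,
    CreationDate DATE           NOT NULL,
    -- ...plus the remaining 3-8 searchable metadata columns...
    ZippedData   VARBINARY(MAX) NOT NULL  -- zipped file contents, ~35 KB
);

-- Nonclustered indexes to support lookups on the metadata fields.
CREATE INDEX IX_ArchivedFile_ProductName  ON dbo.ArchivedFile (ProductName);
CREATE INDEX IX_ArchivedFile_CreationDate ON dbo.ArchivedFile (CreationDate);
```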
The database will grow huge over the years, to 5-10 TB, so I think we need to partition the data, for example by keeping one database per year.
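Instead of (or in addition to) one database per year, I understand SQL Server also supports native table partitioning, e.g. by year on the creation date (I believe this requires Enterprise edition on older versions). A sketch of what I think that would look like, with example boundary values:

```sql
-- Sketch: yearly range partitioning on CreationDate.
CREATE PARTITION FUNCTION pfByYear (DATE)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2014-01-01', '2015-01-01');

CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);  -- one filegroup per year would be better

-- The archive table would then be created ON psByYear(CreationDate),
-- with CreationDate included in the clustered index key.
```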
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the file data, but it seems this is more suitable for blobs larger than 1 MB?
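For comparison, the FILESTREAM variant would look roughly like this; it requires the database to have a FILESTREAM filegroup and the table to have a ROWGUIDCOL column:

```sql
-- Sketch: FILESTREAM stores each blob as a file in the NTFS file system.
-- Requires a FILESTREAM filegroup on the database.
CREATE TABLE dbo.ArchivedFileFS
(
    FileId     UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE
               DEFAULT NEWSEQUENTIALID(),
    ZippedData VARBINARY(MAX) FILESTREAM NULL
);
```

Since our zipped files average only ~35 KB, my reading of the documented guidance (FILESTREAM is recommended for objects averaging over 1 MB) is that in-row varbinary(MAX) should actually perform better at this size.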
Any other suggestions on how to manage such data volumes?