So, I am trying to create a database that can store thousands of malware binary files, with sizes ranging anywhere from a few KB to 50 MB. I am currently testing with Cassandra using blobs, but with files that big Cassandra isn't handling it well. Does anyone have a good idea for a better database, or a better way to go about using Cassandra? I am relatively new to databases, so please be as detailed as possible. Thank you.
-
But why do you need a database to store the actual content? Why not store the data on a file system, or use something like S3 (by functionality, not as a service), and use the DB only to store metadata? Please add more requirements on how the data will be accessed, what should be stored besides the sample itself, etc. – Alex Ott Jul 12 '18 at 16:45
-
So for our project we are taking a large data set of malware and we want to load it into the database. The malware is split up into 9 different categories, and we want to retrieve the files through the database. From what you are saying, it would make more sense to store them in a file system and use the database as a reference to each of the files. – Justin Jul 12 '18 at 17:30
-
Not necessarily a file system, but maybe alternatives to S3, like those discussed here: https://news.ycombinator.com/item?id=16627370 – Alex Ott Jul 12 '18 at 18:25
-
Hi Justin, this question is off-topic at Stack Overflow as it is asking for software/technology/tool recommendations. Please consider asking on [Software Recs](https://softwarerecs.stackexchange.com/) instead. – TylerH Jul 12 '18 at 18:44
1 Answer
If you have your heart set on Cassandra, you will want to store the blob files outside of it, as the large file sizes will cause problems with your compactions and repairs. Ideally you would store the blobs on a network store somewhere outside Cassandra. That said, apparently Walmart did do it previously.
Cassandra setup:
CREATE TABLE IF NOT EXISTS malware_table (
    malware_hash varchar,
    filepath varchar,
    date_found timestamp,
    object blob,
    -- other columns...
    PRIMARY KEY (malware_hash, filepath)
);
What we're doing here is creating a composite key based on the malware hash, so you can do SELECT * FROM malware_table WHERE malware_hash = ?. If there was a collision, you have two files to look at. Additionally, this lookup will be super fast as it's a key-value lookup. Keep in mind that with Cassandra you can only query by your primary key.
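For the application side, here is a minimal sketch of that key-value lookup using the Python cassandra-driver; the keyspace name, contact point, and example hash are assumptions, not part of the schema above.

from cassandra.cluster import Cluster

# Assumed keyspace "malware" on a local node -- adjust to your cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("malware")

# Key-value style lookup: restricted to the partition key (malware_hash).
rows = session.execute(
    "SELECT malware_hash, filepath, date_found FROM malware_table "
    "WHERE malware_hash = %s",
    ("d41d8cd98f00b204e9800998ecf8427e",),  # hypothetical hash value
)
for row in rows:
    print(row.filepath, row.date_found)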
As it's not likely that you're going to be updating files from the past, you're going to want to run size-tiered compaction for faster lookups in the long run. This will be more expensive on hard-drive space, since you'll need to keep about 50% of your disk space free at any given time.
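If you go that route, switching the table to size-tiered compaction is a one-line change; a hedged example reusing the session from the sketch above (note that SizeTieredCompactionStrategy is also Cassandra's default, so it may already be in effect):

# Set size-tiered compaction on the existing table.
session.execute(
    "ALTER TABLE malware_table "
    "WITH compaction = {'class': 'SizeTieredCompactionStrategy'}"
)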
Alternative solution:
I would probably store this in S3/GCS or some other network store. Create a folder named after each file's hash and store the file(s) inside that folder. Use the API to determine whether a file is there. If this is something being hit thousands of times a second, you would want to create a caching layer in front of it to reduce lookup times. The cost of an object store is going to be VASTLY cheaper than a Cassandra cluster and will likely scale better.
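As a rough illustration of that layout, the sketch below uses boto3 against S3 with one key prefix per SHA-256 hash; the bucket name and helper functions are hypothetical, not anything the object store itself requires.

import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "malware-samples"  # hypothetical bucket name

def store_sample(path):
    # Key is "<sha256>/<filename>", mirroring the folder-per-hash idea.
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    key = "{}/{}".format(digest, path.rsplit("/", 1)[-1])
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return key

def sample_exists(key):
    # head_object checks existence without downloading the blob.
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False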

Free? You might be able to leverage the Dropbox API. Alternatively you could create a mock interface in front of your local storage or network storage that is similar to the S3/GCS APIs, so when you're ready to charge money it's an easy migration. – Highstead Jul 12 '18 at 19:43
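A minimal sketch of that mock-interface idea, assuming a put/get/exists surface loosely modelled on S3 semantics; the class and method names are made up for illustration, and an S3/GCS-backed implementation with the same methods could be dropped in later.

import os

class LocalObjectStore:
    """Filesystem-backed stand-in that can later be swapped for an S3/GCS client."""

    def __init__(self, root):
        self.root = root

    def _path(self, key):
        return os.path.join(self.root, key)

    def put(self, key, data):
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, key):
        with open(self._path(key), "rb") as f:
            return f.read()

    def exists(self, key):
        return os.path.isfile(self._path(key))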